IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
878 stars 490 forks source link

S3 URL Endpoint #4690

Closed sparrowjack63 closed 6 years ago

sparrowjack63 commented 6 years ago

Hi all,

Dataverse seems to use awscli to put data on S3 storage, but unfortunately, there is no way to put a specific S3 endpoint URL in the config file of awscli. The only and normal way to use an other S3 endpoint URL with awscl is to put it as a parameter on every awscli command like this : awscl s3 --endpoint-url=https://mys3compatiblestorage.org ...

Could you please add the opportunity to add the parameter "--endpoint-url=URL" to every awscli command used by dataverse ?

It will be very usefull for every users who have S3 compatible storage (like ceph or scality) and want to use it with dataverse.

Thanks!

pdurbin commented 6 years ago

@sparrowjack63 thanks for this suggestion. @mbamouni was asking related questions at https://groups.google.com/d/msg/dataverse-community/py0UMJV9lDg/A7Lk0YtsCQAJ and this gives me a better idea of a possible fix. Out of curiosity, would you be interested in making a pull request? 😄

sparrowjack63 commented 6 years ago

Thanks, Unfortunately, I'm not a developper, it would take me too much time to learn how to make this change :(

sparrowjack63 commented 6 years ago

In addition, I looked at the google group, and know @mbamouni ;), it's the same issue, so we could test it together with michel.

pdurbin commented 6 years ago

@sparrowjack63 ah, good to know that you and @mbamouni work together.

No worries about the pull request. It looks like the only time "awscli" appears in the git repo is in the docs (rather than the code) at https://github.com/IQSS/dataverse/blob/v4.8.6/doc/sphinx-guides/source/installation/config.rst#set-up-access-configuration-via-command-line-tools . But above that section it is described how to manually set up: https://github.com/IQSS/dataverse/blob/v4.8.6/doc/sphinx-guides/source/installation/config.rst#set-up-access-configuration-via-command-line-tools

I'm confused, though. When setting up manually it seems like one would need the ability to add "endpoint-url" to the "~/.aws/config" file. @sparrowjack63 can you please read through https://github.com/aws/aws-cli/issues/1270 and see if there's anything in there that explains what to do?

sparrowjack63 commented 6 years ago

I have read https://github.com/aws/aws-cli/issues/1270 but unfortunately, there is still no possibility to add s3 endpoint in awscli config file, it's still a pending feature request (for a long time...)

pdurbin commented 6 years ago

@sparrowjack63 ok, do you want to try the Swift API for CEPH? Here are the docs:

sparrowjack63 commented 6 years ago

As a last resort we will try swift, but it would be regrettable to close the door to all s3 compatible storage for dataverse just because of a lack on this endpoint parameter. To help us to check the possibility of integrating the use of s3 endpoint and making a pull request, could you please indicate us in the code where the aws connection is created ?

Thanks,

pdurbin commented 6 years ago

@sparrowjack63 sure, I believe all the S3 code is here: https://github.com/IQSS/dataverse/blob/v4.8.6/src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java

If that doesn't help, please let me know what else would be helpful. Please keep asking questions. Thanks!

ariesyous commented 6 years ago

I'd also appreciate the support for alternative endpoints. I'm trying to specify a config file with endpoints at 3 different URL's (ec2, elb and rds have their own unique endpoitns).

No way to do it short of passing the --endpoint-url flag for now.

pdurbin commented 6 years ago

@aries-strato thanks for commenting. I'm new to AWS but since you mentioned RDS I thought I'd say that it's already in use by Harvard Dataverse. #4709 is about an upgrade to Postgres 9.6 in general but for anyone on RDS, they'll need to upgrade as well. Come to think of it, Harvard Dataverse is using ELB as well. And the instances run on EC2. Can you perhaps give an example of how the config would ideally look from your perspective?

ariesyous commented 6 years ago

Simply adding the field in the .config file would be enough, along with specifying a --no-verify url option as well

[default] region = cloud1 ec2_endpoint = https://cloud.net/api/ec2 elb_endpoint = https://cloud.net/api/elb rds_endpoint = https://cloud.net/api/rds verify_ssl = false

pdurbin commented 6 years ago

@aries-strato thanks. Any news from your end? I'm wondering what the status of this issue is. Are you @sparrowjack63 interested in making a pull request?

poikilotherm commented 6 years ago

Hey @pdurbin,

this really is a show stopper for our internal test deployment... Our centrally offered Swift Object Store is not GA yet and building a SAIO instance just for testing seems like overkill. (And using NFS for shared storage is the root of all evil... Let's try to avoid that in the age of 2018, right?)

Minio is a very lightweight alternative and should be sufficient for testing purposes (maybe even beyond that?) or some Ceph RADOS S3 gateway...

I just found https://github.com/minio/cookbook/blob/master/docs/aws-sdk-for-java-with-minio.md - it doesn't seem to be that hard to get the custom URL support into the Dataverse S3 storage driver class.

Wondering if you would like to get a PR for this? I would suggest using a separate storage driver class to make it more distinguishable and less error prone for existing installations... (even so it introduces some code duplication.)

(After having our testbed up and running, we can get to Payara testing... :wink: )

Cheers, Oliver

pdurbin commented 6 years ago

@poikilotherm yes! We're interested in a pull request. I just invited you to join https://github.com/orgs/IQSS/teams/dataverse-readonly so I can assign this issue to you and dragged it to the new "Community Dev" column on our kanban board: https://waffle.io/IQSS/dataverse

We're happy to answer questions here, on https://groups.google.com/forum/#!forum/dataverse-dev or http://chat.dataverse.org (#dataverse on freenode) or during community calls ( https://dataverse.org/community-calls ). Please just get in touch if there's any walk through you need of the code or docs. Thanks!

I'd feel remiss if I didn't say that we used NFS in production for about a decade and it worked fine. Only within the last year did we (Harvard Dataverse) switch to S3. 😄

poikilotherm commented 6 years ago

Thx! Joined :smile:

What approach would you guys prefer? Separate driver with custom URL prefix or an extension of the existing driver? Or should I just come up with something?

pdurbin commented 6 years ago

@poikilotherm thanks. I'll mention this issue at standup in about an hour but I guess I'm thinking a separate driver would be a good place to start. So rather than S3AccessIO.java it would be called MinioAccessIO.java maybe? I hope I'm understanding what you're building. I think it's a Minio driver.

matthew-a-dunlap commented 6 years ago

I am fairly sure supporting custom endpoints would be adding one line to the S3AccessIO.java where the client builder is initialized, to take in a custom url. That and a new config variable.

matthew-a-dunlap commented 6 years ago

@poikilotherm Feel free to ping me here if you need more info!

poikilotherm commented 6 years ago

@matthew-a-dunlap sure, will do so. Maybe you can give #5068 a look, too? This might be a dependency for this...

poikilotherm commented 6 years ago

Hey @matthew-a-dunlap and @pdurbin,

I think I got this ready for review.

Beware of the monsters in the shadows:

  1. I could only test this with Minio, not with AWS, as I do not have an AWS account.
  2. As I wrote the unit test in JUnit5, this is blocked by #2940. That's why the branch is based on that feature branch, too.
  3. As there is no such things as Arquillian in place right now, there is NO integration test. IMHO this is not really possible right now, as it would need a changed docker container with Minio or something else inside. Arquillian to the rescue... :wink:
  4. As I am great JetBrains IDEA fan, I smuggled two commits into this feature branch that adds some stuff to gitignore and fixed the Maven model. Hope you guys don't mind. If you do, I can split and open another PR.
poikilotherm commented 6 years ago

@sparrowjack63 and @mbamouni, could you test my feature branch with your S3 services and report back? Thank you!!!

pdurbin commented 6 years ago

@poikilotherm thanks! Pull request #5059 is the one to look at, right? (The description still says "do not merge".) I moved this issue over to code review at https://waffle.io/IQSS/dataverse

poikilotherm commented 6 years ago

Sorry, missed the description. Thank for pointing that out, fixed. :smile_cat:

matthew-a-dunlap commented 6 years ago

Hey @poikilotherm , do you mind giving us access to commit on your fork? Mainly so our documentation writer and co. can proof the documentation changes alongside the s3 work: http://guides.dataverse.org/en/latest/developers/version-control.html#adding-commits-to-a-pull-request-from-a-fork

poikilotherm commented 6 years ago

Hi @matthew-a-dunlap you guys should already have access to the pull request branch by default, haven't you? Or are you asking me to enable push access for @IQSS for the whole fork?

matthew-a-dunlap commented 6 years ago

@poikilotherm I was running into issues getting to the point where i could push to the branch, but it may have been a problem on my end

poikilotherm commented 6 years ago

It definitly is enabled... grafik

poikilotherm commented 6 years ago

Hey @sparrowjack63 and @mbamouni,

I would really appreciate if you could test this feature branch and give us some feedback. If this is no longer relevant to you, maybe you can drop a comment anyway? Thanks!!!

poikilotherm commented 6 years ago

Hey @djbrooke, @dlmurphy and @matthew-a-dunlap (and maybe @pdurbin ?),

could you guys please approve my PR #5059 so it might get merged soon? That would be really awesome! If there is anything missing or if anything needs to be changed, just drop me a line.

Thank you! You're awesome building Dataverse! :smile: :+1:

pdurbin commented 6 years ago

@poikilotherm our convention is that while an issue is in Code Review at https://waffle.io/IQSS/dataverse the assignees on the issue indicate who is still doing work on it. So if you're done working on the issue, can I unassign you? It will make the status of the issue more clear at standup in about an hour.

@dlmurphy you're still assigned as well from back when you self-assigned last week. I assume you'll take yourself off when you're done making changes.

dlmurphy commented 6 years ago

I'm all set, moving this to QA.

kcondon commented 6 years ago

The underlying PR is blocked so awaiting clarification on status.

djbrooke commented 6 years ago

Hi @poikilotherm - the PR says it's blocked by #5068, but I don't think this is actually the case. Can you confirm?

poikilotherm commented 6 years ago

Hi @djbrooke, this is true - it's not really blocked, only connected. Sry for the noise, will update the description.

djbrooke commented 6 years ago

Thanks @poikilotherm - I'll move it over to QA.

poikilotherm commented 6 years ago

Hey @kcondon, is there any chance this could be merged anytime soon? Thx!

kcondon commented 6 years ago

@poikilotherm Sorry, I have been a bit backlogged. I'm working on this now.

kcondon commented 6 years ago

@poikilotherm Hi, I've regression tested this against s3 and that part is working. I've tried setting up against the public minio url but was not able to get it working. I know it must be something simple I'm missing but what are the minimum set of options that a minio config needs? Also, should any existing s3 config options be disabled? Thanks.

kcondon commented 6 years ago

@poikilotherm Got it working -typo on id in guide and I did not create the bucket first. Phil updated the doc for those small changes.

poikilotherm commented 6 years ago

Thanks for fixing the docs @pdurbin and thanks for merging @kcondon!

Glad this is in upstream now :tada:

pdurbin commented 6 years ago

@poikilotherm sure. @kcondon and I were talking about improvements we could make to the docs in the future but we didn't want to hold up merging just for docs. Thank you for your contribution!

poikilotherm commented 6 years ago

@pdurbin I just saw this issue is not assigned to a milestone... Should this be added to a 4.9.5 or 4.10 release?

pdurbin commented 6 years ago

@poikilotherm if I knew for sure we were planning to release 4.9.5 I would add it there but I don't what the next release will be or when. A number of 4.x releases are mentioned at https://dataverse.org/goals-roadmap-and-releases