IQSS / dataverse

Open source research data repository software
http://dataverse.org

Dataverse configuration to use AWS S3 storage not working #4223

Closed. gupta-yash closed this issue 6 years ago

gupta-yash commented 6 years ago

Hi everyone,

I'm configuring my local Dataverse instance (Dataverse 4.7.1) to work with Amazon S3 storage for storing my datasets. I followed the standard configuration steps described in the Dataverse Installation Guide: http://guides.dataverse.org/en/4.8/installation/config.html#file-storage-local-filesystem-vs-swift-vs-s3

I've updated the storage driver settings for S3 and provided the AWS Access Key ID and Secret Access Key as described in the AWS configuration step.
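
Roughly, what I set looks like this (key values redacted; this is a sketch based on my reading of the guide, so the exact JVM option name and paths are whatever your version of the guide shows, adjusted for your installation):

    # Switch the storage driver from local filesystem to S3 (GlassFish JVM option)
    /usr/local/glassfish4/bin/asadmin delete-jvm-options "-Ddataverse.files.storage-driver-id=file"
    /usr/local/glassfish4/bin/asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=s3"

    # ~/.aws/credentials (standard AWS credentials file format)
    [default]
    aws_access_key_id = <redacted>
    aws_secret_access_key = <redacted>

    # ~/.aws/config (region shown is just an example)
    [default]
    region = us-east-1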

But after configuring everything (and restarting the GlassFish server), when I upload a dataset through the Dataverse dashboard, it doesn't show up in my S3 bucket. Additionally, when I try to download that dataset back to my machine, I get the following error:

Internal Server Error- An unexpected error was encountered, no more information is available.

To be clear: the file is successfully uploaded to Dataverse (the file shows up in the UI), but it never appears in the S3 bucket.

Can someone point out where I'm going wrong? Thanks in advance.

pdurbin commented 6 years ago

@yashgupta-ais thanks for creating this issue and kicking off this conversation at https://groups.google.com/d/msg/dataverse-community/G3Ssv7HOrns/ptpBWA5qAwAJ

I took a look at the server.log file you attached at https://groups.google.com/d/msg/dataverse-community/G3Ssv7HOrns/BblqbXZwAwAJ. The error that jumped out at me is the following:

java.io.IOException: Failed to open local file file:///usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/0E1LJY/15f392e6c63-e86d21339169

Rather than file:// I believe you should be seeing s3://.

Can you please take a look at the storageidentifier column in the dvobject table? Here's the schema if that's helpful: http://phoenix.dataverse.org/schemaspy/latest/tables/dvobject.html

If my recollection is correct, storageidentifier should show s3:// when you upload to S3.

Can you please paste the output of select id,dtype,storageidentifier from dvobject;?
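
For example, assuming the stock database name dvndb and user dvnapp from the installer (adjust if yours differ):

    # Run the query against the Dataverse PostgreSQL database
    psql -h localhost -U dvnapp dvndb -c "select id,dtype,storageidentifier from dvobject;"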

pdurbin commented 6 years ago

@yashgupta-ais this seems like the biggest problem. The file is never saved to S3 and this error is shown in your server.log file:

[2017-10-20T07:29:58.188+0000] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=29 _ThreadName=http-listener-1(4)] [timeMillis: 1508484598188] [levelValue: 900] [[ Failed to save the file, storage id 15f38af2dda-93bead1d6fe2 (createDataAccessObject: Unsupported storage method s3)]]

pdurbin commented 6 years ago

That error seems to be coming from here: https://github.com/IQSS/dataverse/blob/v4.8.1/src/main/java/edu/harvard/iq/dataverse/dataaccess/DataAccess.java#L105

gupta-yash commented 6 years ago

@pdurbin The query you asked for came back in a pretty unexpected way.

When I executed the query select id,dtype,storageidentifier from dvobject;, it said that the column storageidentifier doesn't exist in the table. (I don't know how that's possible.) [screenshot]

Here's the output of select * from dvobject;: [screenshot]

pdurbin commented 6 years ago

@yashgupta-ais in Dataverse 4.8 the storageidentifier column was moved from the datafile table to the dvobject table. See the upgrade_v4.7.1_to_v4.8.sql upgrade script and the upgrade instructions at https://github.com/IQSS/dataverse/releases/tag/v4.8

S3 isn't supported until Dataverse 4.8 so you'll need to upgrade before it works.
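
If it helps, applying a release's SQL upgrade script usually looks something like this (a sketch, again assuming the stock dvndb database and dvnapp user; adjust for your installation):

    # Apply the 4.7.1 -> 4.8 database upgrade script from the release
    psql -h localhost -U dvnapp dvndb -f upgrade_v4.7.1_to_v4.8.sql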

pdurbin commented 6 years ago

I see that https://help.hmdc.harvard.edu/Ticket/Display.html?id=254472 was closed so I'm going to close this issue as well.

@yashgupta-ais please let us know if you'd like to re-open either issue. We are interested in feedback on the new S3 feature so after you upgrade to 4.8 please let us know what you think!

gupta-yash commented 6 years ago

@pdurbin I'm working on Dataverse 4.8 now, and this time when I configured it for S3 storage, I got the error 'The dataset cannot be created' with a 404 Page Not Found screen:

[screenshot]

Here is the latest GlassFish server log file: server.log

As far as I can tell, it says a TransactionRolledBackLocalException is occurring, caused by: com.amazonaws.AmazonClientException: Cannot load the credentials from the credential profiles file. Please make sure that your credentials file is at the correct location (~/.aws/credentials), and is in valid format.

As per the Dataverse Installation Guide, the AWS credentials can be set up either manually or using awscli. I've tried both ways (with the credentials in the format mentioned in the guide), but I end up with the same error.

Can you please reopen the issue? Thanks.

pdurbin commented 6 years ago

@yashgupta-ais thanks, I'm seeing this:

Caused by: com.amazonaws.AmazonClientException: Cannot load the credentials from the credential profiles file. Please make sure that your credentials file is at the correct location (~/.aws/credentials), and is in valid format.
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.<init>(S3AccessIO.java:73)

...

Caused by: java.lang.IllegalArgumentException: profile file cannot be null
    at com.amazonaws.util.ValidationUtils.assertNotNull(ValidationUtils.java:37)
    at com.amazonaws.auth.profile.ProfilesConfigFile.<init>(ProfilesConfigFile.java:142)
    at com.amazonaws.auth.profile.ProfilesConfigFile.<init>(ProfilesConfigFile.java:133)
    at com.amazonaws.auth.profile.ProfilesConfigFile.<init>(ProfilesConfigFile.java:100)
    at com.amazonaws.auth.profile.ProfileCredentialsProvider.getCredentials(ProfileCredentialsProvider.java:135)
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.<init>(S3AccessIO.java:70)
    ... 125 more

https://github.com/IQSS/dataverse/blob/v4.8/src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java#L70

What user is glassfish running as?

What is the home directory of the user that glassfish is running as?

What is the fully qualified path to your .aws directory?
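
A quick way to check all three (a sketch; the bracketed grep pattern just avoids matching the grep process itself):

    # Which OS user owns the GlassFish java process?
    ps aux | grep '[g]lassfish'

    # Home directory of that user (replace <user> with what the command above shows)
    getent passwd <user> | cut -d: -f6

    # Confirm the credentials file is where the AWS SDK expects it
    ls -la <home-directory>/.aws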

gupta-yash commented 6 years ago

Hi @pdurbin , here are the details you asked for:

User: yash (the GlassFish startup script is executed via sudo)
Home directory of the user that GlassFish is running as: /home/yash/
Fully qualified path to the .aws directory: /home/yash/.aws

ferrys commented 6 years ago

@yashgupta-ais It is possible it has something to do with the way we initialize our S3 client. From https://stackoverflow.com/questions/41796355/aws-error-downloading-object-from-s3-profile-file-cannot-be-null, this "won't work anywhere that you're getting AWS access through an IAM role (ex. Lambda, Docker, EC2 instance, etc)." Is this true in your case?

Since we haven't been able to reproduce it on our end, I suspect it has something to do with either the environment you're running Dataverse in or the way you are accessing AWS.
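
One sanity check you could run (a sketch; it assumes awscli is installed and that yash is the OS user GlassFish runs as) is to confirm that the credentials file resolves for that user at all:

    # If this fails for the user GlassFish runs as, the SDK's
    # ProfileCredentialsProvider will fail in the same way.
    sudo -H -u yash aws sts get-caller-identity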

landreev commented 6 years ago

Could you please clarify the "script executed in sudo mode" part? If that means that you run sudo as a regular user, but glassfish ends up running under root user, then the .aws config directory must be in /root/.aws (or whatever the home directory of the root user is).

I do understand that you probably meant that you have an init script, like /etc/init.d/glassfish, that starts as root, but then switches to yash via sudo... In this case your .aws directory is in the seemingly correct location... Just wanted to verify this.

OK, just to be certain, could you please send the output of ps awux | grep glassfish...

And the output of ls -lat /home/yash/.aws

Thank you.

gupta-yash commented 6 years ago

Hi @ferrys Actually, I've set up my Dataverse (v4.8) inside a Linux VM (Ubuntu 14.10) running on Microsoft Azure. And yes, I'm accessing AWS through an IAM role (by generating the keys for Dataverse and placing them in the credentials file), because I was following your guide (http://guides.dataverse.org/en/latest/installation/config.html#file-storage-local-filesystem-vs-swift-vs-s3) exactly, where it's clearly stated: "If you have created a user in AWS IAM, you can click on that user and generate the keys needed for dataverse. Once you have acquired the keys, they need to be added to credentials."

I was fairly confident there shouldn't be any issue whether I use the keys to access AWS S3 from my local machine or from a VM connected through RDP. If this is the actual cause of my trouble, please guide me on how to proceed with the troubleshooting.

Thanks.

gupta-yash commented 6 years ago

Hi @landreev The point I was trying to make is that GlassFish is started via sudo as a regular user (yash), not under the root user.

Here's the output of ps awux | grep glassfish : output_1.txt

Here's the output of ls -lat /home/yash/.aws : output_2.txt

pdurbin commented 6 years ago

Here's the output of ps awux | grep glassfish : output_1.txt

@yashgupta-ais to me it looks like Glassfish is running as root:

root 15929 0.2 8.0 11444068 2327376 pts/9 Sl Oct24 5:39 /usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/glassfish4/glassfish/modules/glassfish.jar...

It's less secure to run Glassfish as root, but for now, if you just want to get S3 support working, you could try moving your AWS credentials file to the root user's home directory, i.e. to /root/.aws/credentials
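
Something along these lines should do it (a sketch; the paths are based on what you've described so far):

    # Copy the credentials file into root's home and restrict permissions
    sudo mkdir -p /root/.aws
    sudo cp /home/yash/.aws/credentials /root/.aws/credentials
    sudo chmod 600 /root/.aws/credentials

    # Restart GlassFish so the running process picks it up
    sudo /usr/local/glassfish4/bin/asadmin restart-domain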

gupta-yash commented 6 years ago

@pdurbin Thank you for pointing that out! I'm still confused about how GlassFish ended up running as root even though I did everything via sudo as yash and always started it as yash.

Anyway, I followed your suggestion: I placed the credentials file at /root/.aws/credentials, restarted GlassFish, and tried again, but I ran into the 'The dataset cannot be created' issue once more, along with the same old 404 Page Not Found page.

[screenshot]

Here's the updated server log file: server.log

Now, when I checked my S3 bucket, I saw that something got uploaded at the exact time I tried creating the dataset, but that 'something' appeared as a weird file (with an unknown file format) inside a weird directory structure in the S3 console (underlined in red):

[S3 console screenshot]

The dataset didn't get created in Dataverse, but I got something into my bucket... that's strange.

Second, the server log says it's some Solr-related issue, some sort of indexing error, but I don't understand it.

Help, please? Thanks.

pdurbin commented 6 years ago

that 'something' appeared as a weird file (with an unknown file format) inside a weird directory structure in the S3 console (underlined in red)

This is completely normal. When you upload a file to Dataverse, it scrambles the name. If you compare checksums (md5 or similar) of the file you uploaded and the file that's now in S3, they should be the same.
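
For example, something like this (a sketch; the bucket and key are placeholders for whatever you see in the S3 console, and it assumes awscli is configured with the same credentials):

    # Checksum of the original file on your machine
    md5sum myfile.csv

    # Checksum of the object Dataverse wrote to S3, streamed to stdout
    aws s3 cp s3://<your-bucket>/<doi-path>/<scrambled-name> - | md5sum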

The error you're getting is unknown field 'fileChecksumType'. Please make sure you are using the Solr schema that is distributed in the dvinstall.zip file you downloaded. There are directions for putting the Solr schema.xml file into place at http://guides.dataverse.org/en/4.8/installation/prerequisites.html#solr

Overall, this sounds like progress! Great!

gupta-yash commented 6 years ago

@pdurbin Well, then I guess there's no trouble with the 'weird' file names and directory structure.

Regarding the unknown field 'fileChecksumType' error, I'd like to point out that I'm using the schema file (schema.xml) distributed with the official Dataverse installation zip file (dvinstall.zip), since this step is clearly mentioned in your manual (the Solr setup section under Prerequisites), so there was no doubt about it.

Still, the dataset is not being created in Dataverse, even though the data does show up in my S3 bucket. How should I proceed now?

pdurbin commented 6 years ago

@yashgupta-ais can you please paste the output of http://localhost:8983/solr/schema/fields ? fileChecksumType and many other Dataverse-specific fields should appear in this output.
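
For example, something like this (assuming Solr is listening on the default port 8983 on the same machine) should show whether the field is defined:

    curl "http://localhost:8983/solr/schema/fields" | grep -i fileChecksumType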

gupta-yash commented 6 years ago

@pdurbin Here's the output of http://localhost:8983/solr/schema/fields: Solr_output.txt

pdurbin commented 6 years ago

Huh, sure enough fileChecksumType isn't in the output. Weird. At some point let's track down which "dvinstall.zip" file you downloaded. It seems like you've bounced around between 4.7 and 4.8 and maybe 4.8.1. Regardless, the latest version of the Solr schema.xml should work, so can you please stop Solr, update the schema.xml file with the "raw" version from https://github.com/IQSS/dataverse/blob/v4.8.1/conf/solr/4.6.0/schema.xml , start Solr, and then check if fileChecksumType is now in the output at http://localhost:8983/solr/schema/fields ?
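
Roughly, the steps would look like this (a sketch; the Solr paths below assume the stock Solr 4.6.0 layout from the prerequisites guide, so adjust them to wherever your collection1/conf/schema.xml actually lives):

    # Stop Solr (however you normally stop it; the prerequisites guide runs it via start.jar)
    kill $(pgrep -f start.jar)

    # Replace schema.xml with the raw v4.8.1 version
    cd /usr/local/solr-4.6.0/example/solr/collection1/conf
    curl -L -o schema.xml https://raw.githubusercontent.com/IQSS/dataverse/v4.8.1/conf/solr/4.6.0/schema.xml

    # Start Solr again
    cd /usr/local/solr-4.6.0/example && java -jar start.jar &

    # Verify the field is now defined
    curl "http://localhost:8983/solr/schema/fields" | grep -i fileChecksumType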

gupta-yash commented 6 years ago

@pdurbin I performed every step you asked for, and this time the dataset got created in Dataverse AND it also got uploaded to my S3 bucket. So, as far as this issue is concerned, it's an overall SUCCESS!

Now, there's only one thing still troubling me: when I try to download the datafile (that I uploaded into my dataset) back to my machine, it shows an "Internal Server Error" screen:

[screenshot]

When I tried to download it:

[screenshot]

Is there some explanation for this, or am I back in trouble again? Thanks.

pdurbin commented 6 years ago

@yashgupta-ais it sounds like both progress AND trouble! 😄

Can you please upload your "server.log" file? Thanks.

gupta-yash commented 6 years ago

@pdurbin Well, I'm in a 'half-happy, half-sad' situation right now. Here's the most recent server log file: server.log

Thanks.

pdurbin commented 6 years ago

@yashgupta-ais thanks. It says Error processing /api/v1/access/datafile/7 followed by com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied...

Are the AWS credentials correct? Do you want to check and then try creating a new dataset with new files to see if you get the same behavior?

For the benefit of @landreev and others looking at this issue, here's more of the stack trace (Dataverse 4.8):

Error processing /api/v1/access/datafile/7: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EFFCAB1A44361645; S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms=), S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms= javax.servlet.ServletException: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EFFCAB1A44361645; S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms=), S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms=

    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:391)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:381)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:344)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
    at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1682)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:344)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
    at edu.harvard.iq.dataverse.api.ApiBlockingFilter.doFilter(ApiBlockingFilter.java:162)

...

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EFFCAB1A44361645; S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms=), S3 Extended Request ID: uP7xTEegil9Pqw57VoTJSfYkKR0xdcfHFhZe8kEqkHCMM6SISNT4UoJyxARhUK6p+ZRdi8SdWms=

    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4227)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4174)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1382)
    at edu.harvard.iq.dataverse.dataaccess.S3AccessIO.open(S3AccessIO.java:127)
    at edu.harvard.iq.dataverse.api.DownloadInstanceWriter.writeTo(DownloadInstanceWriter.java:70)
    at edu.harvard.iq.dataverse.api.DownloadInstanceWriter.writeTo(DownloadInstanceWriter.java:40)

ferrys commented 6 years ago

@yashgupta-ais Since the error is happening in isAuxObjectCached, the issue is with the s3.doesObjectExist call to AWS. Since you're uploading a tabular file, can you confirm whether a file with a .orig extension exists in S3?

To be fair, this should return false if the object doesn't exist, but it may be a weird bug.

For reference: Caught an AmazonServiceException in S3AccessIO.isAuxObjectCached: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: AB0E4CFD6154CD83; S3 Extended Request ID: rXWzj7hkeE//JlPFR3ldyc4e4WZ35ehL7KW3aYizzd23YF64sOK7DjCRAt7yZ07sg7+xiNDYCPo=)]]

EDIT: It seems this is intended behavior in the AWS SDK (https://github.com/aws/aws-sdk-java/issues/974), so I believe the issue is with your listing permissions.
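
One way to check that from the command line (a sketch; <your-bucket> and the key are placeholders, and it assumes awscli is using the same credentials as Dataverse):

    # Requires s3:ListBucket on the bucket
    aws s3 ls s3://<your-bucket>/

    # Requires s3:GetObject on the stored object
    aws s3api head-object --bucket <your-bucket> --key <doi-path>/<scrambled-name>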

gupta-yash commented 6 years ago

@pdurbin I rechecked the AWS credentials and everything seemed to be correct and in place.

And when I tried creating a new dataset with a simple .txt document, I still got the same behaviour.

gupta-yash commented 6 years ago

@ferrys As per your instructions, I checked my S3 bucket and found that there is no file with a .orig extension corresponding to the dataset I created, just a file in an unknown format, as always:

[screenshot]

pdurbin commented 6 years ago

And when I tried creating a new dataset with a simple .txt document, I still got the same behaviour.

@ferrys what do you think about this? Problems with a non-tabular file, even.

I'm out of ideas at the moment. @landreev @kcondon what do you think?

djbrooke commented 6 years ago

Hey @yashgupta-ais - we're happy to help you keep troubleshooting this, but I think it may make sense to do a fresh install and set all of this up from the beginning. What do you think about this plan?

rc-sea commented 6 years ago

Hey guys, I wanted to say thanks for the help. We agree that it's probably best to start from scratch; however, at this point we're going to turn our attention to working on the Azure Blob storage driver, since that was the point of the S3 exercise to begin with. I'm sure Yash has learned a lot of things to be aware of ;)

djbrooke commented 6 years ago

Thanks @drizlrc - I'll close this out for now.

pdurbin commented 6 years ago

@drizlrc thanks, we'll keep an eye on #4247 for the Azure Blob storage driver!