BD2KGenomics / dcc-redwood-client

Apache License 2.0
2 stars 1 forks source link

Upload s3 keyerror on big files #18

Closed caaespin closed 7 years ago

caaespin commented 7 years ago

I'm seeing errors when trying to upload big files (>25GB) to redwood. It faults because of some amazon S3 error. Here is a sample stack trace:

2017-05-16 00:48:07,979 INFO main:890 number of metadata files written: 1
2017-05-16 00:48:07,979 INFO main:899 Uploading files.
2017-05-16 00:48:07,979 INFO main:900 NOTE: If it hangs IP may be blocked
2017-05-16 00:51:03,588 INFO main:950 counts    {"bundlesFound": 1}
2017-05-16 00:51:03,589 INFO register_upload:536 registering upload redwood metadata: dcc-metadata-client -m /outputs/registration.tsv -o /outputs/redwoodUploadManifest
2017-05-16 00:51:10,220 INFO perform_upload:552 performing upload: icgc-storage-client upload --manifest /outputs/redwoodUploadManifest/bb2cb198-ea92-596a-a55e-3d5588f81c2a --force
2017-05-16 00:57:22,981 ERROR perform_upload:558 error while uploading files
2017-05-16 00:57:22,981 ERROR writeJarExceptionsToLog:570 ERROR: Data transfer has been interrupted. Some parts are missing. Waiting to retry...
2017-05-16 00:57:22,981 ERROR writeJarExceptionsToLog:570 ERROR: Command error: java.io.IOException: Storage server error: {"timestamp":1494896242910,"status":503,"error":"Service Unavailable","exception":"org.icgc.dcc.storage.server.exception.RetryableException","message":"com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 0ED4BBFED7D8501D), S3 Extended Request ID: z39r9EH1q6ANIqqrlT4FJnQg+M8iiHbboXmncBVKYTNHr+YjO/sFst8b+eboTLAp93RnfX6bXPg=","path":"/upload/92271568-7499-566c-b9e6-eac7e9aeee09"}
2017-05-16 00:57:22,982 ERROR main:958 redwood upload failed 

This was trying to upload from a box in virginia to a bucket in virginia.

caaespin commented 7 years ago

More logs to make it more helpful; spinnaker.log:

2017-05-16 00:44:56,972 DEBUG main:829 options: {'submissionServerUrl': 'http://storage2.ucsc-cgl.org:8460', 'verbose': False, 'skip_upload': False, 'skip_submit': True, 'receiptFile': 'receipt.tsv', 'redwood_registration_file': 'registration.tsv', 'metadataOutDir': '/outputs', 'metadataSchemaFileName': 'schemas/metadata_schema.json', 'test': False, 'force_upload': True, 'inputMetadataSchemaFileName': 'schemas/input_metadata.json'}
2017-05-16 00:44:56,972 DEBUG main:830 args:    ['/dcc/data/manifest.short.tsv']
2017-05-16 00:44:56,973 DEBUG getDataDictFromXls:307 Attempt to read /dcc/data/manifest.short.tsv as xls file
2017-05-16 00:48:07,978 INFO main:884 validated all metadata objects for output
2017-05-16 00:48:07,979 INFO main:890 number of metadata files written: 1
2017-05-16 00:48:07,979 INFO main:899 Uploading files.
2017-05-16 00:48:07,979 INFO main:900 NOTE: If it hangs IP may be blocked
2017-05-16 00:48:07,980 DEBUG main:928 found bundle directory at /outputs/bb2cb198-ea92-596a-a55e-3d5588f81c2a
2017-05-16 00:51:03,588 INFO main:950 counts    {"bundlesFound": 1}
2017-05-16 00:51:03,589 INFO register_upload:536 registering upload redwood metadata: dcc-metadata-client -m /outputs/registration.tsv -o /outputs/redwoodUploadManifest
2017-05-16 00:51:10,220 INFO perform_upload:552 performing upload: icgc-storage-client upload --manifest /outputs/redwoodUploadManifest/bb2cb198-ea92-596a-a55e-3d5588f81c2a --force
2017-05-16 00:57:22,981 ERROR perform_upload:558 error while uploading files
2017-05-16 00:57:22,981 ERROR writeJarExceptionsToLog:570 ERROR: Data transfer has been interrupted. Some parts are missing. Waiting to retry...
2017-05-16 00:57:22,981 ERROR writeJarExceptionsToLog:570 ERROR: Command error: java.io.IOException: Storage server error: {"timestamp":1494896242910,"status":503,"error":"Service Unavailable","exception":"org.icgc.dcc.storage.server.exception.RetryableException","message":"com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 0ED4BBFED7D8501D), S3 Extended Request ID: z39r9EH1q6ANIqqrlT4FJnQg+M8iiHbboXmncBVKYTNHr+YjO/sFst8b+eboTLAp93RnfX6bXPg=","path":"/upload/92271568-7499-566c-b9e6-eac7e9aeee09"}
2017-05-16 00:57:22,982 ERROR main:958 redwood upload failed
benjaminran commented 7 years ago

The corresponding storage client and server logs are attached (with INFO log level, unfortunately). This error apparently occurred because the storage-server tried to access a statestore key that didn't exist. The lookup was retried (exponentially); the key really wasn't in s3.

I expected the s3 key to contain more than just the two components (/upload/9227...) because of S3UploadStateStore.java (the first statestore method called in responding to a /upload/{object-id} request) is UploadStateStore.isCompleted, which calls UploadStateStore.read, which reads from s3 the key returned by UploadStateStore.getUploadStateKey. This may be a bug in the way the statestore key is generated aws sdk is truncating the key to the nonexistent prefix (which the aws s3 cli doesn't do), something is not as I expected here.

I'd recommend proceeding by reproducing the issue by uploading a large (i.e. 25G) file to a redwood instance with the storage-server (and maybe all redwood servers) having log level set to DEBUG. Then look for the debug messages from S3UploadService.java:finalizeUpload. It would help to run the client with DEBUG log level too (see the dcc-ops/redwood wiki for tips on this). Remote debugging the server or client is always an option, but may not add much here.

nosuchkey.zip

benjaminran commented 7 years ago

Note: if getUploadStateKey was indeed used to generate the s3 key (it would be quite a surprise if this isn't the case), then 977... should be an object id

benjaminran commented 7 years ago

https://stackoverflow.com/a/31800918/5673643

caaespin commented 7 years ago

Looked at the files and I suspect we can debug on the files in this folder.

https://github.com/BD2KGenomics/dcc-storage/blob/develop/dcc-storage-client/src/main/java/org/icgc/dcc/storage/client/transport/ParallelPartObjectTransport.java