dcppc / data-stewards

Questions and answers about TOPMed, GTEx, and AGR resources.

Full stacks make recommendations on minimal manifest file content #9

Open mellybelly opened 6 years ago

mellybelly commented 6 years ago

For example: versioning information, schema location and version, and schema documentation or a UML diagram.

AGR provided the attached file as an example: alliance_file_manifest.txt

It would be great to have other examples too.

cmungall commented 6 years ago

Note to consumers of the manifest

This URL has to be prepended to every file name referenced in the manifest: https://s3.amazonaws.com/mod-datadumps/

For example, FB_1.0.4_4.tar.gz resolves to https://s3.amazonaws.com/mod-datadumps/FB_1.0.4_4.tar.gz
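As a minimal sketch of the rule above, a consumer could build each download URL by joining the prefix and the file name. The function name `manifest_url` is hypothetical, and the sketch assumes the manifest lists bare file names with no subdirectories.

```python
from urllib.parse import quote

# Prefix quoted in the comment above.
BASE_URL = "https://s3.amazonaws.com/mod-datadumps/"


def manifest_url(file_name: str, base_url: str = BASE_URL) -> str:
    """Return the download URL for a file name listed in the manifest."""
    # Normalize the trailing slash and percent-encode unusual characters.
    return base_url.rstrip("/") + "/" + quote(file_name)


print(manifest_url("FB_1.0.4_4.tar.gz"))
# prints https://s3.amazonaws.com/mod-datadumps/FB_1.0.4_4.tar.gz
```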

In the future this can be made more explicit by using a standard BDBag distribution or similar.

mikedarcy commented 6 years ago

The file SGD_1.0.4_1.tar.gz, as listed in the example manifest, does not exist at https://s3.amazonaws.com/mod-datadumps/SGD_1.0.4_1.tar.gz; attempting to retrieve it returns a 404 error.

In addition, there are no MD5 (or SHA-256, etc.) checksums associated with any of the files listed in the manifest, nor are there checksums associated with the S3 objects themselves, so the integrity of these files cannot be verified. The authoritative issuer of these files should provide this information, either in the example manifest or (preferably) associated directly with the S3 objects as MD5 checksums.

For S3 object uploads, the Content-MD5 request header should be set in the file's PUT request. The value of this header is the base64-encoded 128-bit MD5 digest of the file data. For more information, see: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html and https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/.
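To illustrate the header value described above, here is a minimal sketch that computes the base64-encoded MD5 digest using only the Python standard library. The function name `content_md5` is hypothetical, and the in-memory stream stands in for a real data dump opened with `open(path, "rb")`.

```python
import base64
import hashlib
import io
from typing import BinaryIO


def content_md5(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Return the base64-encoded 128-bit MD5 digest of a binary stream."""
    md5 = hashlib.md5()
    # Read in chunks so large data dumps are not loaded into memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
    # Content-MD5 carries the raw 16-byte digest in base64, not the hex string.
    return base64.b64encode(md5.digest()).decode("ascii")


# In-memory stand-in for a real file opened in binary mode.
print(content_md5(io.BytesIO(b"example tarball bytes")))
```

The resulting string is what would go in the Content-MD5 header of the PUT request. Client libraries generally expose a way to set it (boto3's `put_object`, for instance, takes a `ContentMD5` argument), but how the Alliance uploads its files is not stated in this thread.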

christabone commented 6 years ago

@cmungall This is not necessarily correct: the submitted file names do not always resolve to objects in the S3 bucket. A few (~4-5) of the files were renamed for consistency and clarity before submission (the content of the files is the same). This will be fixed with the new data submission system for the next release.

@mikedarcy Including checksums is an excellent idea; we will look into this for our next release.

ianfoster commented 6 years ago

Here is the BDBag that we created for the manifest: http://n2t.net/minid:b94x3r.

(This is an earlier BDBag, now marked as obsolete because the contents of one of the files referenced in the manifest were updated: http://n2t.net/minid:b9j69h. This demonstrates the importance of using BDBags to capture the specific versions of the files that we are working with.)
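The versioning point can be sketched without any BDBag tooling: if a digest is recorded for each file when the manifest is issued, a later change to a file's contents shows up as a mismatch. This is an illustration of the idea only, not the bdbag API. The names `sha256_hex` and `changed_files`, and the sample file name and contents, are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def changed_files(recorded: Dict[str, str], current: Dict[str, bytes]) -> List[str]:
    """Return names that are missing or whose content no longer matches."""
    return sorted(
        name
        for name, digest in recorded.items()
        if name not in current or sha256_hex(current[name]) != digest
    )


# Digest recorded when the (hypothetical) manifest was issued.
recorded = {"example.tar.gz": sha256_hex(b"release contents")}

print(changed_files(recorded, {"example.tar.gz": b"release contents"}))
# prints []
print(changed_files(recorded, {"example.tar.gz": b"updated contents"}))
# prints ['example.tar.gz']
```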

jmcherry-zz commented 6 years ago

We are working to automate uploads to AWS, followed by a copy to GCP. BDBag is part of that, I believe.

christabone commented 6 years ago

This is correct. The data submission system is nearing completion and we've started discussions regarding the use of BDBags as well. @ianfoster Very nice work; I'll pass on the link to the rest of the Alliance group.