mellybelly opened this issue 6 years ago
Note to consumers of the manifest: this URL has to be prepended to all files referenced: https://s3.amazonaws.com/mod-datadumps/

E.g., for FB_1.0.4_4.tar.gz, the full URL is:
https://s3.amazonaws.com/mod-datadumps/FB_1.0.4_4.tar.gz
In future this can be made more explicit by using a standard BDBag distribution or similar.
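The prepending rule above can be sketched as a small helper. This is only an illustration: the helper name and the assumption that the manifest lists one filename per line (as in the attached alliance_file_manifest.txt) are mine, not part of the manifest specification.

```python
# Sketch: build full download URLs from manifest entries.
# Assumes a plain-text manifest with one filename per line.
BASE_URL = "https://s3.amazonaws.com/mod-datadumps/"

def manifest_urls(manifest_text):
    """Prepend the S3 base URL to each file listed in the manifest."""
    return [BASE_URL + line.strip()
            for line in manifest_text.splitlines()
            if line.strip()]

urls = manifest_urls("FB_1.0.4_4.tar.gz\nSGD_1.0.4_1.tar.gz\n")
```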
The file SGD_1.0.4_1.tar.gz, as listed in the example manifest, does not exist at https://s3.amazonaws.com/mod-datadumps/SGD_1.0.4_1.tar.gz; a 404 error is returned when trying to retrieve it.
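A consumer can catch missing entries like this one before downloading by checking that every manifest URL resolves. The sketch below is hypothetical: the function name is mine, and the status check is injected as a callable so the logic can be exercised without network access (in practice it would issue an HTTP HEAD request against each S3 URL).

```python
# Sketch: flag manifest entries whose URLs do not resolve (e.g. 404).
# `head` is a callable mapping a URL to an HTTP status code; injecting it
# keeps the check testable without real network requests.
BASE_URL = "https://s3.amazonaws.com/mod-datadumps/"

def missing_files(filenames, head):
    """Return the manifest entries whose URLs do not return HTTP 200."""
    return [name for name in filenames
            if head(BASE_URL + name) != 200]

# Example with a stub standing in for real HEAD requests:
statuses = {"FB_1.0.4_4.tar.gz": 200, "SGD_1.0.4_1.tar.gz": 404}
missing = missing_files(statuses, lambda url: statuses[url.rsplit("/", 1)[1]])
```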
In addition, there are no MD5 (or SHA-256, etc.) checksums associated with any of the files listed in the manifest, nor are there checksums associated with the S3 objects themselves, so the data integrity of these files cannot be verified. The authoritative issuer of these files should provide this information, either in the example manifest or, preferably, directly on the S3 objects as MD5 checksums.
For S3 object uploads, the Content-MD5 request header should be set in the PUT request. The value of this header is "the base64-encoded 128-bit MD5 digest of the file data". For more information, see: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html and https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/.
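The Content-MD5 value described above can be computed with the Python standard library. The helper name is illustrative; the encoding (base64 of the raw 16-byte MD5 digest, not of its hex representation) is what the S3 PUT API expects.

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """Base64-encoded 128-bit MD5 digest of the file data, as required
    by the Content-MD5 request header on an S3 object PUT."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# The resulting value is sent as the Content-MD5 header; S3 rejects the
# upload if the request body's digest does not match.
digest = content_md5(b"example file contents")
```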
@cmungall This is not necessarily correct; the files that were submitted do not always resolve to the S3 bucket. A few (~4-5) of the files were renamed for consistency and clarity before submission (the content of the files is the same). This will be fixed with the new data submission system for the next release.
@mikedarcy Including checksums is an excellent idea, we will look into this for our next release.
Here is the BDBag that we created for the manifest: http://n2t.net/minid:b94x3r.
(An earlier BDBag, http://n2t.net/minid:b9j69h, is marked as obsolete because the contents of one of the files referenced in the manifest were updated, which demonstrates the importance of using BDBags to capture the specific versions of the files we are working with.)
We are working to automate the addition of files to AWS, followed by a copy to GCP. BDBag is part of that, I believe.
This is correct. The data submission system is nearing completion and we've started discussions regarding the use of BDBags as well. @ianfoster Very nice work, I'll pass on the link to the rest of the Alliance group.
For example: versioning information, schema location and version, schema documentation/UML diagram, etc.
AGR provided the attached file, alliance_file_manifest.txt, as an example.
It would be great to have other examples too.