bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Add support for landing zone manifest files #873

Open mikkonie opened 4 years ago

mikkonie commented 4 years ago

From the original issue by @holtgrewe:

Further, it would be useful to have support for "MANIFEST.txt" files that list all files expected in the landing zone. Upload tools could then append to this file first and upload the files afterwards. Missing file and .md5 file uploads could thus be detected.

holtgrewe commented 2 years ago

I would suggest something along the lines of the following spec.

What is the addressed problem? We need to ensure that the user's uploaded files are (a) correct, i.e., the uploaded content is identical to the source content, and (b) complete, i.e., there are no missing or extra uploaded files.

What is the current status? We allow for uploading $file.md5 files together with the actual $file. This addresses problem (a) and is very useful for incremental file uploads.

However, we do not address (b) yet.

What does your proposed solution look like? Optionally, users should be able to upload a SODAR_MANIFEST file. This file would contain the MD5 sums and full paths. In this case, all .md5 files become optional. If .md5 files are included, their checksums will still be compared against the corresponding files.

When transferring a landing zone that contains a manifest file, the set of files in the landing zone is compared to the set of files listed in the manifest, and the two sets have to be equal. Further, the MD5 checksums in the manifest file are compared to the corresponding MD5 checksums in iRODS. If everything checks out, the files are transferred and the $file.md5 files are created automatically based on the checksums in the manifest.

The manifest file itself is not included in the target iRODS storage.
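As a rough illustration of the two checks described above, here is a local sketch that runs against a plain directory instead of iRODS. The function name `check_manifest` is hypothetical, and the sketch assumes GNU `md5sum`, a manifest in `md5sum` output format (32 hex digits, two separator characters, then the path) located in the zone root, and that `.md5` sidecar files are not listed in the manifest. None of this is existing SODAR code.

```shell
# Hypothetical helper, not part of SODAR: validate a directory against a manifest.
# Usage: check_manifest ZONE_DIR [MANIFEST_NAME]
check_manifest() (
    # The function body is a subshell, so the cd does not affect the caller.
    cd "$1" || exit 2
    manifest=${2:-SODAR_MANIFEST}
    status=0

    # (b) Completeness: the paths in the manifest must equal the files present.
    # Characters 35 onwards of each line hold the path (32 hex digits + 2 separators).
    expected=$(cut -c35- "$manifest" | LC_ALL=C sort)
    # Assumption: the manifest itself and .md5 sidecars are not listed in it.
    actual=$(find . -type f ! -name "$manifest" ! -name '*.md5' | LC_ALL=C sort)
    if [ "$expected" != "$actual" ]; then
        echo "File set differs from manifest" >&2
        status=1
    fi

    # (a) Correctness: recompute each checksum and compare it to the manifest.
    if ! md5sum --quiet -c "$manifest" >&2; then
        echo "Checksum verification failed" >&2
        status=1
    fi

    exit "$status"
)
```

Calling `check_manifest /path/to/zone` returns 0 only if both checks pass. An extra file fails the completeness check even though `md5sum -c` alone would not notice it, which is the gap (b) in the spec.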

Example of creating a manifest file:

# mkdir -p subdir
# head -c 100 /dev/random >subdir/foo
# head -c 100 /dev/random >subdir/bar
# find . -type f ! -name SODAR_MANIFEST -print0 | sort -z | xargs -0 md5sum >SODAR_MANIFEST
# cat SODAR_MANIFEST
d307d8a9635109d760c9e1dc90a54bad  ./subdir/bar
6a139e890ddcc8c2ae235c71ca772881  ./subdir/foo
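The automatic creation of $file.md5 files from the manifest could look roughly like the following sketch. The function name `write_md5_sidecars` is hypothetical, and the sidecar content (an `md5sum`-style line with the checksum and the file's base name) is an assumption; the spec above does not define the exact format.

```shell
# Hypothetical helper, not part of SODAR: write one .md5 file per manifest entry.
# Usage: write_md5_sidecars ZONE_DIR [MANIFEST_NAME]
write_md5_sidecars() (
    # The function body is a subshell, so the cd does not affect the caller.
    cd "$1" || exit 2
    manifest=${2:-SODAR_MANIFEST}

    # Each manifest line is "<checksum><2 separator chars><path>", as written by md5sum.
    while IFS= read -r line; do
        [ -n "$line" ] || continue    # skip empty lines
        sum=${line%% *}               # everything before the first space
        path=${line#"$sum"??}         # drop the checksum and the two separators
        # Assumed sidecar format: md5sum-style line using the file's base name.
        printf '%s  %s\n' "$sum" "$(basename "$path")" >"$path.md5"
    done <"$manifest"
)
```

With the example manifest above, this would write subdir/bar.md5 and subdir/foo.md5 next to the data files, each of which can be checked with `md5sum -c` from within subdir.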