bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Add checksum-based duplicate file checks for landing zone uploads #1859

Open mikkonie opened 6 months ago

mikkonie commented 6 months ago

Requested by @holtgrewe. A user may accidentally upload a file to a landing zone even though the same file already exists in the project's sample repository, under a different file name and/or in a different collection. This is currently not checked, so duplicates can be uploaded. We should be able to prevent this, at least for large files.
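To illustrate the idea, here is a minimal sketch of a checksum-based lookup. It is not SODAR code: the function names `md5_hex` and `find_duplicates`, the example paths, and the assumption that checksums are available as plain `path -> checksum` dicts are all hypothetical. MD5 is used only as a stand-in for whatever checksum is stored with the files.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of data (stand-in for a stored checksum)."""
    return hashlib.md5(data).hexdigest()


def find_duplicates(zone_checksums, repo_checksums):
    """Map each landing zone path to repository paths with the same checksum.

    Both arguments are hypothetical {path: checksum} dicts. File names and
    collections are ignored on purpose: only the content checksum is compared.
    """
    # Index the repository by checksum so each zone file is a single lookup
    by_checksum = defaultdict(list)
    for path, checksum in repo_checksums.items():
        by_checksum[checksum].append(path)
    return {
        path: sorted(by_checksum[checksum])
        for path, checksum in zone_checksums.items()
        if checksum in by_checksum
    }


# Hypothetical example data: same content, different name and collection
repo = {"/project/sample_data/run1/reads.fastq.gz": md5_hex(b"ACGT")}
zone = {
    "/zone/renamed_reads.fastq.gz": md5_hex(b"ACGT"),
    "/zone/new_sample.bam": md5_hex(b"TTTT"),
}
duplicates = find_duplicates(zone, repo)
```

With the example data above, only `/zone/renamed_reads.fastq.gz` is reported, together with the repository path that has identical content.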

This task requires #1846 to be done first.

Spec (To Be Refined)

Tasks

Nicolai-vKuegelgen commented 3 months ago

In addition to the file_size check, it might be worthwhile to vary the behaviour by file type, e.g. only apply the check to specific file types (like fastq.gz / bam), or only raise WARN rather than DENY for certain file types.

However, in practice the file size distinction will likely already cover most cases where duplicate files are uploaded deliberately rather than in error (like log files), so this addition might not be necessary.
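As a rough sketch of how the size and file type distinction could be combined, assuming a duplicate has already been detected by checksum: the `Action` enum, the `duplicate_action` function, the `MIN_CHECK_SIZE` value and the suffix list are hypothetical and only serve to illustrate the WARN / DENY idea, not an agreed spec.

```python
from enum import Enum


class Action(Enum):
    SKIP = "SKIP"  # Duplicate check does not apply
    WARN = "WARN"  # Allow the upload but notify the user
    DENY = "DENY"  # Reject the upload


# Hypothetical values, chosen only for illustration
MIN_CHECK_SIZE = 1024 ** 2  # Ignore files smaller than 1 MiB
DENY_SUFFIXES = (".fastq.gz", ".bam")


def duplicate_action(file_name, file_size, is_duplicate):
    """Decide what to do with a landing zone file."""
    # Small files (e.g. log files) and non-duplicates are never flagged
    if not is_duplicate or file_size < MIN_CHECK_SIZE:
        return Action.SKIP
    # Large duplicates of the listed types are rejected
    if file_name.lower().endswith(DENY_SUFFIXES):
        return Action.DENY
    # Any other large duplicate only raises a warning
    return Action.WARN
```

For example, a duplicated 5 MiB `reads.fastq.gz` would yield `DENY`, a duplicated 5 MiB `report.pdf` would yield `WARN`, and a duplicated 2 KiB `run.log` would be skipped by the size check alone.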