BU-ISCIII / relecov-tools

Set of helper tools for assembling the different elements of the RELECOV platform (Spanish Network for genomic surveillance of SARS-CoV-2), such as data download, processing, validation and upload to public databases, as well as analysis runs and database storage.
GNU General Public License v3.0

Include a test for file integrity somewhere in the workflow #276

Closed. Shettland closed this issue 1 month ago

Shettland commented 4 months ago

Even though the md5 is checked when a file is downloaded, the file could already have been corrupted from the beginning, before it was transferred. In those cases the md5 is the same before and after transfer, so the file is not recognized as corrupted.

This test might be best implemented for ".gz" files in the download module. A rough sketch:

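import gzip

CHUNK_SIZE = 10_000_000  # read 10 MB of decompressed data at a time

def is_valid_gzip(file_to_test):
    """Return True if the whole file decompresses without errors."""
    with gzip.open(file_to_test, "rb") as f:
        # Read to the end of the stream; corruption raises partway through
        while f.read(CHUNK_SIZE):
            pass
    return True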

This will raise an exception if the file is not gzipped or is corrupted (typically gzip.BadGzipFile, EOFError for a truncated file, or zlib.error).
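If the download module needs a yes/no answer rather than an exception, the same idea could be wrapped as below. This is only a minimal sketch: `gzip_is_intact` and its `chunk_size` parameter are hypothetical names for illustration, not the function that exists in relecov-tools or the one added in #313.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=10_000_000):
    """Hypothetical helper: True if `path` decompresses fully, else False.

    Decompresses the file in chunks so large files are never held in
    memory, and treats any decompression error as "corrupted".
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Discard the data; we only care whether reading succeeds.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (not gzipped, bad CRC) and also
        # a missing or unreadable file; EOFError means the stream was
        # truncated; zlib.error means the compressed data is invalid.
        return False
    return True
```

Because `OSError` is caught, a missing or unreadable file is reported as not intact too, which may or may not be what the download module wants. An empty file appears to read as an empty stream without raising, so it would likely need a separate size check.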

Shettland commented 1 month ago

Addressed in #313