genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

MD5 checksums don't match for `HG002 Illumina 2x150bp` #19

Closed: Faizal-Eeman closed this issue 1 year ago

Faizal-Eeman commented 1 year ago

I downloaded a few FASTQ files listed in sequence.index.AJtrio_Illumina300X_wgs_07292015.HG002 and found that the MD5 checksums listed there don't match those of the downloaded files.

Commands used:

$ wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/Sample_2A1/2A1_CGATGT_L001_R1_001.fastq.gz

$ md5sum 2A1_CGATGT_L001_R1_001.fastq.gz
c2ae5e412fb211974f9a9a46a5392428  2A1_CGATGT_L001_R1_001.fastq.gz

The MD5 checksum listed in the index for the same file from the same library is 48e52acfce7548bddad2b3f89e8e0348: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/d3c9afd4c08d9df5b2a6e94fe0692a11def4fe50/AshkenazimTrio/sequence.index.AJtrio_Illumina300X_wgs_07292015.HG002#L2

Can you please verify this?

Best, Faizal

chunlinxiao commented 1 year ago

Thanks for reporting this, Faizal. We recently ran a metadata collection/analysis over all FASTQs that involved gunzip/gzip, which can produce different md5s for the .gz files (for example, the gz file header differs if gzip -n is not used). However, the uncompressed FASTQ files are unchanged and have identical md5s. The sequence.index files will need to be updated accordingly.
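
For anyone who wants to see this effect locally, here is a minimal sketch. It uses a tiny made-up FASTQ record rather than real GIAB data, and the file names (`demo.fastq`, `default.fastq.gz`, `no_name.fastq.gz`) are placeholders. It compresses the same content with and without `gzip -n`, then compares the MD5 of the compressed bytes with the MD5 of the decompressed stream. It assumes GNU `gzip`, `md5sum`, and `awk` are available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a FASTQ file (not real GIAB data).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/demo.fastq"

# Same content, two gzip headers:
# default (stores original name and mtime) vs. -n (stores neither).
gzip -c "$workdir/demo.fastq" > "$workdir/default.fastq.gz"
gzip -n -c "$workdir/demo.fastq" > "$workdir/no_name.fastq.gz"

# MD5 of the compressed bytes: sensitive to header differences.
gz_a=$(md5sum < "$workdir/default.fastq.gz" | awk '{print $1}')
gz_b=$(md5sum < "$workdir/no_name.fastq.gz" | awk '{print $1}')

# MD5 of the decompressed stream: depends only on the FASTQ content.
raw_a=$(gzip -dc "$workdir/default.fastq.gz" | md5sum | awk '{print $1}')
raw_b=$(gzip -dc "$workdir/no_name.fastq.gz" | md5sum | awk '{print $1}')

echo "compressed:   $gz_a vs $gz_b"
echo "decompressed: $raw_a vs $raw_b"
```

The two compressed checksums would normally differ while the two decompressed checksums agree, so `gzip -dc <file> | md5sum` is a reasonable way to confirm that a downloaded FASTQ has the expected content even when the .gz checksum does not match.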

Faizal-Eeman commented 1 year ago

Thanks for confirming @chunlinxiao! Please let me know when sequence.index files are updated with the correct checksums.

chunlinxiao commented 1 year ago

Hi @Faizal-Eeman, the md5s have been updated: you can follow the link to sequence.index.AJtrio_Illumina300X_wgs_07292015_updated.
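
To re-check a download against the updated index, something like the following might work. It is only a sketch. It assumes the index is tab-separated with a header row, the FASTQ path in column 1 and its MD5 in column 2, which should be confirmed against the real file. The file `demo_R1_001.fastq.gz`, the `example.invalid` path, and `demo.index` are hypothetical stand-ins so that the snippet runs on its own.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins: a placeholder "download" and a two-column index.
workdir=$(mktemp -d)
cd "$workdir"
name="demo_R1_001.fastq.gz"
printf 'placeholder bytes\n' > "$name"
expected=$(md5sum < "$name" | awk '{print $1}')
printf 'FASTQ\tFASTQ_MD5\n%s\t%s\n' \
  "ftp://example.invalid/some/dir/$name" "$expected" > demo.index

# Look up the listed MD5 by matching the basename of column 1 (assumed layout).
listed=$(awk -F'\t' -v f="$name" \
  'NR > 1 { n = split($1, p, "/"); if (p[n] == f) print $2 }' demo.index)

# md5sum -c reads "<md5>  <file>" lines and reports OK or FAILED per file.
printf '%s  %s\n' "$listed" "$name" | md5sum -c -
```

With a real index, the demo setup lines would be dropped and `name` and the index path pointed at the downloaded files.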

thanks

Faizal-Eeman commented 1 year ago

Great, they now match. Thanks a lot!