genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

Strong coverage deviation for 1 of 13 subdirectories of NA12878 Illumina 300x WGS #16

Open Jens-Pe opened 2 years ago

Jens-Pe commented 2 years ago

Dear GIAB team, while counting k-mers in the files indexed at sequence.index.NA12878_Illumina300X_wgs_09252015, I noticed that the total number of 25-mers across all *.fastq.gz files in 140115_D00360_0010_BH894YADXX/ (hereinafter referred to as subdirectory 0010) differs significantly from that of every other subdirectory in NIST_NA12878_HG001_HiSeq_300x/ (005 to 009, 0011 to 0017).

Subdirectory 0010 contains only 4,185,958,248 (N-free) 25-mers, whereas each of the other 12 subdirectories (005 to 009, 0011 to 0017) contains between 56,877,996,538 and 69,240,304,680 25-mers. The difference is also visible in the number of files and in the sum of the file sizes. KMC3 reports the same total k-mer counts per subdirectory.
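For readers who want to reproduce this kind of count, here is a minimal Python sketch of tallying total (not distinct) N-free 25-mers in a gzipped FASTQ file. It is not the tool used in this post (the post cites KMC3 as an independent check); the function names are hypothetical, and the sketch assumes standard four-line FASTQ records with the sequence on the second line of each record.

```python
import gzip


def count_nfree_kmers(seq: str, k: int = 25) -> int:
    """Count the k-mers in one read that contain no 'N' (total, not distinct)."""
    total = 0
    # Any k-mer overlapping an N is excluded, so split the read at every N
    # and count the k-mers inside each N-free segment separately.
    for segment in seq.upper().split("N"):
        if len(segment) >= k:
            total += len(segment) - k + 1
    return total


def count_fastq_kmers(path: str, k: int = 25) -> int:
    """Sum N-free k-mer counts over all reads of a gzipped FASTQ file.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    total = 0
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # the sequence line of each record
                total += count_nfree_kmers(line.strip(), k)
    return total
```

Summing `count_fastq_kmers` over every file in a subdirectory gives a per-subdirectory total comparable to the figures above. A pure-Python loop is slow at this data volume, so a dedicated counter is the practical choice for the full dataset.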

How does the low number of 25-mers in subdirectory 0010 square with the statement in the README file that "The other folders each contain ~20-30x sequencing total (a single flow cell)"?
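To put the gap in terms of the README's coverage figure, the following sketch scales the quoted ~20-30x range by the ratio of k-mer totals. It assumes that total k-mer count grows roughly linearly with sequencing depth (similar read lengths and N content across subdirectories), which the thread does not confirm, so the result is only a rough order-of-magnitude estimate.

```python
# 25-mer totals quoted in the post.
KMERS_0010 = 4_185_958_248
KMERS_OTHER_MIN = 56_877_996_538
KMERS_OTHER_MAX = 69_240_304_680

# Per-flow-cell coverage range quoted from the README (~20-30x).
README_COVERAGE = (20, 30)

# Subdirectory 0010 as a fraction of the largest and smallest other subdirectory.
ratio_low = KMERS_0010 / KMERS_OTHER_MAX
ratio_high = KMERS_0010 / KMERS_OTHER_MIN

# ASSUMPTION: k-mer totals scale linearly with coverage.
implied_low = ratio_low * README_COVERAGE[0]
implied_high = ratio_high * README_COVERAGE[1]

print(f"0010 holds {ratio_low:.1%} to {ratio_high:.1%} of a typical subdirectory's 25-mers")
print(f"implied coverage under the linear assumption: {implied_low:.1f}x to {implied_high:.1f}x")
```

Under that assumption, subdirectory 0010 would correspond to a low single-digit coverage rather than 20-30x.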

Are you aware of this clear deviation for subdirectory 0010? Have you discussed possible causes of this outlier in any of your publications that I may have missed? Can you rule out that this strong deviation negatively affects the data set as a whole?

Thanks in advance, Jens

nate-d-olson commented 2 years ago

Hi Jens, thanks for using our data and bringing this issue to our attention. Based on the demultiplexed index stats, there was a significantly lower yield for the barcodes in subdirectory 0010, which likely accounts for the difference in observed k-mers. To evaluate sequence quality and check whether contamination or a sample swap could account for the low k-mer count, I also mapped one set of FASTQ files from subdirectory 0010 to GRCh38 using bwa mem. The mapping results are consistent with what we would expect (see below). Therefore, I don't believe the observed deviation would negatively affect the whole dataset. Let us know if you have any other questions about the dataset, especially if you find that the low observed k-mer count does negatively impact the whole dataset.

Best! Nate

From samtools flagstat

1741538 + 0 in total (QC-passed reads + QC-failed reads)
1728984 + 0 primary
0 + 0 secondary
12554 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1725353 + 0 mapped (99.07% : N/A)
1712799 + 0 primary mapped (99.06% : N/A)
1728984 + 0 paired in sequencing
864492 + 0 read1
864492 + 0 read2
1661938 + 0 properly paired (96.12% : N/A)
1704050 + 0 with itself and mate mapped
8749 + 0 singletons (0.51% : N/A)
29502 + 0 with mate mapped to a different chr
17966 + 0 with mate mapped to a different chr (mapQ>=5)
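As a sanity check on the figures above, this short Python sketch parses the flagstat text and recomputes the reported percentages from the raw counts. The regular expression and the choice of denominators are assumptions made for this illustration rather than a reference parser for samtools output; the input lines are copied from the post.

```python
import re

FLAGSTAT = """\
1741538 + 0 in total (QC-passed reads + QC-failed reads)
1728984 + 0 primary
12554 + 0 supplementary
1725353 + 0 mapped (99.07% : N/A)
1712799 + 0 primary mapped (99.06% : N/A)
1728984 + 0 paired in sequencing
1661938 + 0 properly paired (96.12% : N/A)
8749 + 0 singletons (0.51% : N/A)
"""

# "<passed> + <failed> <label>", optionally followed by " (<pct>% : <pct or N/A>)".
PATTERN = re.compile(r"^(\d+) \+ (\d+) (.*?)(?: \(([\d.]+)% : \S+\))?$")


def parse_flagstat(text):
    """Return ({label: QC-passed count}, {label: reported percentage})."""
    counts, reported = {}, {}
    for line in text.strip().splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue
        passed, _failed, label, pct = match.groups()
        counts[label] = int(passed)
        if pct is not None:
            reported[label] = float(pct)
    return counts, reported


# ASSUMPTION: the denominator used for each percentage.
DENOMINATORS = {
    "mapped": "in total (QC-passed reads + QC-failed reads)",
    "primary mapped": "primary",
    "properly paired": "paired in sequencing",
    "singletons": "paired in sequencing",
}

counts, reported = parse_flagstat(FLAGSTAT)
recomputed = {
    label: round(100 * counts[label] / counts[denom], 2)
    for label, denom in DENOMINATORS.items()
}
for label, value in recomputed.items():
    print(f"{label}: recomputed {value:.2f}%, reported {reported[label]:.2f}%")
```

The recomputed values agree with the reported percentages, and the counts are internally consistent: primary plus supplementary reads equal the total, and read1 plus read2 equal the reads paired in sequencing.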