bcgsc / ntCard

Estimating k-mer coverage histogram of genomics data
http://www.bcgsc.ca/platform/bioinfo/software/ntcard
MIT License
Jagged kmer coverage profiles with gzipped FASTA #46

Open warrenlr opened 3 years ago

warrenlr commented 3 years ago

We discovered inconsistencies in k-mer histograms between uncompressed and gzip-compressed FASTA input files* on two experimental ONT datasets. In independent runs testing different k values (16, 18, 20, 22, 25), two gzipped ONT FASTA read files (NA19240 [PRJEB29523] and NA12878 [SRR10965087]) yielded jagged, uninterpretable k-mer profiles. The problem is exacerbated at higher k values. Issue observed with ntCard v1.1.1, v1.2.1 and v1.2.2.

NA12878 ONT FASTA: [figure: HG12878_FASTAlog10]

NA12878 ONT FASTA GZIPPED: [figure: HG12878_GZFASTA_log10]

====

NA19240 ONT FASTA: [figure: NA19240log10FASTAuncompressed]

NA19240 ONT FASTA GZIPPED: [figure: NA19240log10FASTAcompressed]

*We have only observed this with FASTA files, not FASTQ files, and only with experimental nanopore data.

hmohamadi commented 3 years ago

This might be due to how compressed multi-line/single-line FASTA records are streamed in. Can you give this a try with ntCard v1.1.1?
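
To illustrate the kind of failure this hypothesis describes, here is a minimal Python sketch. It is not ntCard's code or algorithm: ntCard estimates the histogram, whereas this counts k-mers exactly and does not canonicalize them, so it is only practical on tiny inputs. All function names (`read_fasta`, `kmer_histogram`, `kmer_histogram_per_line`, `wrap`) and the toy sequence are hypothetical. The sketch shows that a reader which joins wrapped lines per record gives the same histogram for single-line, multi-line and gzipped input, while a reader that treats each line (or buffer chunk) independently loses the k-mers spanning the breaks.

```python
import gzip
import io
from collections import Counter


def read_fasta(handle):
    """Yield one full sequence per record, joining wrapped sequence lines."""
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        else:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def kmer_histogram(seqs, k):
    """Exact histogram: multiplicity -> number of distinct k-mers."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return Counter(counts.values())


def kmer_histogram_per_line(handle, k):
    """Faulty variant: each sequence line is counted on its own,
    so k-mers that span a line break are silently dropped."""
    lines = (l.strip().upper() for l in handle if not l.startswith(">"))
    return kmer_histogram((l for l in lines if l), k)


def wrap(seq, width):
    """Wrap a sequence to fixed-width lines, as in multi-line FASTA."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Toy record (hypothetical data), written three ways.
seq = "ACGTACGTTGCA"
single_line = f">read1\n{seq}\n"
multi_line = f">read1\n{wrap(seq, 4)}\n"
gzipped = gzip.compress(multi_line.encode())

k = 3
hist_single = kmer_histogram(read_fasta(io.StringIO(single_line)), k)
hist_multi = kmer_histogram(read_fasta(io.StringIO(multi_line)), k)
with gzip.open(io.BytesIO(gzipped), "rt") as fh:
    hist_gz = kmer_histogram(read_fasta(fh), k)
hist_broken = kmer_histogram_per_line(io.StringIO(multi_line), k)

print("single-line :", dict(hist_single))
print("multi-line  :", dict(hist_multi))
print("gzipped     :", dict(hist_gz))
print("per-line bug:", dict(hist_broken))
```

If a reader behaved like `kmer_histogram_per_line` only on its gzip path, the compressed and uncompressed runs would diverge, and the loss would grow with k because more k-mers straddle each break. Whether this is what actually happens in ntCard is not established in this thread.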

warrenlr commented 3 years ago

yes, "Issue observed with ntcard v1.1.1, v1.2.1 and v1.2.2"

hmohamadi commented 3 years ago

thanks. will investigate this.