Open rderelle opened 1 year ago
Here is the coverage histogram
Thanks for the testing and files. The cov.txt file is actually fine; the coverage/error column is just there to assist with the plotting.
It looks like the default mixture model just doesn't fit this data that well. I'll have a play around with some other types of mixture for the second component to see if we can offer an alternative in these cases.
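To make the "mixture model" idea concrete, here is a minimal sketch of fitting a two-component mixture to a k-mer count histogram and reading a cutoff off the fit. It assumes, purely for illustration, that both components are Poisson (one for sequencing errors, one for true coverage). That choice, the function names and the starting values are assumptions for this example, not a description of what ska actually implements.

```python
import math

def poisson_pmf(k, lam):
    # Computed in log space so that large counts do not overflow.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def fit_two_poisson(hist, lam_err=1.0, lam_cov=20.0, w_err=0.5, n_iter=200):
    """EM fit of an error + coverage Poisson mixture (illustrative assumption).

    hist maps a k-mer count to the number of distinct k-mers seen that often.
    The starting values are arbitrary guesses, not ska defaults.
    """
    for _ in range(n_iter):
        n_err = n_cov = s_err = s_cov = 0.0
        for k, n in hist.items():
            p_err = w_err * poisson_pmf(k, lam_err)
            p_cov = (1.0 - w_err) * poisson_pmf(k, lam_cov)
            total = p_err + p_cov
            if total == 0.0:
                continue
            r = p_err / total            # responsibility of the error component
            n_err += n * r
            s_err += n * r * k
            n_cov += n * (1.0 - r)
            s_cov += n * (1.0 - r) * k
        # M-step: update mixing weight and the two Poisson means.
        w_err = n_err / (n_err + n_cov)
        lam_err = s_err / n_err
        lam_cov = s_cov / n_cov
    return w_err, lam_err, lam_cov

def mixture_cutoff(w_err, lam_err, lam_cov, max_count=1000):
    # Smallest count at which the coverage component becomes more likely.
    for k in range(1, max_count):
        if (1.0 - w_err) * poisson_pmf(k, lam_cov) > w_err * poisson_pmf(k, lam_err):
            return k
    return max_count
```

If the real error tail is heavier than a Poisson allows, a fit like this can push the crossing point too high, which is one reason to try a different distribution for one of the components.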
I'll also think about adding an auto option rather than running ska cov separately; that might be easier after fixing #45. Of course you do end up counting k-mers twice this way if you run on every read set, but it's still only about 60s per sample.
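As a rough illustration of what an "auto" mode involves, the sketch below counts k-mers once, derives a cutoff from the resulting histogram, and then filters the same counts. It is a toy in-memory version under stated assumptions: the function names are hypothetical, `choose_cutoff` stands in for whatever cutoff estimator is used, and reverse complements are ignored. It is not how ska is implemented.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in an iterable of read sequences (single pass)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_in_one_pass(reads, k, choose_cutoff):
    """Count once, estimate the cutoff from the histogram, then filter.

    choose_cutoff is a hypothetical callable taking a histogram
    (count -> number of distinct k-mers) and returning a minimum count.
    """
    counts = count_kmers(reads, k)
    hist = Counter(counts.values())
    cutoff = choose_cutoff(hist)
    kept = {kmer for kmer, n in counts.items() if n >= cutoff}
    return cutoff, kept
```

The trade-off is memory: low-count k-mers have to be kept until the cutoff is known, whereas a fixed cutoff chosen in advance lets them be discarded earlier.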
Hi John,
Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov). I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:
Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high threshold is related to the "Error" message below obtained for low kmer sizes during the model fitting.
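For reference, the "by eye" estimate above amounts to picking the first valley of the histogram, between the error peak at low counts and the coverage peak. A minimal sketch of that heuristic follows; the function name and the example numbers are made up for illustration and are not taken from the attached files.

```python
def first_valley_cutoff(hist):
    """Return the count at the first local minimum of a k-mer count histogram.

    hist maps a k-mer count to the number of distinct k-mers seen that often.
    Returns None if the histogram has no interior local minimum.
    """
    counts = sorted(hist)
    for prev, cur, nxt in zip(counts, counts[1:], counts[2:]):
        if hist[cur] <= hist[prev] and hist[cur] < hist[nxt]:
            return cur
    return None

# Made-up histogram: error k-mers dominate low counts, coverage rises after 5.
example = {1: 1000, 2: 400, 3: 120, 4: 40, 5: 35, 6: 50, 7: 90, 8: 150}
print(first_valley_cutoff(example))  # → 5
```

On a noisy histogram this would need some smoothing first, since any small dip counts as a valley.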
This is the beginning of the file 'cov.txt':
Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeats the purpose (twice the runtime).
Thanks and best, Romain