bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0
56 stars 4 forks source link

Feature requests: different mixture model; combined ska build & cov #50

Open rderelle opened 11 months ago

rderelle commented 11 months ago

Hi John,

Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov). I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:

./ska_0.3.1 cov reads_kraken/ERR8158003_1.fastq.gz reads_kraken/ERR8158003_2.fastq.gz -k 31 > cov.txt

Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high treeshold is related to the "Error" message below obtained for low kmer sizes during the model fittng.

This is the beginning of the file 'cov.txt':

Count K_mers Mixture_density Component 1 25838523 3.2408096206707826e-1 Error 2 1993278 1.620404810335391e-1 Error 3 129884 5.4013493677846365e-2 Error 4 97141 1.3503373419461602e-2 Error 5 28053 2.7006746838923196e-3 Error 6 35628 4.5011244731538646e-4 Error 7 24702 6.430177818791246e-5 Error 8 35486 8.03772227348904e-6 Error 9 30571 8.93080252609934e-7 Error 10 39312 8.930802526126602e-8 Error 11 40110 8.118911389065957e-9 Error 12 44332 6.765759585478362e-10 Error 13 43658 5.20443537193471e-11 Error 14 49357 3.7176916177683685e-12 Error 15 47325 2.4891833425171915e-13 Error 16 51653 2.009020801325787e-14 Error 17 52514 1.921693705685228e-14 Coverage 18 53472 6.883935794424811e-14 Coverage 19 52873 2.4488922538072723e-13 Coverage 20 53588 8.282019181080119e-13 Coverage 21 54322 2.667583813189892e-12 Coverage 22 54876 8.201562459958256e-12 Coverage 23 55197 2.411959247393554e-11 Coverage 24 54703 6.797667675764525e-11 Coverage 25 54701 1.8391668285808207e-10 Coverage 26 54189 4.784636868075234e-10 Coverage 27 52711 1.1986335327845915e-9 Coverage 28 52430 2.895540187815414e-9 Coverage 29 51131 6.753560645868026e-9 Coverage 30 51195 1.522694413626287e-8 Coverage 31 49749 3.322402658582442e-8 Coverage 32 48342 7.02268990941827e-8 Coverage 33 48280 1.4394306882794722e-7 Coverage 34 47080 2.863604561107574e-7 Coverage 35 46010 5.534089852758182e-7 Coverage 36 45616 1.039788261971188e-6 Coverage 37 45274 1.9008348747272694e-6 Coverage 38 44701 3.383467426815247e-6 Coverage 39 44463 5.868114750185855e-6 Coverage 40 43372 9.922927345843403e-6 Coverage 41 43026 1.637031965870808e-5 Coverage

Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeat the purpose (twice the runtimes).

Thanks and best, Romain

rderelle commented 11 months ago

Here is the coverage histogram coverage_histogram

johnlees commented 11 months ago

Thanks for the testing and files. The cov.txt file is actually fine, the coverage/error column is actually just to assist with the plotting.

It looks like the default mixture model just doesn't fit this data that well. I'll have a play around with some other types of mixture for the second component to see if we can offer an alternative in these cases.

I'll also think about adding an auto option rather than running ska cov separately, that might be easier after fixing #45. Of course you do end up counting k-mers twice this way if you run on every read set, but it's still only about ~60s per sample.