Closed johnlees closed 12 months ago
This is clearly related to this change: https://github.com/bacpop/ska.rust/pull/20
The older versions used a count min filter for everything, the change then added a bloom filter for singleton k-mers, then the countmin filter is used with everything seen >1 time.
Things to try:
@rderelle it would be good if we could get a test case set up where we can check number of k-mers. Would you be happy to send me the example read files you are using?
Hi John,
here are the fastq files I used (simulated 150bp paired-end reads at 60x coverage from 2 M. tuberculosis genomes): https://drive.google.com/drive/folders/1goHqRI2SWiXIoqgWw_HfdRmD2yjzbHc3?usp=sharing
Thanks!
Using a dictionary, which is exact (branch exact_dict
)
1m39s; 2.4Gb
k-mers=4360072
sample_kmers=[4352822, 4356260]
v0.3.1 Bloom + countmin bloom params: 2^27; 12. cm params 2^24; 3. 47.41s; 1.4Gb k-mers=4348071 sample_kmers=[4324446, 4327670]
This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)
v0.2.2 countmin only cm params 2^27; 3. 1m56s; 1.9Gb (I think also without the only enter dictionary once optimisation. My memory is that time should be more like 1m) k-mers=4361745 sample_kmers=[4353667, 4357112]
Giving 8 FN, 1681 FP
NB: There should be no false negatives, perhaps this is singletons or another change in later versions
@rderelle you might want to give the version on the exact_dict
branch a go, and see if that recovers all the variants you expect in your downstream pipeline
increasing sizes in v0.3.1 bloom params: 2^28; 12. cm params 2^25; 4. 53s; 1.7Gb k-mers=4359667 sample_kmers=[4351774, 4355311]
1188 FPs; 1593 FNs
This looks better, but I'm not sure why there are FNs. I need to think about why this might be
This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)
I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.
I'll give it a try with version v0.3.1. Thanks!
This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)
I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.
Sorry @rderelle, this isn't very clear above (these are mostly notes to self while I debug). On the exact_dist
branch I am counting exactly so no counting errors, therefore with the filters (which have errors) I can quantify FN and FP by comparing the output of ska nk
.
I believe I have found the bug causing an elevated FN rate. Adding in some logging for an FN k-mer:
2023-07-09T20:39:03.902Z WARN [ska::ska_dict] Sample reads/lin_1.11.fq.gz
2023-07-09T20:39:06.540Z WARN [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:07.902Z WARN [ska::ska_dict::count_min_filter] Count 2
2023-07-09T20:39:08.583Z WARN [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:08.583Z WARN [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:08.583Z WARN [ska::ska_dict::count_min_filter] Count 3
2023-07-09T20:39:16.991Z WARN [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:16.991Z WARN [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:16.991Z WARN [ska::ska_dict::count_min_filter] Count 9
2023-07-09T20:39:17.624Z WARN [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:17.624Z WARN [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:17.624Z WARN [ska::ska_dict::count_min_filter] Count 10
So here it looks like the bloom filter is working fine (although note its FP rate is around 1% by design), but the problem is the k-mer hits a false positive in the countmin filter. K-mers are only added when the count is exactly the same as the minimum, to avoid doing multiple lookups once already added. So this skips over this criteria, and the k-mer isn't added.
It might be hard to pick sizes for the CM and bloom filter that give good error rates in every case, but I will try and optimise them a bit here. I don't really want to give these as options for the user as it's too hard to set good values. Perhaps, as in sketchlib, I could add an exact k-mer counter as an option. Probably trading ~2x runtime and memory will be worth it for many.
Performance of combined filters (c.f. exact 99s, 2.4 Gb, 0 FN, 0 FP)
table is a WIP which I'll update as I run more parameter combinations (don't take exact CPU time too seriously, compiler flags and other stuff I was running varied)
cm width | cm height | bloom width | time | memory | FP | FN |
---|---|---|---|---|---|---|
2^24 | 3 | 2^27 | 47.41s | 1.4Gb | 33031 | 45032 |
2^25 | 4 | 2^28 | 53s | 1.7Gb | 1188 | 1593 |
2^28 | 3 | 2^27 | 57s | 2.7Gb | 26 | 36 |
2^27 | 4 | 2^27 | 60s | 2.2Gb | 34 | 47 |
2^24 | 3 | 2^29 | 67s | 1.5Gb | 33011 | 45021 |
2^24 | 8 | 2^27 | 85s | 1.7Gb | 1255 | 1690 |
2^26 | 4 | 2^27 | 68s | 1.7Gb | 213 | 281 |
NA (dict) | NA (dict) | 2^27 | 46s | 1.4Gb | 0 | 0 |
So far: bloom filter width looking like it's well set. Countmin tradeoff could be better, some increase in both width and height probably best.
Also worth trying: bloom filter + exact dict.
edit: Replacing the countmin filter with a dict is looking like the best option here. One final thing to try is a double bloom filter.
Adding a second bloom filter for the two-counts doesn't help, so will go ahead with single bloom + hashmap
@rderelle this should be fixed in v0.3.1, which has just been released
Works great. Thanks!
Reported by @rderelle
I generated 2 simulated set of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4.348.071 kmers. But with the 'old' ska 0.2.1, I obtained 4.361.745 kmers. It seems to me that, between versions 0.2.1 and 0.3.0, perhaps too many kmers are filtered out. I then tried different version of ska:
v0.3.0 4.348.071 kmers v0.2.4 4.348.071 kmers v0.2.3 4.348.071 kmers v0.2.2 4.361.745 kmers v0.2.1 4.361.745 kmers v0.2.0 4.361.745 kmers
For all these versions, my command lines were: ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out ska nk --full-info out.skf > out.txt
the potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".
I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.
** I'm not familiar with bloom filters, so I increased everything:
const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27) const BITS_PER_ENTRY: usize = 14; (previously 12) const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)