Degraded performance of k-mer filter from FASTQ files

johnlees commented 12 months ago

Reported by @rderelle

I generated 2 simulated set of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4.348.071 kmers. But with the 'old' ska 0.2.1, I obtained 4.361.745 kmers. It seems to me that, between versions 0.2.1 and 0.3.0, perhaps too many kmers are filtered out. I then tried different version of ska:

v0.3.0 4.348.071 kmers v0.2.4 4.348.071 kmers v0.2.3 4.348.071 kmers v0.2.2 4.361.745 kmers v0.2.1 4.361.745 kmers v0.2.0 4.361.745 kmers

For all these versions, my command lines were: ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out ska nk --full-info out.skf > out.txt

the potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".

I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.

** I'm not familiar with bloom filters, so I increased everything:

const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27) const BITS_PER_ENTRY: usize = 14; (previously 12) const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)

johnlees commented 12 months ago

This is clearly related to this change: https://github.com/bacpop/ska.rust/pull/20

johnlees commented 12 months ago

The older versions used a count min filter for everything, the change then added a bloom filter for singleton k-mers, then the countmin filter is used with everything seen >1 time.

Things to try:

Replace both with a dictionary (which is exact). Check runtime and memory use.
Increase the bloom filter size, also checking memory use.

@rderelle it would be good if we could get a test case set up where we can check number of k-mers. Would you be happy to send me the example read files you are using?

rderelle commented 12 months ago

Hi John,

here are the fastq files I used (simulated 150bp paired-end reads at 60x coverage from 2 M. tuberculosis genomes): https://drive.google.com/drive/folders/1goHqRI2SWiXIoqgWw_HfdRmD2yjzbHc3?usp=sharing

Thanks!

johnlees commented 12 months ago

Using a dictionary, which is exact (branch exact_dict) 1m39s; 2.4Gb k-mers=4360072 sample_kmers=[4352822, 4356260]

v0.3.1 Bloom + countmin bloom params: 2^27; 12. cm params 2^24; 3. 47.41s; 1.4Gb k-mers=4348071 sample_kmers=[4324446, 4327670]

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

johnlees commented 12 months ago

v0.2.2 countmin only cm params 2^27; 3. 1m56s; 1.9Gb (I think also without the only enter dictionary once optimisation. My memory is that time should be more like 1m) k-mers=4361745 sample_kmers=[4353667, 4357112]

Giving 8 FN, 1681 FP

NB: There should be no false negatives, perhaps this is singletons or another change in later versions

johnlees commented 12 months ago

@rderelle you might want to give the version on the exact_dict branch a go, and see if that recovers all the variants you expect in your downstream pipeline

johnlees commented 12 months ago

increasing sizes in v0.3.1 bloom params: 2^28; 12. cm params 2^25; 4. 53s; 1.7Gb k-mers=4359667 sample_kmers=[4351774, 4355311]

1188 FPs; 1593 FNs

This looks better, but I'm not sure why there are FNs. I need to think about why this might be

rderelle commented 12 months ago

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

I'll give it a try with version v0.3.1. Thanks!

johnlees commented 12 months ago

This gives 33031 FP; 45032 FN (seems high, am I counting right? Total diff is around 10k)

I'm not sure we could know what are the numbers of FP and FN (the 2 genomes were different). But I might have missed some information.

Sorry @rderelle, this isn't very clear above (these are mostly notes to self while I debug). On the exact_dist branch I am counting exactly so no counting errors, therefore with the filters (which have errors) I can quantify FN and FP by comparing the output of ska nk.

johnlees commented 12 months ago

I believe I have found the bug causing an elevated FN rate. Adding in some logging for an FN k-mer:

2023-07-09T20:39:03.902Z WARN  [ska::ska_dict] Sample reads/lin_1.11.fq.gz
2023-07-09T20:39:06.540Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:07.902Z WARN  [ska::ska_dict::count_min_filter] Count 2
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:08.583Z WARN  [ska::ska_dict::count_min_filter] Count 3
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:16.991Z WARN  [ska::ska_dict::count_min_filter] Count 9
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Kmer observed
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Passed bloom
2023-07-09T20:39:17.624Z WARN  [ska::ska_dict::count_min_filter] Count 10

So here it looks like the bloom filter is working fine (although note its FP rate is around 1% by design), but the problem is the k-mer hits a false positive in the countmin filter. K-mers are only added when the count is exactly the same as the minimum, to avoid doing multiple lookups once already added. So this skips over this criteria, and the k-mer isn't added.

It might be hard to pick sizes for the CM and bloom filter that give good error rates in every case, but I will try and optimise them a bit here. I don't really want to give these as options for the user as it's too hard to set good values. Perhaps, as in sketchlib, I could add an exact k-mer counter as an option. Probably trading ~2x runtime and memory will be worth it for many.

johnlees commented 12 months ago

Performance of combined filters (c.f. exact 99s, 2.4 Gb, 0 FN, 0 FP)

table is a WIP which I'll update as I run more parameter combinations (don't take exact CPU time too seriously, compiler flags and other stuff I was running varied)

cm width	cm height	bloom width	time	memory	FP	FN
2^24	3	2^27	47.41s	1.4Gb	33031	45032
2^25	4	2^28	53s	1.7Gb	1188	1593
2^28	3	2^27	57s	2.7Gb	26	36
2^27	4	2^27	60s	2.2Gb	34	47
2^24	3	2^29	67s	1.5Gb	33011	45021
2^24	8	2^27	85s	1.7Gb	1255	1690
2^26	4	2^27	68s	1.7Gb	213	281
NA (dict)	NA (dict)	2^27	46s	1.4Gb	0	0

So far: bloom filter width looking like it's well set. Countmin tradeoff could be better, some increase in both width and height probably best.

Also worth trying: bloom filter + exact dict.

edit: Replacing the countmin filter with a dict is looking like the best option here. One final thing to try is a double bloom filter.

johnlees commented 12 months ago

Adding a second bloom filter for the two-counts doesn't help, so will go ahead with single bloom + hashmap

johnlees commented 12 months ago

@rderelle this should be fixed in v0.3.1, which has just been released

rderelle commented 12 months ago

Works great. Thanks!

bacpop / ska.rust

Degraded performance of k-mer filter from FASTQ files #45