bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0
56 stars 4 forks source link

strange behaviour for min-count (ska build) #56

Closed rderelle closed 8 months ago

rderelle commented 9 months ago

Using ska2 v0.3.1, I analysed a large dataset of 330 Mtb samples at a kmer size of 51, and tested 2 different values of min-count (all samples have been filtered to have at least a coverage of 30x): min-count = 10 (default) -> alignment of 247 SNPs min-count = 6 -> alignment of 201 SNPs

Shouldn't we expect more SNPs using min-count = 6 since we are incorporating more kmers to the analysis? I obtained similar results for the detection of indels with these 2 min-count values using skalo.

Thanks!

johnlees commented 9 months ago

I suppose this could be more ambiguous sites which are then being filtered out. What command(s) did you run? It would also be informative to know what those extra/missing sites are

rderelle commented 9 months ago

My apologies, here are some more information: I first create a SNP alignment using "ska align --min-freq 0.8 --filter no-ambig-or-const", and I obtain alignments with more than 700k positions since split-kmers are often missing in a few low coverage samples. I then trim these alignments using a Python script that only keeps positions with a min-freq 0.8 and at least 2 different character states ('-' and 'N' being ignored, not considered a character state). So my script keeps positions with ambiguous nucleotides other than 'N'.

I'll try to dig into this.

rderelle commented 9 months ago

I tried again with "ska align --min-freq 0.8" (i.e., no filter of ambiguous positions). My results: min-count = 10 (default) -> alignment of 470 positions min-count = 6 -> alignment of 698 positions

So, as you suggested, lowering the min-count does indeed produce more ambiguous positions. But I still do not understand why it produces less SNPs (i.e. positions with 2 different nucleotides). Their number should be equal or higher with n=6, not lower, since at k=51 we wouldn't expect spurious kmers due to sequencing errors to be detected more than 6 times. No?

johnlees commented 9 months ago

I’m not actually sure — I think the best thing would to be investigate the examples that differ to work out why. If you can share an example I can look myself, but not for a few weeks (sorry!)

rderelle commented 9 months ago

The only hypothesis I had was that the size of the bloom filter was linked to the mincount, but I had a quick look at the code and it doe not seem to be the case.

Anyway, I put some skf and SNP alignment files, based on a smaller dataset, on my google drive to be look at in November (I'll continue my analyses with the default mincount value): https://drive.google.com/drive/folders/1BZp0YG16KdChQ9I5IdUfrztOFgcKU3N-?usp=sharing

johnlees commented 8 months ago

@rderelle I am fairly confident that the min count filtering is working as expected, without a clear example of this happening (or at least a site where it is) I can't attribute it to this step. I think it's far more likely to be the filtering criteria applied.

With the 0.3.4 candidate in the linked PR we now have the --no-gap-only-sites option. If I run:

ska weed -o out_gc_k51_n6_weed.skf --no-gap-only-sites --filter no-const -v out_gc_k51_n6__2.2.1.skf

I get a number of k-mers consistent with your expectations: min count 6: k-mers=2994 min count 10: k-mers=1035

Adding the frequency filter at 0.8: min count 6: k-mers=774 min count 10: k-mers=249

These are different to your python script. Perhaps that isn't working as expected?

johnlees commented 8 months ago

Further supporting this being a filtering difference, the numbers prior to filtering are consistent with min count working correctly: min count 6: k-mers=6358146 min count 10: k-mers=4824172

rderelle commented 8 months ago

@johnlees Thanks a lot for looking into this!

With the file you used for testing (out_gc_k51_n6__2.2.1.skf), we would expect the number of SNPs to be around 100 or less since this dataset only contains 53 samples.

I'll test with the version 0.3.4 when it's available for download. I'll also look again into my Python script (it's a fairly simple script but I can't rule out a bug of course).

johnlees commented 8 months ago

With the file you used for testing (out_gc_k51_n6__2.2.1.skf), we would expect the number of SNPs to be around 100 or less since this dataset only contains 53 samples.

If I also filter the ambiguous ones there are 97 sites left. 0.3.4 is out if you want to test.