Closed rderelle closed 8 months ago
I suppose this could be more ambiguous sites which are then being filtered out. What command(s) did you run? It would also be informative to know what those extra/missing sites are
My apologies, here are some more information: I first create a SNP alignment using "ska align --min-freq 0.8 --filter no-ambig-or-const", and I obtain alignments with more than 700k positions since split-kmers are often missing in a few low coverage samples. I then trim these alignments using a Python script that only keeps positions with a min-freq 0.8 and at least 2 different character states ('-' and 'N' being ignored, not considered a character state). So my script keeps positions with ambiguous nucleotides other than 'N'.
I'll try to dig into this.
I tried again with "ska align --min-freq 0.8" (i.e., no filter of ambiguous positions). My results: min-count = 10 (default) -> alignment of 470 positions min-count = 6 -> alignment of 698 positions
So, as you suggested, lowering the min-count does indeed produce more ambiguous positions. But I still do not understand why it produces less SNPs (i.e. positions with 2 different nucleotides). Their number should be equal or higher with n=6, not lower, since at k=51 we wouldn't expect spurious kmers due to sequencing errors to be detected more than 6 times. No?
I’m not actually sure — I think the best thing would to be investigate the examples that differ to work out why. If you can share an example I can look myself, but not for a few weeks (sorry!)
The only hypothesis I had was that the size of the bloom filter was linked to the mincount, but I had a quick look at the code and it doe not seem to be the case.
Anyway, I put some skf and SNP alignment files, based on a smaller dataset, on my google drive to be look at in November (I'll continue my analyses with the default mincount value): https://drive.google.com/drive/folders/1BZp0YG16KdChQ9I5IdUfrztOFgcKU3N-?usp=sharing
@rderelle I am fairly confident that the min count filtering is working as expected, without a clear example of this happening (or at least a site where it is) I can't attribute it to this step. I think it's far more likely to be the filtering criteria applied.
With the 0.3.4 candidate in the linked PR we now have the --no-gap-only-sites
option. If I run:
ska weed -o out_gc_k51_n6_weed.skf --no-gap-only-sites --filter no-const -v out_gc_k51_n6__2.2.1.skf
I get a number of k-mers consistent with your expectations: min count 6: k-mers=2994 min count 10: k-mers=1035
Adding the frequency filter at 0.8: min count 6: k-mers=774 min count 10: k-mers=249
These are different to your python script. Perhaps that isn't working as expected?
Further supporting this being a filtering difference, the numbers prior to filtering are consistent with min count working correctly: min count 6: k-mers=6358146 min count 10: k-mers=4824172
@johnlees Thanks a lot for looking into this!
With the file you used for testing (out_gc_k51_n6__2.2.1.skf), we would expect the number of SNPs to be around 100 or less since this dataset only contains 53 samples.
I'll test with the version 0.3.4 when it's available for download. I'll also look again into my Python script (it's a fairly simple script but I can't rule out a bug of course).
With the file you used for testing (out_gc_k51_n6__2.2.1.skf), we would expect the number of SNPs to be around 100 or less since this dataset only contains 53 samples.
If I also filter the ambiguous ones there are 97 sites left. 0.3.4 is out if you want to test.
Using ska2 v0.3.1, I analysed a large dataset of 330 Mtb samples at a kmer size of 51, and tested 2 different values of min-count (all samples have been filtered to have at least a coverage of 30x): min-count = 10 (default) -> alignment of 247 SNPs min-count = 6 -> alignment of 201 SNPs
Shouldn't we expect more SNPs using min-count = 6 since we are incorporating more kmers to the analysis? I obtained similar results for the detection of indels with these 2 min-count values using skalo.
Thanks!