Closed danrlu closed 10 months ago
Sure, I can add this in the next release
@danrlu added in v0.3.1
I compared the kmer counts in the 2 SKA versions. Each dot is a cultured isolate of Staph aureus and the samples came from a suspected outbreak. The green and yellow lines are the expected genome size, red line is y=x. I think some differences in the exact counts are fine, but in SKA2 the total kmer counts don't quite correlate with genome size anymore 🤔 Any thoughts why? I always use this total kmer counts to evaluate whether the reads cover most of the genome, and whether there are contaminations which has been handy.
ska nk seqs_ska2b.skf > seqs_ska2b_summary.txt
and in output:
ska_version=0.3.2
k=31
k_bits=64
rc=true
k-mers=17253189
samples=46
Thanks very much!
Sorry for the slow reply, I've been away for a bit
This is a fairly simple metric so I wouldn't expect much difference. From your question I gather you are using reads as input? I would guess this is due to differences in filtering.
In ska2 we have --min-count
, --min-qual
and --qual-filter
which will affect which k-mers are included. Defaults are 10, 20 and middle-base respectively.
In ska fastq
there is -c
(default 4) which is the same as --min-count
and also -f
which imposes a per-fastq coverage cutoff (default 2). -q
is the equivalent of --min-qual
but also has default 20, and --qual-filter
operates the same as strict
in ska2. Finally there is -m
, which does not exist in ska2:
Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option. ska2 will report ambiguity for these bases.
So, maybe I could ask you the following (even just on one or two of the outlying samples):
-m
to 0, how do the results change?Sorry I forgot to mention that for the plot above, options were --min-count 4 --min-qual 20 --threads 4 -k 31
, so per your suggestion, the only thing to add is strict
.
Now with --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict
it's looking very consistent with the old SKA. So I just didn't get the parameters right 😎
Also b/c the quality filters were applied during the ska build
step, it also made the distance calculation more consistent with SKA1. Below is plotting pairwise SNP difference:
With --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict
these samples would form a cluster with 20 SNP cutoff:
With --min-count 4 --min-qual 20 --threads 4 -k 31
:
If you inspect the pairwise distance matrix below: if one sample is genetically similar to another sample, I would expect a 3rd sample to have close distance to both of them, given good genomic coverage. So I was quite confused about the result below, I think I just needed the
--qual-filter strict
:
So combining with the earlier discussion https://github.com/bacpop/ska.rust/issues/38 ska build --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict
and ska distance --filter-ambiguous
would give results most similar to SKA1, in case anyone wonders.
Thanks very much for the tips and patience!!
Thanks for checking all of this and the nice plots!
This is making me think I should probably change the default value of qual-filter to strict to match SKA1's behaviour – I had a report by email also noting poor performance with just the middle base quality filter
I like that idea~~ I don't know how you guys decided on SKA1 default settings but it worked straight out of the box and there was nothing I needed to tune (except the cutoff to define what SNP distance/similarity would be a cluster, which I want to change per species anyway). When I try a new tool, whether it works in first try or not usually makes a difference 😆 Thanks again for the faster and more powerful RUST implementation!
There is a really useful QC step with SKA which is to look at the total kmers for each sample. The total kmer numbers should be close to the genome size of the organism to indicate (1) good genome coverage (2) no contamination which will show more kmers, so it's always the first thing we check.
It would be helpful to have that metric for each sample in the current
ska nk
summary output.Thanks for the new and more powerful implementation!