Add kmer count per sample to ska nk output

danrlu commented 1 year ago

There is a really useful QC step with SKA which is to look at the total kmers for each sample. The total kmer numbers should be close to the genome size of the organism to indicate (1) good genome coverage (2) no contamination which will show more kmers, so it's always the first thing we check.

It would be helpful to have that metric for each sample in the current ska nk summary output.

Thanks for the new and more powerful implementation!

johnlees commented 1 year ago

Sure, I can add this in the next release

johnlees commented 12 months ago

@danrlu added in v0.3.1

danrlu commented 11 months ago

I compared the kmer counts in the 2 SKA versions. Each dot is a cultured isolate of Staph aureus and the samples came from a suspected outbreak. The green and yellow lines are the expected genome size, red line is y=x. I think some differences in the exact counts are fine, but in SKA2 the total kmer counts don't quite correlate with genome size anymore 🤔 Any thoughts why? I always use this total kmer counts to evaluate whether the reads cover most of the genome, and whether there are contaminations which has been handy.

ska nk seqs_ska2b.skf > seqs_ska2b_summary.txt and in output: ska_version=0.3.2 k=31 k_bits=64 rc=true k-mers=17253189 samples=46

Thanks very much!

johnlees commented 10 months ago

Sorry for the slow reply, I've been away for a bit

This is a fairly simple metric so I wouldn't expect much difference. From your question I gather you are using reads as input? I would guess this is due to differences in filtering.

In ska2 we have --min-count, --min-qual and --qual-filter which will affect which k-mers are included. Defaults are 10, 20 and middle-base respectively.

In ska fastq there is -c (default 4) which is the same as --min-count and also -f which imposes a per-fastq coverage cutoff (default 2). -q is the equivalent of --min-qual but also has default 20, and --qual-filter operates the same as strict in ska2. Finally there is -m, which does not exist in ska2:

Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option. ska2 will report ambiguity for these bases.

So, maybe I could ask you the following (even just on one or two of the outlying samples):

If you run ska2 with a strict quality filter, how do the results change?
If you run ska2 setting --min-count to 4, how to the results change?
If you run ska1 setting -m to 0, how do the results change?
If you have assemblies of these samples, do the ska1 or ska2 k-mer counts better align with their length?

danrlu commented 10 months ago

Sorry I forgot to mention that for the plot above, options were --min-count 4 --min-qual 20 --threads 4 -k 31, so per your suggestion, the only thing to add is strict. Now with --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict it's looking very consistent with the old SKA. So I just didn't get the parameters right 😎

Also b/c the quality filters were applied during the ska build step, it also made the distance calculation more consistent with SKA1. Below is plotting pairwise SNP difference: With --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict these samples would form a cluster with 20 SNP cutoff:

With --min-count 4 --min-qual 20 --threads 4 -k 31: If you inspect the pairwise distance matrix below: if one sample is genetically similar to another sample, I would expect a 3rd sample to have close distance to both of them, given good genomic coverage. So I was quite confused about the result below, I think I just needed the --qual-filter strict:

So combining with the earlier discussion https://github.com/bacpop/ska.rust/issues/38 ska build --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict and ska distance --filter-ambiguous would give results most similar to SKA1, in case anyone wonders.

Thanks very much for the tips and patience!!

johnlees commented 10 months ago

Thanks for checking all of this and the nice plots!

This is making me think I should probably change the default value of qual-filter to strict to match SKA1's behaviour – I had a report by email also noting poor performance with just the middle base quality filter

danrlu commented 10 months ago

I like that idea~~ I don't know how you guys decided on SKA1 default settings but it worked straight out of the box and there was nothing I needed to tune (except the cutoff to define what SNP distance/similarity would be a cluster, which I want to change per species anyway). When I try a new tool, whether it works in first try or not usually makes a difference 😆 Thanks again for the faster and more powerful RUST implementation!

bacpop / ska.rust

Add kmer count per sample to ska nk output #39