bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
92 stars 20 forks source link

Accessory distance of zero for outlier samples #98

Closed lvclark closed 4 years ago

lvclark commented 4 years ago

I'm working with a group that is doing some benchmarking of microbial GWAS tools, and one of the datasets we're working with is Peptoclostridium difficile, downloaded from PATRIC. I did a preliminary run of PopPUNK and saw some outliers with high core distance and accessory distance of zero, shown below.

lm_azithromycin_distanceDistribution

I ran poppunk_extract_distances.py to inspect things a bit further, and found that three accessions were responsible for these outlier distances. I ran Mash with default settings on these three samples and five other random ones, and found pretty strong clustering. So, probably they are a misidentified species or have some sort of technical issue going on and should be removed from the dataset. In fact two out of the three no longer seem to be on PATRIC, although I can share the FASTA files with you if it would be helpful for debugging.

Regardless of the fact that I'm discarding these samples, why would the accessory distance be zero? It seems very odd.

nickjcroucher commented 4 years ago

If there are very few matching k-mers between divergent samples, then stochasticity in the observed data points may result in the inference of a relatively rapid drop off in k-mer matching with increasing k-mer length (although the absolute numbers may be low) - this is interpreted by PopPUNK as high core divergence (high gradient), and low accessory divergence (because the intercept is close to 1, which is the limit). To avoid this, shorter k-mer lengths can be included in the studied range (to give greater power to detect the actual gradient), or the diversity of the collection can be reduced (as you have done). Hope that's helpful.

johnlees commented 4 years ago

Just to add a bit on how to diagnose this further if you wish: If you add --plot-fit you can see some of the regressions between k-mer length and matches. For those with a = 0 you'd see a bad fit to the model across the current k-mer range. We don't have a way of directly extracting specific plots here (just a random subset is chosen to get an overall picture). If you wanted to look in more detail you can use pp-sketchlib (our mash equivalent) directly with the --jaccard option to see the actual values being fit across the k-mer range for all sample pairs.

As @nickjcroucher says, reducing --min-k or removing samples is the way to fix this

lvclark commented 4 years ago

Thanks! I was using --min-k 14 as any lower than that was giving an error. I'll just go forward with removing those samples.