Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis

RyanCFink commented 7 months ago

Hello,

I'm currently analyzing a substantial dataset comprising 11,743 genomes of different Enterococcus species, including over 20 outgroup genomes from the same family and order. because of the genetic diversity, it's possible that certain genes are unique to individual genomes. In the PopPUNK fit graph (core vs. accessory distances), we're observing genes with zero accessory distances but core distances spanning from 0.23 to 0.6. This pattern raises questions on the interpretation of the analysis, as I would think that even genes present in a single genome would typically show some level of accessory distance. I am using dbscan as a model using the following command: poppunk --fit-model dbscan --K 2 --ref-db <database> --output <database> --threads 16 No error message and I am using poppunk v2.6.0 Here is an example of the scatterplot we obtain for Accessory v. Core distances:

poppunk_db_entirococcus_dbscan

What would be your interpretation of those? could there be a potential issue or a misinterpretation in how the analysis is handling or categorizing these genes? Maybe due to the combination of a large dataset with high genetic diversity?

johnlees commented 7 months ago

The points at zero accessory distance and >0.2 core are likely errors in distance estimation. This overall shape I believe happens when multiple mutations at the same site become common (is that right @nickjcroucher @samhorsfield96?)

I would consider looking at your fits with --plot-fit and consider adjusting your k-mer range. Also consider running the QC options on your data See:

Finally, you could look at the jaccard distances of those points with sketchlib directly

samhorsfield96 commented 7 months ago

Hi both, I believe this shape is due to a high proportion of k-mer mismatches even at the lowest values of k, resulting in distance calculation errors. I would suggest reducing the values of both --min-k (default=13) and --max-k (default=29), and using the --plot-fit function as John suggested to check that the relationship between k and the proportion of matching k-mers is linear.

RyanCFink commented 7 months ago

Thank you for your replies! These are very useful for me to understand what might be going on. The hypothesis that the k-mer mismatches are the source of those weird shapes is supported by the fact that a set of about 20 genomes are phylogenetically quite distant from the rest (Family and Order level). I got the info on the database we are using for the analyses and I am posting them at the end of this comment. The k-mer sizes are spanning the defaults. Do you suggest decreasing them anyway? I am having problems running the --plot-fit command. I have used different arguments, from 5 to 1,137 but I didn't get output or error messages, the database was correctly fit to a model. I have tried also with smaller databases and those also didn't produce any output. Is the extension of the graphic output .png? We are in the process of re-creating the database without those outgroups. I thought there was a way to easily remove genomes from the database, so I've been looking for it, but I couldn't find it (only ways to add or update the database).


Sketch version:         ed862592a54de54d4b556a5e0530096c85f3a08d
Number of samples:      11373
K-mer sizes:            13,17,21,25,29
Sketch size:            9984
Contains random matches:    True
Codon phased seeds:     False

johnlees commented 7 months ago

The k-mer sizes are spanning the defaults. Do you suggest decreasing them anyway?

I would try using --k-step 3 --min-k 17 --max-k 41. But the plots would be useful

I am having problems running the --plot-fit command. I have used different arguments, from 5 to 1,137 but I didn't get output or error messages, the database was correctly fit to a model. I have tried also with smaller databases and those also didn't produce any output. Is the extension of the graphic output .png?

I think they are PDF files if I recall. Can you list the files in your output directory?

We are in the process of re-creating the database without those outgroups. I thought there was a way to easily remove genomes from the database, so I've been looking for it, but I couldn't find it (only ways to add or update the database).

Use poppunk --qc-db --ref-db example_db --remove-samples samples_to_remove.txt, see QC docs.

RyanCFink commented 6 months ago

I would try using --k-step 3 --min-k 17 --max-k 41. But the plots would be useful We are trying the different k-mers steps and thresholds right now. But we're having the same problem with --plot-fit even with other datasets. The following is the list of files that we get when we use it:

We recreated the database without those outgroups and there are stil some zero distances. However the taxon we are analyzing still has a very large diversity and there might still be a good number of k-mers that don't match.

Use poppunk --qc-db --ref-db example_db --remove-samples samples_to_remove.txt, see [QC docs]

I thought this was only to be used at the database creation step, can it be used also to exclude samples from the model fitting step?

johnlees commented 6 months ago

But we're having the same problem with --plot-fit even with other datasets. The following is the list of files that we get when we use it

Can you try with the same --ref-db and --output. Or also have a look in poppunk_db/ as that's the folder I'd expect the plots to be in.

I thought this was only to be used at the database creation step, can it be used also to exclude samples from the model fitting step?

Yes, this will prune them out of the database. If you then run the model fitting step on that database (the --ref-db folder used) it will not use those samples

RyanCFink commented 5 months ago

I have repeated it numerous times now, but the pdf is nowhere to be found, unfortunately

johnlees commented 5 months ago

I have repeated it numerous times now, but the pdf is nowhere to be found, unfortunately

Unfortunately it's hard to provide more help with the plot files, we can't replicate and I'm not sure what the problem might be. My only guess would be something to do with the matplotlib backend. Perhaps another computer/environment/install could shed light on things, or seeing if you get the same issue with the example dataset.

You can also make these plots yourself if you use the jaccard output of sketchlib dist jaccard -- if you log the output this gives the y for the points plotted, the x is the k-mer length.

Going back to your original question, I would also try running the QC mode and setting a max pi/core threshold of 0.22.

bacpop / PopPUNK

Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289