bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

databases: reference only vs all genomes #269

Closed sapuizait closed 7 months ago

sapuizait commented 1 year ago

Hi there

Thank you for your wonderful software. It will be very useful in my line of work to identify the same strains from a given species using kmers! Also, the speed of the software is amazing! For most of the analyses, I can simply run it on my desktop PC!

However, could you please explain the difference between using the full dataset of all genomes and using only the references? You mention that the full dataset is better for more detailed analyses, but what type of analyses? Is it in order to see whether my genomes fall within clusters of previously identified genomes, or is the separation better? For example, for my purposes, I may simply want to say whether an isolate and a MAG are likely the same strain/serotype. Wouldn't the reference db be enough for that?

Thanks! p

johnlees commented 1 year ago

From https://www.bacpop.org/poppunk/ 'For more detailed analyses, you may wish to download the all genomes database. If you wish to run either poppunk-visualise or any subclustering within strains this will require the full database.'

So for your purposes the reference-only version will be fine.

You only need the 'full' database if you wish to run visualisations where you get an NJ tree of the whole database, or if you want further levels of subclustering.

Some more information is here: https://poppunk.readthedocs.io/en/latest/model_distribution.html
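
For the use case in this thread (checking whether an isolate and a MAG land in the same strain cluster), a minimal sketch of assigning queries against a reference-only database might look like the following. The database directory, query list and output names are hypothetical placeholders, and the `poppunk_assign` flags shown (`--db`, `--query`, `--output`, `--threads`) are the ones I recall from the PopPUNK documentation, so please confirm them with `poppunk_assign --help` for your installed version. The snippet only builds and prints the command; it does not run PopPUNK.

```python
import shlex


def assign_command(db_dir, query_list, output_dir, threads=4):
    """Build a poppunk_assign call as an argument list.

    db_dir:      directory of the downloaded database (hypothetical name below);
                 the reference-only download should be enough for assignment.
    query_list:  text file listing the query genomes (hypothetical name below).
    output_dir:  directory where the assigned clusters will be written.
    """
    return [
        "poppunk_assign",
        "--db", db_dir,
        "--query", query_list,
        "--output", output_dir,
        "--threads", str(threads),
    ]


# Placeholder names, not real database identifiers.
cmd = assign_command("my_species_refs", "qfile.txt", "assigned_clusters")
print(shlex.join(cmd))
```

If the isolate and the MAG are given the same cluster in the output, they are likely the same strain. The full database would only come into play if you then wanted to run `poppunk_visualise` or subcluster within a strain, as described above.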