bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
88 stars 18 forks source link

Clusters.csv reports all samples belong to the same cluster #197

Closed VishnuRaghuram94 closed 2 years ago

VishnuRaghuram94 commented 2 years ago

Versions

poppunk 2.4.0
poppunk_sketch 1.7.4

Command used and output returned

poppunk --fit-model threshold --output threshold_poppunk --ref-db database_poppunk --threshold 0.0005 --threads 40 Describe the bug

I am trying to cluster a set of ~60,000 S. aureus genomes. I would ideally like to cluster genomes that are ~ 50 SNPs apart. I compared the core genome SNP distance vs the poppunk reported core distance for a small subsample of isolates and decided on a poppunk threshold of 0.0005 which corresponds to a median SNP distance of ~ 50.

I then used the commands

poppunk --create-db --output database_poppunk --r-files poppunk_query_list.txt --threads 40 --qc-filter continue

poppunk --fit-model threshold --output threshold_poppunk --ref-db staphopiaV2_poppunk --threshold 0.0005 --threads 40

The first command completed without any errors but the second command stopped with the error RecursionError: maximum recursion depth exceeded while calling a Python object

The resulting clusters.csv file states all samples belong to the same cluster. The Core vs Accessory distance graph is attached below image

I also attempted this with the lineage model and dbscan, and the resulting clusters.csv file in both cases reported all samples belonging to the same cluster. Running with lineage and dbscan did not present any errors.

Do you have any suggestions as to how I may proceed? Ultimately I just want to cluster genomes that are approximately 50 SNPs apart.

Thank you Vishnu

johnlees commented 2 years ago

You very likely have some low quality/contaminated data in your collection. All the points with pi > 0.1 and those with high accessory/low core are going to cause problems. I would suggest re-running the create db step removing samples until you see those points have been remove from the plot.

We've got a page with some suggestions on how to go about this here: https://poppunk.readthedocs.io/en/latest/qc.html

VishnuRaghuram94 commented 2 years ago

Hi,

Thank you for your help. I removed low quality samples and samples with pi > 0.1 and got the following graph after create db image

And ran the following fit-model command poppunk --fit-model threshold --threads 16 --ref-db poppunk_database_maxpidist0.1 --threshold 0.001 --output poppunk_threshold_fitmodel_maxpidist0.1 --max-pi-dist 0.1 --max-a-dist 0.4

The output was as follows. Still ran into the same problem of all samples belonging to the same cluster.

Graph-tools OpenMP parallelisation enabled: with 16 threads
Mode: Fitting threshold model to reference database

Selected type isolate for distance QC is DRX014040
Network summary:
        Components                              1
        Density                                 0.0707
        Transitivity                            0.8749
        Mean betweenness                        0.1542
        Weighted-mean betweenness               0.1542
        Score                                   0.8130
        Score (w/ betweenness)                  0.6876
        Score (w/ weighted-betweenness)         0.6876
Removing 42721 sequences

Done

There still seems to be a few samples with high accessory but low core distance, though I specified max-a-dist 0.4 . Are these still the issue? image

Please let me know if you have any advice.

Best, Vishnu

nickjcroucher commented 2 years ago

@VishnuRaghuram94 - if you're using a core threshold, these odd distances are likely to be the problem - I would define a --type-isolate you trust and use the maximum distances to filter out everything that diverges from this isolate. Note that --qc-filter continue keeps the dodgy sequences, whereas --qc-filter prune will filter them out.

VishnuRaghuram94 commented 2 years ago

I used poppunk_extract_distances.py to extract pairwise distances from the database generated above and extracted all pairs with core distance > 0.05 and accessory distance > 0.4. I then manually removed these isolates from my initial list of queries. I then ran the following database building command.

poppunk --create-db --output poppunk_filtered_db_maxpidist0.1 --r-files filtered_query_list.txt --threads 40 --qc-filter prune --max-pi-dist 0.05 --max-a-dist 0.4

I got the following graph as a result - the low quality samples appear to have been removed: image

I then ran poppunk --fit-model threshold --threads 40 --ref-db poppunk_filtered_db_maxpidist0.1 --threshold 0.001 --output poppunk_filtered_threshold_fitmodel_maxpidist0.1_DRX014091 --max-pi-dist 0.05 --max-a-dist 0.4 --type-isolate DRX014040

This was the graph I got after: image

Again, appears as if there are no more low quality samples. However clusters.csv still reports all samples belonging to the same cluster. The number of components is 1. I tried this with different type isolates and still got the same result.

PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v1.7.4
         sketchlib: /local/home/vraghu2/miniconda3/envs/poppunk_gpu/lib/python3.8/site-package
s/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 40 threads
Mode: Fitting threshold model to reference database

Selected type isolate for distance QC is DRX014040
Network summary:
        Components                              1
        Density                                 0.0675
        Transitivity                            0.9185
        Mean betweenness                        0.4573
        Weighted-mean betweenness               0.4573
        Score                                   0.8565
        Score (w/ betweenness)                  0.4648
        Score (w/ weighted-betweenness)         0.4648
Removing 37858 sequences

Done
johnlees commented 2 years ago

Your distance plot looks good, so I think your QC has worked to remove the bad samples originally mentioned. We have successfully made poppunk clusters with S. aureus before so this should work.

I notice the scores including betweenness are quite a bit lower. This makes me expect that the remaining issue is that there are some samples with (incorrect) zero distances joining everything together. I would recommend two things:

For the second, there is a slightly outdated example here: https://poppunk.readthedocs.io/en/latest/qc.html#with-the-distances In the newer code, you should be able to run something like:

poppunk_sketch --query --ref-db poppunk_filtered_db_maxpidist0.1 \
--query-db poppunk_filtered_db_maxpidist0.1 --cpus 40 --print | \
awk '$3 == 0 || $4 == 0 {print $0}' > dist_qc.txt

You can then do a sort | uniq -c on the dist_qc.txt file to find the samples that have lots of zero links.

nickjcroucher commented 2 years ago

@VishnuRaghuram94 - if that doesn't reveal any isolates, you could try looking at the network in Cytoscape (https://poppunk.readthedocs.io/en/latest/visualisation.html) to visually inspect for any problematic sequences.