bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
86 stars 17 forks source link

Issue with a large Pseudomonas dataset #299

Open RyanCFink opened 5 months ago

RyanCFink commented 5 months ago

About the dataset

We have two datasets, one with 8046 Pseudomonas genomes, and another, a subset of the first, with 6347 Pseudomonas aeruginosa genomes. For the issues described here we will focus on the latter, but the issues are common to both.

About the issue

As part of our workflow, we create subsets based on the core and accessory distances, as infered by PopPUNK. In this dataset we found that many of the distances for the genomes are 0 for both the accessory and the core genomes.

Here are the command we used to extract the distance table:

  1. poppunk --create-db
poppunk \
    --create-db \
    --output poppunk_db \
    --r-files poppunk_samplesheet.tsv  \
    --threads 24
  1. poppunk_extract_distances.py
poppunk_extract_distances.py \
    --distances poppunk_db/poppunk_db.dists \
    --output poppunk_db_distances.tsv
  1. We also fit to the dbscan model:
poppunk \
    --fit-model dbscan \
    --ref-db poppunk_db \
    --output poppunk_db \
    --threads 24 \
    --k-step 3 --min-k 17 --max-k 41 --plot-fit 5

Here is the resulting figure:

  1. We then tried to subset the genomes based on their "similarity" (1-distance). Here are the genomes retained for a few different thresholds:

You can see the number of retained genomes is very small compared to the input size (6347), this is something we've only encountered in this dataset.

Other datasets of similar size have yielded more sensible subsets, e.g. an 11373 genome dataset resulted in these subsets (with a few more thresholds displayed to show how it scales):

Some things we've tried

  Graph-tools OpenMP parallelisation enabled: with 1 threads
  Running sequence QC
  Running QC on sketches
  93 samples failed
  Running distance QC
  Selected type isolate for distance QC is SAMD00012124
  5282 samples failed
  Traceback (most recent call last):
    File "/usr/local/bin/poppunk", line 11, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/PopPUNK/__main__.py", line 413, in main
      raise RuntimeError('Type isolate ' + qc_dict['type_isolate'] + \
  RuntimeError: Type isolate SAMD00012124 not found in isolates after QC; check name of type isolate and QC options

Before adding to the database:

After:

johnlees commented 5 months ago

Regarding the QC options, could you try with:

--max-pi-dist 1 --max-a-dist 1 --max-zero-dist 20 --length-sigma 5 --prop-n 0.1

It would be useful to know from the output (which I think you should still get) the reasons isolates are being removed

Regarding query assignment: this does not update the distance plot but you should still get clusters assignments out. Are you having problems/errors in this output? A flag that can be helpful in these cases is --serial which means that assignments are made one by one, so if there are problematic input sequences they only cause issues in their own assignments. See also: https://poppunk.bacpop.org/troubleshooting.html#most-all-of-my-samples-merge-when-i-run-a-query

RyanCFink commented 5 months ago

We tried the extra flags you provided when running the QC, this time it ran to completion, but the underlying issue remains the same:

PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v2.0.1
         sketchlib: /usr/local/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 1 threads
Running sequence QC
Running QC on sketches
93 samples failed
Running distance QC
0 samples failed
Removing 93 sequences
Recalculating random matches with strand_preserved = False
Calculating random match chances using Monte Carlo

Done

We were wondering, since you used a similar Pseudomonas dataset in the original PopPUNK paper and in bacpop, if you could share:

johnlees commented 5 months ago

@samhorsfield96 I think you created the pseudomonas DB? Do you remember/know if there was anything special you did?

We wouldn't have changed the sequence files (e.g. removing plasmids), but we would have removed isolates. It still looks to me like you have a few isolates that need to be removed to get this to work. The other thing that would be useful is to find those which are giving zero core distance. You can do that with a command like:

sketchlib query dist poppunk_db | awk '$3 == 0 && $4 > 0.1 {print $1}' | sort | uniq -c | sort -k2,2 -n -r

Which should give you the isolates with the most 'bad' distances from most to least

samhorsfield96 commented 5 months ago

Hi both,

The database on the bacpop website was generated using:

poppunk --create-db --output db --r-files input.txt --qc-filter prune --length-sigma 1 --plot-fit 3 --min-k 17 --max-k 41 --sketch-size 100000

From memory I had to increase the sketch size and k-mer ranges to increase resolution at smaller core distances. I also reduced the genome size standard deviation boundary to remove likely contaminated genomes.

Let me know if you have any further questions.