bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

poppunk_visualise unable to find sequence in clustering #313

Closed: fgonzalez3 closed this issue 4 weeks ago

fgonzalez3 commented 1 month ago

Versions

These are the versions within my conda environment:

poppunk                   2.6.5           py310heb72de9_0    bioconda
pp-sketchlib              2.1.3           py310h37665e0_0    conda-forge

Command used and output returned

Data description: I am trying to assign PopPUNK clusters to S. pneumoniae sequences.

Command used:

poppunk_visualise --ref-db GPS_v8_ref --query-db data/GPSC_assignments --output example_viz --microreact

Output and error:

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading previously refined model
Completed model loading
Building phylogeny
Writing microreact output
Cannot find CS00174 in clustering

Describe the bug

I am receiving an error stating that one of my sequences (CS00174) cannot be found in the clustering. The clustering output files come from poppunk_assign, which I ran previously using the Snakemake rule below. When I check those output files, though, the supposedly missing sequence is present and assigned to a cluster. I could not find a similar issue, so I am posting here. Thanks for your help!

rule GPSC_assignment:
    """
    Assign GPSCs to our isolate sequences
    """
    input:
        queryseqs = "data/qfile_contigs.txt"
    output:
        "data/GPSC_assignments/GPSC_assignments_clusters.csv", 
        "data/GPSC_assignments/GPSC_assignments.dists.pkl", 
        "data/GPSC_assignments/GPSC_assignments.h5", 
        "data/GPSC_assignments/GPSC_assignments.dists.npy", 
        "data/GPSC_assignments/GPSC_assignments_external_clusters.csv", 
        "data/GPSC_assignments/GPSC_assignments_unword_clusters.csv"

    conda:
        "envs/poppunk.yaml"
    shell:
        """
        poppunk_assign --db GPS_v8_ref --external-clustering GPS_v8_external_clusters.csv \
        --query {input.queryseqs} --output data/GPSC_assignments --threads 4 --update-db
        """

Here is the qfile I use to specify the contig locations for poppunk_assign:

CS00174 data/assemblies/CS00174/contigs.fa
CS00175 data/assemblies/CS00175/contigs.fa
CS00176 data/assemblies/CS00176/contigs.fa
SP00057 data/assemblies/SP00057/contigs.fa
SP00058 data/assemblies/SP00058/contigs.fa
SP00059 data/assemblies/SP00059/contigs.fa
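
The manual check described above (confirming that CS00174 appears in the clustering output) can be sketched in Python. This is only an illustration: the helper name `files_listing_sample` is hypothetical, and the sketch assumes each cluster CSV has a header row and holds the sample name in its first column.

```python
import csv
from pathlib import Path


def files_listing_sample(sample, csv_paths):
    """Return the CSV paths whose first column contains `sample`.

    Assumption: every file has one header row and the sample name
    is stored in the first column.
    """
    hits = []
    for path in csv_paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            if any(row and row[0] == sample for row in reader):
                hits.append(str(path))
    return hits


if __name__ == "__main__":
    # Output directory taken from the poppunk_assign command in this thread.
    out_dir = Path("data/GPSC_assignments")
    cluster_files = sorted(out_dir.glob("*_clusters.csv"))
    print(files_listing_sample("CS00174", cluster_files))
```

If the sample is listed in a file here but poppunk_visualise still reports it as missing, the tool is probably reading a different clustering file from the one inspected.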

johnlees commented 1 month ago

You might need to add --previous-clustering, pointing it to the CSV file in which your query is assigned a cluster.
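
As a rough sketch of that suggestion, the snippet below assembles the original poppunk_visualise call with --previous-clustering added. The helper name `build_visualise_command` is hypothetical, and the CSV path is an assumption: the thread does not say which file was passed in the end, so the `_clusters.csv` file from the poppunk_assign output is used as a plausible candidate.

```python
import shlex


def build_visualise_command(ref_db, query_db, output, previous_clustering):
    """Build a poppunk_visualise argument list that names the cluster CSV explicitly."""
    return [
        "poppunk_visualise",
        "--ref-db", ref_db,
        "--query-db", query_db,
        "--output", output,
        "--previous-clustering", previous_clustering,
        "--microreact",
    ]


command = build_visualise_command(
    ref_db="GPS_v8_ref",
    query_db="data/GPSC_assignments",
    output="example_viz",
    # Assumption: this is the CSV in which the query samples were assigned clusters.
    previous_clustering="data/GPSC_assignments/GPSC_assignments_clusters.csv",
)

# Print the command for copy-pasting into a shell; nothing is executed here.
print(shlex.join(command))
```

The list form can also be passed directly to `subprocess.run(command, check=True)` once the paths are confirmed.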

fgonzalez3 commented 1 month ago

Looks like that was the issue. Thanks for your help!