bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Taxon not being assigned correctly #333

Open nermze opened 2 days ago

nermze commented 2 days ago

Versions

conda env with python 3.8
poppunk v2.7.0
pp-sketchlib 2.1.3 (also tried 2.0.0)

Command used and output returned

poppunk_assign --db ../Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters --threads 8

PopPUNK: assign
    (with backend: sketchlib v2.0.0
     sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 8 threads
Sketching 228 genomes using 8 thread(s)
Progress (CPU): 228 / 228
Writing sketches to file
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from ../Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs.refs_graph.gt
Network loaded: 339 samples
Found novel query clusters. Calculating distances between them.
Calculating all query-query distances
Calculating random match chances using Monte Carlo
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%

Done

poppunk_downgraded_clusters.csv

Describe the bug

Clusters are not assigned in the range 1-251; instead they are numbered 252 and up. See attached file.
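
As a rough way to check this from the output, the sketch below reads a cluster CSV and lists the samples whose cluster number falls above the existing range. It assumes the file has `Taxon` and `Cluster` columns and that 251 is the highest existing cluster (the figure from the description above); `novel_assignments` is a hypothetical helper, not part of PopPUNK.

```python
import csv
import io

def novel_assignments(csv_file, max_reference_cluster=251):
    """Return (taxon, cluster) pairs whose cluster number exceeds the reference range.

    Assumes 'Taxon' and 'Cluster' columns; adjust the names if your file differs.
    """
    novel = []
    for row in csv.DictReader(csv_file):
        cluster = row["Cluster"]
        # Non-numeric labels are skipped rather than guessed at.
        if cluster.isdigit() and int(cluster) > max_reference_cluster:
            novel.append((row["Taxon"], int(cluster)))
    return novel

# Made-up example rows, not taken from the attached file.
example = io.StringIO("Taxon,Cluster\nsample_a,12\nsample_b,252\nsample_c,300\n")
print(novel_assignments(example))
```

To run it on real output, pass an open file handle for the clusters CSV instead of the in-memory example.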

johnlees commented 1 day ago

Thanks for raising this.

nermze commented 1 day ago

Thanks for raising this.

  • What do you expect the clusters for these samples to be, and why?
  • Does --update-db make any difference?
  • Have you run any of the visualisation tools to see what this looks like on a tree? This could be a helpful diagnostic.

Hi, running poppunk_assign with --update-db causes the program to crash. It says the db file is not found in the folder, even though it is present. Maybe it's something I have misunderstood?

Poppunk_assign:

poppunk_assign --db Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters --threads 8 
PopPUNK: assign
    (with backend: sketchlib v2.0.0
     sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 8 threads
Sketching 228 genomes using 8 thread(s)
Progress (CPU): 228 / 228
Writing sketches to file
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs.refs_graph.gt
Network loaded: 339 samples
Found novel query clusters. Calculating distances between them.
Calculating all query-query distances
Calculating random match chances using Monte Carlo
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%

Done

Now re-running the same with --update-db (the error is the same no matter which db I specify):

poppunk_assign --db Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters_db_update --threads 8 --update-db
PopPUNK: assign
    (with backend: sketchlib v2.0.0
     sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences

Graph-tools OpenMP parallelisation enabled: with 8 threads
Looking for existing sketches in poppunk_clusters_db_update/poppunk_clusters_db_update.h5
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs_graph.gt
Traceback (most recent call last):
  File "/home/bioinf/miniconda3/envs/poppunk/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 505, in assign_query_hdf5
    fetchNetwork(prev_clustering,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/network.py", line 113, in fetchNetwork
    genomeNetwork = load_network_file(network_file, use_gpu = use_gpu)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/network.py", line 149, in load_network_file
    genomeNetwork = gt.load_graph(fn)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/graph_tool/__init__.py", line 3666, in load_graph
    g.load(file_name, fmt, ignore_vp, ignore_ep, ignore_gp)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/graph_tool/__init__.py", line 3165, in load
    with open(file_name) as f: # throw the appropriate exception
FileNotFoundError: [Errno 2] No such file or directory: 'Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs_graph.gt'

Poppunk_visualisation doesn't produce any results either.
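
Comparing the two logs, the run without --update-db loads `Haemophilus_influenzae_v2_refs.refs_graph.gt`, while the --update-db run looks for `Haemophilus_influenzae_v2_refs_graph.gt`. A minimal sketch for checking which network files are actually in the database folder follows; `find_graph_files` and `check_graph_path` are hypothetical helpers written for this check, not PopPUNK functions.

```python
from pathlib import Path

def find_graph_files(db_dir):
    """List the graph-tool network files (*.gt) present in a database directory."""
    db_dir = Path(db_dir)
    if not db_dir.is_dir():
        return []
    return sorted(p.name for p in db_dir.glob("*.gt"))

def check_graph_path(requested):
    """Report whether the requested network file exists, and what is there instead."""
    requested = Path(requested)
    if requested.is_file():
        return f"{requested} exists"
    found = find_graph_files(requested.parent)
    return f"{requested} is missing; .gt files in {requested.parent}: {found}"

# Path copied from the FileNotFoundError above.
print(check_graph_path(
    "Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs_graph.gt"
))
```

Run from the same working directory as the poppunk_assign command, this prints the `.gt` files that are present, which can then be compared with the name in the error message.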

johnlees commented 1 day ago

Ah right sorry, update-db won't work with the ref only fit, only the full database.

You mentioned in an email you think the v2 database might be the issue. Could you try with v1, which is available here: https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Haemophilus_influenzae_v1_refs.tar.bz2
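
For convenience, a minimal sketch of fetching and unpacking that archive with the Python standard library. The URL is the one given above; `extract_db` and `fetch_db` are hypothetical helper names, and nothing is assumed here about the archive's contents.

```python
import tarfile
import urllib.request
from pathlib import Path

V1_URL = "https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Haemophilus_influenzae_v1_refs.tar.bz2"

def extract_db(archive_path, dest_dir="."):
    """Unpack a .tar.bz2 archive and return the names of its top-level entries."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = tar.getnames()
        # Use the safer 'data' extraction filter where this Python version provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return sorted({name.split("/")[0] for name in names})

def fetch_db(url=V1_URL, dest_dir="."):
    """Download the archive into dest_dir, then unpack it there."""
    archive = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    return extract_db(archive, dest_dir)

# Usage (performs a network download, so it is left commented out):
# print(fetch_db())
```

The extracted folder can then be passed to poppunk_assign via --db, as in the commands earlier in the thread.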

nermze commented 1 day ago

Hi, the v1 database works, assigning the correct taxon. So it seems the problem lies with v2.

On Fri, Oct 18, 2024 at 10:30 AM John Lees @.***> wrote:

Ah right sorry, update-db won't work with the ref only fit, only the full database.

You mentioned in an email you think the v2 database might be the issue. Could you try with v1, which is available here: https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Haemophilus_influenzae_v1_refs.tar.bz2

-- Kind regards, Nermin

johnlees commented 1 day ago

Ok thanks, I'll look into this at some point soon. I assume using v1 solves your immediate issues?

nermze commented 1 day ago

Ok thanks, I'll look into this at some point soon. I assume using v1 solves your immediate issues?

Yes, v1 works flawlessly for both the reference and full databases. We would of course like to use v2 for the final publication, but if there are not many changes between them then it's fine for now. Thank you for the help.