AttributeError in reference clique pruning #142

Closed: dsurujon closed this 3 years ago

dsurujon commented 3 years ago

Versions

poppunk 2.3.0
poppunk_sketch 1.6.0

Command used and output returned

I'm working with ~1200 bacterial genomes and have been trying multiple parameters for the model fitting. When I use dbscan, it fails to find distinct clusters. I have also tried bgmm, which does give clusters but then fails with a different error (below). I've pruned the samples that didn't pass QC during DB creation, so I'm not sure whether this has to do with my samples or something else.

poppunk --fit-model dbscan --ref-db Ab_test --threads 40 --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/draw/ RuntimeWarning: Error importing Gtk module: No module named 'gi'; GTK+ drawing will not work.
  warnings.warn(msg, RuntimeWarning)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/

Graph-tools OpenMP parallelisation enabled: with 40 threads
Mode: Fitting dbscan model to reference database

Failed to find distinct clusters in this dataset
poppunk --fit-model bgmm --ref-db Ab_test --threads 40 --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/draw/ RuntimeWarning: Error importing Gtk module: No module named 'gi'; GTK+ drawing will not work.
  warnings.warn(msg, RuntimeWarning)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/

Graph-tools OpenMP parallelisation enabled: with 40 threads
Mode: Fitting bgmm model to reference database

Fit summary:
    Avg. entropy of assignment  0.0012
    Number of components used   3

Scaled component means:
    [0.27495647 0.42169934]
    [0.76828691 0.78236551]
    [0.02920483 0.18810149]

Network summary:
    Components  86
    Density 0.1885
    Transitivity    1.0000
    Score   0.8114
Traceback (most recent call last):
  File "/home/defne/miniconda2/envs/poppunk_env/bin/poppunk", line 10, in <module>
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/PopPUNK/", line 498, in main
    extractReferences(genomeNetwork, refList, output, threads = args.threads)
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/PopPUNK/", line 228, in extractReferences
    vertex_list, edge_list = gt.shortest_path(G, check[i], check[j])
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/topology/", line 2153, in shortest_path
    for e in v.in_edges() if g.is_directed() else v.out_edges():
AttributeError: 'numpy.uint64' object has no attribute 'out_edges'

johnlees commented 3 years ago

What are your sample names? Are they all numbers? I wonder if that might be causing the second error. If not, could you send me your .h5 file so I can try to replicate it?

DBSCAN doesn't always work; you can try changing the parameters as described in the docs, but another model may be better. If you post the plots of your distance distribution and GMM fit here, I can probably comment on that. What species are you looking at?
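A quick way to answer the sample-name question is to check whether every name is made up only of digits. The snippet below is a minimal sketch: `all_numeric` is a hypothetical helper (not part of PopPUNK), and the sample names are made up for illustration.

```python
def all_numeric(names):
    """Return True if every sample name consists only of digits."""
    names = list(names)
    # An empty list is treated as "not all numeric" to avoid a vacuous True.
    return bool(names) and all(name.strip().isdigit() for name in names)


# Made-up example names; substitute the names used to build the database.
samples = ["SRR1234567", "SRR7654321", "isolate_12"]
print(all_numeric(samples))         # False
print(all_numeric(["101", "102"]))  # True
```

If this prints True for the real names, purely numeric labels are at least a candidate explanation worth testing; if it prints False, the cause is more likely elsewhere.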

dsurujon commented 3 years ago

The samples are from Acinetobacter baumannii, and their names are alphanumeric rather than purely numeric; most of them are SRA accessions of the form SRRNNNNNNN.
Here's the distance plot with the clusters identified (I tried a few different values for K; 3 seemed to work best): [image: Ab_poppunk_pruned_DPGMM_fit]
I'll try changing those parameters first. Also, I had to downgrade joblib from 1.0.0 to 0.17.0. The documentation lists the dependencies, and I had more recent versions of some of those packages. I wasn't able to downgrade some of them (e.g. pp-sketch) due to conflicts.

johnlees commented 3 years ago

That fit looks pretty good to me!

Would you mind posting the output of your conda list here so I can check whether anything stands out among the packages? If you are able to share your h5 file somehow (it's anonymised and doesn't contain any sequence), I'd like to try to replicate your graph-tool error.

dsurujon commented 3 years ago

Here's the h5 file:
And here's the packages list

johnlees commented 3 years ago

I can't access the file on Drive (but I have sent you a request for access).

Thinking a bit more about the fit, I think it would be worth trying fit refinement from your K = 3 fit, as that might optimise it a little further.

johnlees commented 3 years ago

Thanks for sharing the file. Oddly this does work for me:

python ~/Documents/PopPUNK/ --fit-model bgmm --ref-db Ab_test --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/

Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting bgmm model to reference database

Fit summary:
    Avg. entropy of assignment  0.0017
    Number of components used   3

Scaled component means:
    [0.26120475 0.42521224]
    [0.75959672 0.76000775]
    [0.02938571 0.18756266]

Network summary:
    Components  84
    Density 0.1885
    Transitivity    0.9997
    Score   0.8113
Removing 1086 sequences


I am using graph-tool 2.35, whereas you have 2.29. Since the error appears to come from graph-tool, maybe you could try upgrading with conda install "graph-tool>=2.35"? (The quotes stop the shell from treating > as an output redirect.)
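To confirm which graph-tool version an environment actually picks up, a small check like the following can help. It is a sketch under assumptions: `parse_version` and `is_at_least` are hypothetical helpers, and it assumes `graph_tool.__version__` is a string that starts with the version number (possibly followed by extra text such as a commit hash).

```python
import re


def parse_version(text):
    """Extract the leading numeric components, e.g. '2.35 (commit x)' -> (2, 35)."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(installed, minimum="2.35"):
    """Compare version strings component by component."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        import graph_tool
    except ImportError:
        print("graph-tool is not importable in this environment")
    else:
        # Assumption: the module exposes a __version__ string.
        version = getattr(graph_tool, "__version__", None)
        if version is None:
            print("could not determine the graph-tool version")
        else:
            status = "OK" if is_at_least(version) else "older than 2.35"
            print(version, status)
```

Run it with the Python interpreter from the same conda environment that runs poppunk, so it reports the version PopPUNK itself would import.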

dsurujon commented 3 years ago

That did the trick! Thank you very much for the quick response, I really appreciate it!