bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

AttributeError in reference clique pruning #142

Closed: dsurujon closed this issue 3 years ago

dsurujon commented 3 years ago

Versions
poppunk 2.3.0.
poppunk_sketch 1.6.0.

Command used and output returned
I'm working with ~1200 bacterial genomes and have been trying multiple parameters for the model fitting. When I use dbscan, it fails to find distinct clusters. I have also tried bgmm, which does give clusters but then fails with a different error (below). I pruned the samples that didn't pass QC during DB creation, so I'm not sure whether this has to do with my samples or something else.

poppunk --fit-model dbscan --ref-db Ab_test --threads 40 --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/draw/cairo_draw.py:1494: RuntimeWarning: Error importing Gtk module: No module named 'gi'; GTK+ drawing will not work.
  warnings.warn(msg, RuntimeWarning)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 40 threads
Mode: Fitting dbscan model to reference database

Failed to find distinct clusters in this dataset

poppunk --fit-model bgmm --ref-db Ab_test --threads 40 --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/draw/cairo_draw.py:1494: RuntimeWarning: Error importing Gtk module: No module named 'gi'; GTK+ drawing will not work.
  warnings.warn(msg, RuntimeWarning)
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 40 threads
Mode: Fitting bgmm model to reference database

Fit summary:
    Avg. entropy of assignment  0.0012
    Number of components used   3

Scaled component means:
    [0.27495647 0.42169934]
    [0.76828691 0.78236551]
    [0.02920483 0.18810149]

Network summary:
    Components  86
    Density 0.1885
    Transitivity    1.0000
    Score   0.8114
Traceback (most recent call last):
  File "/home/defne/miniconda2/envs/poppunk_env/bin/poppunk", line 10, in <module>
    sys.exit(main())
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/PopPUNK/__main__.py", line 498, in main
    extractReferences(genomeNetwork, refList, output, threads = args.threads)
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/PopPUNK/network.py", line 228, in extractReferences
    vertex_list, edge_list = gt.shortest_path(G, check[i], check[j])
  File "/home/defne/miniconda2/envs/poppunk_env/lib/python3.8/site-packages/graph_tool/topology/__init__.py", line 2153, in shortest_path
    for e in v.in_edges() if g.is_directed() else v.out_edges():
AttributeError: 'numpy.uint64' object has no attribute 'out_edges'

johnlees commented 3 years ago

What are your sample names? Are they all numbers? I wonder if that might be causing the second error. If not, could you send me your .h5 file so I can try to replicate it?

DBSCAN doesn't always work. You can try changing the parameters as described in the docs (https://poppunk.readthedocs.io/en/latest/model_fitting.html#dbscan), but another model may be better. If you post the plots of your distance distribution and GMM fit here, I can probably comment on that. What species are you looking at?

dsurujon commented 3 years ago

The samples are from Acinetobacter baumannii, and their names are alphanumeric, not just numbers; most of them are SRA accessions (SRRNNNNNNN).
Here's the distance plot with the clusters identified (I tried a few different values for K; 3 seemed to work best): Ab_poppunk_pruned_DPGMM_fit.
I'll try changing those parameters first. Also, I had to downgrade joblib from 1.0.0 to 0.17.0. The documentation lists the dependencies, and I had more up-to-date versions of some of those packages. I wasn't able to downgrade some of them (e.g. pp-sketch) due to conflicts.

johnlees commented 3 years ago

That fit looks pretty good to me!

Would you mind posting the output of conda list here so I can see if there's anything obvious in terms of packages? If you are able to share your h5 file somehow (it's anonymised and doesn't contain any sequence), I'd like to try to replicate your graph-tool error.
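
When comparing environments like this, it can help to pull out just the packages the model fitting depends on rather than reading the whole listing. A minimal sketch, assuming the usual whitespace-separated text layout printed by conda list; the function name parse_conda_list and the choice of packages to report are illustrative and not part of PopPUNK:

```python
def parse_conda_list(text):
    """Map package name -> version from the text printed by `conda list`."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines conda prints first.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


# Sample rows in the same layout as `conda list` output.
sample = """\
# Name                    Version                   Build  Channel
graph-tool                2.29             py38hcba731a_1    conda-forge
numpy                     1.19.5           py38h18fd61f_0    conda-forge
poppunk                   2.3.0                      py_0    bioconda
pp-sketchlib              1.6.0            py38h3ac2cac_0    conda-forge
"""

found = parse_conda_list(sample)
for name in ("poppunk", "pp-sketchlib", "graph-tool", "numpy"):
    print(name, found.get(name, "not installed"))
```

In practice the sample string would be replaced with the saved output of conda list from each environment, so the two sets of versions can be compared side by side.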

dsurujon commented 3 years ago

Here's the h5 file: https://drive.google.com/file/d/1AVPjbC6aFxV6YH6H6t6Sl2SMRR3yubdp/view?usp=sharing
And here's the package list:

# packages in environment at /home/defne/miniconda2/envs/poppunk_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
boost                     1.72.0           py38h1e42940_1    conda-forge
boost-cpp                 1.72.0               h9359b55_3    conda-forge
brotlipy                  0.7.0           py38h8df0ef7_1001    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.11.0               h470a237_1    bioconda
ca-certificates           2020.12.5            ha878542_0    conda-forge
cached-property           1.5.1                      py_0    conda-forge
cairo                     1.16.0            h9f066cc_1006    conda-forge
cairomm                   1.12.2                        2    conda-forge
cairomm-1.0               1.12.2               h0069156_2    conda-forge
certifi                   2020.12.5        py38h578d9bd_0    conda-forge
cffi                      1.14.4           py38ha312104_0    conda-forge
chardet                   4.0.0            py38h578d9bd_0    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cryptography              3.3.1            py38h2b97feb_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
decorator                 4.4.2                      py_0    conda-forge
dendropy                  4.5.1              pyh3252c3a_0    bioconda
expat                     2.2.9                he1b5a44_2    conda-forge
flask                     1.1.2              pyh9f0ad1d_0    conda-forge
flask-cors                3.0.8                      py_0    conda-forge
fontconfig                2.13.1            h7e3eb15_1002    conda-forge
freetype                  2.10.4               h7ca028e_0    conda-forge
gettext                   0.19.8.1          hf34092f_1004    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
graph-tool                2.29             py38hcba731a_1    conda-forge
h5py                      3.1.0           nompi_py38hafa665b_100    conda-forge
hdbscan                   0.8.26           py38h0b5ebd8_3    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1113    conda-forge
icu                       67.1                 he1b5a44_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
itsdangerous              1.1.0                      py_0    conda-forge
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
joblib                    0.17.0                     py_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
kiwisolver                1.3.1            py38h82cb98a_0    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.11                 hcbb858e_1    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_1    conda-forge
libblas                   3.9.0                6_openblas    conda-forge
libcblas                  3.9.0                6_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h5dbcf3e_17    conda-forge
libgfortran-ng            9.3.0               he4bcb1c_17    conda-forge
libgfortran5              9.3.0               he4bcb1c_17    conda-forge
libglib                   2.66.3               hbe7bbb4_0    conda-forge
libgomp                   9.3.0               h5dbcf3e_17    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0                6_openblas    conda-forge
libnghttp2                1.41.0               hab1572f_1    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h2ae2ef3_17    conda-forge
libtiff                   4.2.0                hdc55705_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libwebp-base              1.1.0                h36c2ea0_3    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.10               h68273f3_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markupsafe                1.1.1            py38h8df0ef7_2    conda-forge
matplotlib-base           3.3.3            py38h5c7f4ab_0    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
networkx                  2.5                        py_0    conda-forge
numpy                     1.19.5           py38h18fd61f_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openblas                  0.3.12          pthreads_h04b7a96_1    conda-forge
openssl                   1.1.1i               h7f98852_0    conda-forge
pandas                    1.2.0            py38h51da96c_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pillow                    8.1.0            py38h357d4e7_0    conda-forge
pip                       20.3.3             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
poppunk                   2.3.0                      py_0    bioconda
pp-sketchlib              1.6.0            py38h3ac2cac_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pycairo                   1.20.0           py38h323dad1_1    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysocks                   1.7.1            py38h924ce5b_2    conda-forge
python                    3.8.2           he5300dc_7_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.5             pyhd8ed1ab_0    conda-forge
rapidnj                   2.3.2                hc9558a2_0    bioconda
readline                  8.0                  he28a2e2_2    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
scikit-learn              0.24.0           py38h658cfdd_0    conda-forge
scipy                     1.6.0            py38hb2138dd_0    conda-forge
setuptools                49.6.0           py38h924ce5b_2    conda-forge
sharedmem                 0.3.6                      py_0    bioconda
sigcpp-2.0                2.10.3               h58526e2_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sparsehash                2.0.2                         0    bioconda
sqlite                    3.34.0               h74cdb3f_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
tornado                   6.1              py38h25fe258_0    conda-forge
urllib3                   1.26.2             pyhd8ed1ab_0    conda-forge
werkzeug                  1.0.1              pyh9f0ad1d_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xorg-compositeproto       0.4.2                         0    conda-forge
xorg-damageproto          1.2.1             h516909a_1002    conda-forge
xorg-fixesproto           5.0               h14c3975_1002    conda-forge
xorg-inputproto           2.3.2             h14c3975_1002    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.12               h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxaw               1.0.13            h516909a_1002    conda-forge
xorg-libxcomposite        0.4.5                h516909a_0    conda-forge
xorg-libxcursor           1.2.0                h516909a_0    conda-forge
xorg-libxdamage           1.1.5                h516909a_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxfixes            5.0.3             h516909a_1004    conda-forge
xorg-libxi                1.7.10               h516909a_0    conda-forge
xorg-libxinerama          1.1.4             hf484d3e_1000    conda-forge
xorg-libxmu               1.1.3                h516909a_0    conda-forge
xorg-libxpm               3.5.13               h516909a_0    conda-forge
xorg-libxrandr            1.5.2                h516909a_1    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-libxt                1.1.5             h516909a_1003    conda-forge
xorg-randrproto           1.5.0             h516909a_1001    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-util-macros          1.19.2            h36c2ea0_1001    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.8                ha95c52a_1    conda-forge

johnlees commented 3 years ago

I can't access the file on Drive (but I have sent you a request for access).

Thinking a bit more about the fit, I think it would be worth trying fit refinement from your K = 3 fit, as that might optimise it a little further: https://poppunk.readthedocs.io/en/latest/model_fitting.html#refine

johnlees commented 3 years ago

Thanks for sharing the file. Oddly this does work for me:

python ~/Documents/PopPUNK/poppunk-runner.py --fit-model bgmm --ref-db Ab_test --output Ab_test_fit --distances Ab_test/Ab_test.dists --qc-filter prune --max-a-dist 0.85 --K 3 --min-cluster-prop 0.001
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.6.0
     sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)

Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting bgmm model to reference database

Fit summary:
    Avg. entropy of assignment  0.0017
    Number of components used   3

Scaled component means:
    [0.26120475 0.42521224]
    [0.75959672 0.76000775]
    [0.02938571 0.18756266]

Network summary:
    Components  84
    Density 0.1885
    Transitivity    0.9997
    Score   0.8113
Removing 1086 sequences

Done

I am using graph-tool 2.35, whereas you have 2.29. Since that is where the error appears to be coming from, maybe you could try upgrading with conda install "graph-tool>=2.35" (quoted so the shell doesn't treat > as an output redirect)?
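
A quick way to confirm the upgrade took effect is to compare the reported version against the 2.35 that works here. A minimal sketch, assuming plain dotted numeric version strings; version_tuple and meets_minimum are hypothetical helper names, and the installed version has to be supplied by hand (for example, read off conda list):

```python
def version_tuple(version):
    """Turn a dotted numeric version such as '2.35' into (2, 35)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


# 2.29 is the version from the failing environment, 2.35 the working one.
for installed in ("2.29", "2.35"):
    status = "ok" if meets_minimum(installed, "2.35") else "upgrade needed"
    print(f"graph-tool {installed}: {status}")
```

Comparing the parts as integers matters: as plain strings, "2.9" would sort above "2.35".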

dsurujon commented 3 years ago

That did the trick! Thank you very much for the quick response, I really appreciate it!