bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

fit-model RecursionError #186

Closed ChadFibke closed 2 years ago

ChadFibke commented 2 years ago

Hi @johnlees, thanks for making this interesting program! I'm currently running into the issue described below:

Versions

  poppunk 2.4.0
  poppunk_sketch 1.7.4 

Describe the bug

I ran the following command to generate the database:

poppunk \
    --strand-preserved \
    --create-db \
    --output ../output/poppunk_database \
    --r-files ../input/rlist.txt \
    --qc-filter prune \
    --threads 8

This constructed all the expected files and the following distance distribution plot:

[Image: poppunk distance distribution plot]

I then fit a model to the database using the following command:

poppunk \
    --fit-model dbscan \
    --threads 2 \
    --ref-db ../output/poppunk_database \
    --output ../output/poppunk_database

The following output was received:

PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.7.4
     sketchlib: /home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 2 threads
Mode: Fitting dbscan model to reference database

Selected type isolate for distance QC is 
Assigning distances with DBSCAN model
100%|██████████| 273315/273315 [17:41:45<00:00,  4.29it/s]
Fit summary:
    Number of clusters  55
    Number of datapoints    100000
    Number of assignments   45418

Scaled component means
    [0.44485143 0.54643321]
    [0.08999523 0.16832361]
    [0.61683899 0.33839482]
    [0.590527   0.16784331]
    [0.3090418  0.01128705]
    [0.41989657 0.0177231 ]
    [0.40097317 0.01952238]
    [0.46180746 0.02018453]
    [0.48417991 0.01629377]
    [0.50239229 0.01412444]
    [0.68350607 0.0176661 ]
    [0.66616762 0.01791641]
    [0.52097344 0.01263184]
    [0.5921765  0.00824973]
    [0.61494625 0.00861954]
    [2.31415913e-01 2.68792064e-05]
    [1.49078533e-01 2.35816424e-05]
    [0.07127412 0.00010718]
    [3.08908999e-01 1.67031740e-05]
    [3.83228034e-01 9.49449532e-06]
    [3.46794635e-01 1.63410405e-05]
    [3.64722282e-01 1.21551366e-05]
    [4.16302949e-01 6.54760288e-06]
    [4.38891977e-01 1.87010539e-06]
    [4.55783635e-01 1.20976370e-06]
    [4.73785698e-01 4.05198222e-07]
    [0.7829355 0.       ]
    [0.77276397 0.        ]
    [0.76216906 0.        ]
    [0.76741725 0.        ]
    [7.25914896e-01 2.41283004e-07]
    [4.83551025e-01 8.55768633e-07]
    [6.84362352e-01 6.78959111e-07]
    [4.96502906e-01 5.68464031e-07]
    [5.06929219e-01 1.09953184e-07]
    [6.74612582e-01 8.46836699e-08]
    [0.67062086 0.        ]
    [6.63987517e-01 2.01074926e-07]
    [0.65565485 0.        ]
    [5.15947104e-01 3.08592348e-07]
    [5.90514898e-01 1.69901980e-07]
    [6.11604810e-01 2.93487062e-07]
    [6.26271248e-01 3.35471071e-07]
    [6.39384091e-01 3.14026494e-07]
    [0.64804345 0.        ]
    [0.64348441 0.        ]
    [5.20000637e-01 4.31512149e-07]
    [5.24051130e-01 1.38705673e-07]
    [0.52881938 0.        ]
    [5.36367476e-01 4.17425419e-08]
    [6.32131875e-01 2.37707894e-07]
    [0.63606 0.     ]
    [5.71722984e-01 1.29910333e-07]
    [5.54656744e-01 7.86601575e-08]
    [5.44970810e-01 1.46818904e-08]

Network summary:
    Components              5246
    Density                 0.0267
    Transitivity                0.6393
    Mean betweenness            0.4573
    Weighted-mean betweenness       0.1429
    Score                   0.6222
    Score (w/ betweenness)          0.3377
    Score (w/ weighted-betweenness)     0.5333
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 209, in cliquePrune
    ref_list = getCliqueRefs(subgraph, refs)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 190, in getCliqueRefs
    getCliqueRefs(subgraph, reference_indices)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 190, in getCliqueRefs
    getCliqueRefs(subgraph, reference_indices)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 190, in getCliqueRefs
    getCliqueRefs(subgraph, reference_indices)
  [Previous line repeated 962 more times]
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 188, in getCliqueRefs
    subgraph = gt.GraphView(G, vfilt=[v not in clique for v in G.vertices()])
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 3565, in __init__
    ef[0] = self.own_property(ef[0].copy())
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 400, in copy
    return self.get_graph().copy_property(self, value_type=value_type,
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/decorators.py", line 100, in wrapper
    return f(*args, **kwargs)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/decorators.py", line 100, in wrapper
    return f(*args, **kwargs)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 2787, in copy_property
    tgt = self.new_property(src.key_type(),
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 2693, in new_property
    return self.new_edge_property(value_type, vals)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 2725, in new_edge_property
    prop = EdgePropertyMap(new_edge_property(_type_alias(value_type),
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 884, in __init__
    PropertyMap.__init__(self, pmap, g, "e")
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 348, in __init__
    self.__convert = _converter(self.value_type())
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 264, in _converter
    vtype = _python_type(val_type)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 223, in _python_type
    type_name = _type_alias(type_name)
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 212, in _type_alias
    if type_name in value_types():
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/graph_tool/__init__.py", line 3639, in value_types
    return libcore.get_property_types()
RecursionError: maximum recursion depth exceeded while calling a Python object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/chad.fibke/.conda/envs/POPpunk/bin/poppunk", line 11, in <module>
    sys.exit(main())
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/__main__.py", line 550, in main
    extractReferences(genomeNetwork,
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/site-packages/PopPUNK/network.py", line 339, in extractReferences
    ref_lists = pool.map(partial(cliquePrune,
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/chad.fibke/.conda/envs/POPpunk/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
RecursionError: maximum recursion depth exceeded while calling a Python object
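
For readers following the traceback: the `getCliqueRefs` frame calls itself once per removed clique ("Previous line repeated 962 more times"), so a component that needs roughly a thousand or more removals can hit CPython's default recursion limit of 1000. The sketch below reproduces that failure shape in plain Python and shows a loop-based equivalent. It is an illustration under stated assumptions, not PopPUNK's code: the graph is a dict of neighbour sets rather than a graph-tool graph, and `greedy_clique`, `refs_recursive` and `refs_iterative` are hypothetical names.

```python
import sys


def greedy_clique(adj, remaining):
    """Grow one clique greedily from the smallest remaining vertex.

    Assumption: a simple stand-in for a real maximal-clique search.
    """
    start = min(remaining)
    clique = {start}
    for v in sorted(adj[start] & remaining):
        # Keep v only if it is adjacent to every vertex already in the clique.
        if all(v in adj[u] for u in clique):
            clique.add(v)
    return clique


def refs_recursive(adj, remaining, refs):
    """Pick one reference per clique, using one stack frame per removed clique."""
    if not remaining:
        return refs
    clique = greedy_clique(adj, remaining)
    refs.add(min(clique))
    return refs_recursive(adj, remaining - clique, refs)


def refs_iterative(adj):
    """Same selection as refs_recursive, but with a loop instead of recursion."""
    remaining = set(adj)
    refs = set()
    while remaining:
        clique = greedy_clique(adj, remaining)
        refs.add(min(clique))
        remaining -= clique
    return refs


# Build more one-vertex cliques than the interpreter allows stack frames.
n = sys.getrecursionlimit() + 100
adj = {v: set() for v in range(n)}

try:
    refs_recursive(adj, set(adj), set())
    recursive_failed = False
except RecursionError:
    recursive_failed = True

iterative_refs = refs_iterative(adj)
print("recursive version failed:", recursive_failed)
print("references found iteratively:", len(iterative_refs))
```

Raising the limit with `sys.setrecursionlimit` would only move the threshold; the loop form avoids the dependence on stack depth altogether.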

I checked the output location, and all of the expected files were written and seem complete (_clusters.csv, _unword_clusters.csv, _graph.gt). In addition, the following HDBSCAN cluster plot was provided:

[Image: poppunk_database_dbscan cluster plot]

There appears to be poor clustering in the initial model, so I'm currently running a poppunk --fit-model refine command.

However, I was wondering if the recursion issue is interfering with the integrity of the clustering and whether it is safe to continue with refining the model?

Best, Chad

johnlees commented 2 years ago

Hi, thanks for asking about this case.

It looks to me like none of the GMM, DBSCAN or refine models would fit well to this data, based on that distance distribution plot. There aren't distinct components, and specifically not one near the origin separated from others.

First thing to check: are these core/accessory distances what you expect? What species are you looking at? Is it perhaps within a strain, or a less diverse virus?

I think there are three things you could try here:

ChadFibke commented 2 years ago

Hi @johnlees,

I'm working with SARS-CoV-2. The inputs are consensus sequences (called variants superimposed onto the Wuhan reference). I did not expect to see much variation, and was surprised to see any in the accessory distances.

Thank you for all the options! I successfully ran the lineage model, and it seems to satisfy our purposes (controlling for population structure in an elastic net model). I will also look into PopPIPE!

Best, Chad

johnlees commented 2 years ago

Great! Glad it worked.