bacpop / ggCaller

Bifrost graph gene caller.
MIT License

Problem with larger dataset? #4

Closed pangenomics closed 1 year ago

pangenomics commented 1 year ago

Hi Sam and everyone,

Great tool! I really enjoyed reading the pre-preprint; so many good ideas went into this. I had a few issues, but installing from source solved almost all of them. One remains: I ran ggcaller on ca. 550 genomes (it works with a subset of 20) and consistently hit the following error. Is there anything I can do to solve this?

Traceback (most recent call last):
  File "/home/dmende/miniconda3/envs/ggc_env/bin/ggcaller", line 33, in <module>
    sys.exit(load_entry_point('ggCaller==1.3.3', 'console_scripts', 'ggcaller')())
  File "/home/dmende/miniconda3/envs/ggc_env/lib/python3.9/site-packages/ggCaller-1.3.3-py3.9-linux-x86_64.egg/ggCaller/__main__.py", line 411, in main
    graph_tuple = graph.build(options.refs, options.kmer, stop_codons_for, stop_codons_rev, start_codons_for,
IndexError: vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 1089331)
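
A side note on the index in that message: 18446744073709551615 is exactly 2^64 - 1, which is the value -1 takes when stored as an unsigned 64-bit integer. The quick check below only confirms the arithmetic. Reading it as a "not found" or failed-lookup value being used as a vector index is an assumption, not something the thread establishes.

```python
# The out-of-range index reported in the IndexError message.
reported_index = 18446744073709551615

# Largest value an unsigned 64-bit integer can hold.
max_u64 = 2**64 - 1

# -1 reinterpreted as an unsigned 64-bit integer (wrap-around modulo 2**64).
minus_one_as_u64 = -1 % 2**64

print(reported_index == max_u64)
print(reported_index == minus_one_as_u64)
```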

samhorsfield96 commented 1 year ago

Hi, this may be an issue with memory allocation. ggCaller memory usage scales with graph complexity, and so a highly diverse dataset may cause issues if memory is limited. Could you try running with half of the dataset and seeing what happens?

pangenomics commented 1 year ago

Hi Sam, I will do that and let you know. It might take a few days though.

pangenomics commented 1 year ago

Hi Sam, I ran it on a few datasets. It seems that ggcaller starts giving me the above error somewhere around 200 genomes (each around 5 Mbp). Sets of 100 genomes always work, some sets of 200 genomes fail, and sets of 300 genomes always fail. I checked the memory usage and that does indeed seem to be the problem. I'll see if I can find a machine with a bit more RAM to test more.
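
For anyone wanting to repeat this kind of test, here is a minimal sketch of subsampling a genome list and recording the peak memory of a child process. It is not part of ggCaller: the helper names `sample_lines` and `run_and_get_peak_rss` are made up for illustration, and it assumes a Unix system where Python's `resource` module is available.

```python
import random
import resource
import subprocess
import sys


def sample_lines(lines, n, seed=0):
    """Return n items chosen at random without replacement (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(list(lines), n)


def run_and_get_peak_rss(cmd):
    """Run cmd (a list of arguments) and return (exit code, peak RSS of children).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    RUSAGE_CHILDREN covers every child already waited for, so use a fresh
    Python process for each measurement to avoid mixing runs.
    """
    proc = subprocess.run(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak
```

One way to use it would be to write the output of `sample_lines` (say 100, 200 and 300 paths) to a new list file, pass the ggcaller command for that file to `run_and_get_peak_rss` as a list of arguments, and compare the reported peak against the RAM available on the machine. The exact ggcaller flags are deliberately left out here.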