AdmiralenOla / Scoary

Pan-genome wide association studies
GNU General Public License v3.0

Maximum number of genomes tested? #53

Closed dutchscientist closed 7 years ago

dutchscientist commented 7 years ago

I am doing a Scoary test with a 5,829-genome Roary file (~250 MB) and a custom tree. It works fine in the beginning, but crashes (out of memory?) when storing the pairs. The server I use runs Ubuntu 14.04 LTS (Biolinux 8) with a 4-core Xeon processor (8 threads), 32 GB RAM and 32 GB swap.

Is there a maximum to the number of genomes for Scoary?

AdmiralenOla commented 7 years ago

There's no limit by design at least. Could be a memory issue. The largest data set I've run with was around 3,100 genomes, and that worked fine. Are you getting an error message?

dutchscientist commented 7 years ago

Just that Python (2.7) has crashed. If you want I can get the full message tonight.

It stops when counting the pairs, which is something I am not really interested in. Is it possible to instruct Scoary to calculate only the p-values and report those back?

AdmiralenOla commented 7 years ago

Not currently possible, but that would be a useful addition that I will put in the next version for sure.

A possible workaround: Set a low p-value as threshold and invoke just the Individual filtration measure. Scoary will only calculate pairwise comparisons for genes with naïve p-values lower than the threshold, potentially saving a lot of memory.
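As far as I recall the option names, that workaround would be invoked roughly like this (file names are placeholders, so adjust to your own data):

```shell
# Hypothetical invocation: filter on the individual (naive) p-value only
# ("-c I") with a strict cutoff ("-p"), so pairwise comparisons are only
# attempted for genes passing the threshold. "-n" supplies the custom tree.
scoary -g gene_presence_absence.csv \
       -t traits.csv \
       -n custom_tree.nwk \
       -p 1E-5 \
       -c I
```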

dutchscientist commented 7 years ago

Ah, that is an interesting suggestion, will try that out :+1:

Was also planning to make subsets of the data, to see where it crashes.

dutchscientist commented 7 years ago

Even with p=1E-50, still no joy. This is the error:

```
Storing results: ST45
Calculating max number of contrasting pairs for each nominally significant gene 100.00%
Traceback (most recent call last):
  File "/usr/local/bin/scoary", line 11, in <module>
    load_entry_point('scoary==1.6.9', 'console_scripts', 'scoary')()
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 244, in main
    delimiter=args.delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 813, in StoreResults
    num_threads, no_time, delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 920, in StoreTraitResult
    Threadresults = list(Threadresults)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
RuntimeError: maximum recursion depth exceeded while calling a Python object
```
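Worth noting: that final `RuntimeError` is Python's recursion guard tripping, not an out-of-memory kill as such. A possible workaround (untested against Scoary itself, so treat it as an assumption) is to raise the interpreter's recursion limit in a small wrapper before Scoary's tree recursion runs:

```python
import sys

# Python aborts recursion past a fixed depth (default ~1000 frames) to
# protect the C stack; a deep tree from thousands of genomes can exceed it.
def depth(n):
    """Toy recursive function standing in for a tree traversal."""
    return 0 if n == 0 else 1 + depth(n - 1)

# Under the default limit, depth(5000) would raise the same error as above
# (RuntimeError on Python 2.7, RecursionError on Python 3).
sys.setrecursionlimit(20000)
print(depth(5000))  # prints 5000
```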

dutchscientist commented 7 years ago

Gone down to 1E-200 and let Scoary make its own tree, then it works. Now trying to go via 1E-100 until it breaks again.

dutchscientist commented 7 years ago

OK, it is a memory issue. If I let Scoary make the tree, I can get the 250 MB file analysed down to 1E-10. I then used a file double the size (same settings, but no paralog clustering in Roary), and that one crashes at 1E-100; when I check the memory, both the 32 GB RAM and the swap are full.

AdmiralenOla commented 7 years ago

So to sum up, there are at least three things for me to do here:

  1. Implementing a "summary statistics only" mode that skips the pairwise comparisons algorithm.
  2. Rewriting the code to be less memory-intensive. Currently a lot of metrics are stored in memory and only written to file at the end of the analysis. This could probably be improved by writing to temporary files, destroying objects when they are no longer needed, etc.
  3. Investigating why letting Scoary make the tree has an impact on memory consumption. I have no clue why that matters.

I hope to be done with 1 fairly quickly, but 2 & 3 might take a bit longer (several months).
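For point 2, the streaming idea could look something like this minimal sketch (function and column names are made up for illustration, not Scoary's actual internals):

```python
import csv

def stream_results(results_iter, outfile):
    """Write one (gene, p_value) row at a time instead of collecting
    every result in memory and dumping the whole list at the end."""
    with open(outfile, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Gene", "Naive_p"])
        for gene, p in results_iter:
            writer.writerow([gene, p])  # each row leaves memory once written

# Feeding it a generator means only one result exists in memory at a time:
results = ((f"gene_{i}", 0.05 / (i + 1)) for i in range(1000))
stream_results(results, "scoary_naive_p.csv")
```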

dutchscientist commented 7 years ago

Hi Ola,

Don't worry too much about it! I thought it would be fun to push Roary and Scoary a bit with a very large dataset, but I'm not sure whether people will really use such datasets, or, if they do, whether they have a lot more power than my home setup.

I am using this as a testing ground, but will probably make the set smaller by using representatives of the groups and by making smaller subgroups. The -r/-w options of Scoary are great for that, as they produce a smaller Roary set rather than having to rerun Roary every time. :)
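For anyone finding this later, a sketch of that subsetting approach (isolate names are placeholders, and my reading of the options is that -r takes a comma-separated list of isolates to keep while -w writes the corresponding reduced gene presence/absence file):

```shell
# Analyse only the listed isolates and write a reduced Roary file that
# can be reused in later runs instead of rerunning Roary each time.
scoary -g gene_presence_absence.csv \
       -t traits.csv \
       -r isolate_01,isolate_02,isolate_03 \
       -w
```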

AdmiralenOla commented 7 years ago

The --no_pairwise option is now implemented in the latest version (1.6.11). This is a solution to problem 1 referenced above. I will still have to fix the maximum recursion depth problem, but I'm moving that to a separate issue.

dutchscientist commented 7 years ago

Thanks!

dutchscientist commented 7 years ago

Just a comment: the explanatory text has not been updated to include the option. I'm about to try it!

AdmiralenOla commented 7 years ago

You mean in the Readme? Yeah, that still has the help text for a previous version (1.6.10). But in the actual script the explanatory text (as seen using -h) should be included.

dutchscientist commented 7 years ago

Yes, that's right, I meant the website.

I'm very happy with the new version; it does exactly what I want (identification of the differentially represented genes), even with the large dataset, and it is very quick now :)

AdmiralenOla commented 7 years ago

Thanks for submitting an issue and for your very useful suggestion! :-)

jambler24 commented 3 years ago

Also came across this issue: 700 genomes, but a large traits file.

Running on the cluster:

```
slurmstepd: error: Detected 1 oom-kill event(s) in step 1088755.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```