Open mdebiasse opened 3 years ago
build_lmer_table failed. Exit code 35072
slurmstepd: error: Detected 1 oom-kill event(s) in step 2092087.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
This one is a pretty clear-cut out-of-memory error, but you might not need to change any files to troubleshoot it (and we never did find out whether that attempted fix worked, anyway). RepeatModeler uses a sampling approach, so it is possible that you got an "unlucky" sequence sample in the first round that is more memory-intensive than normal. It is also possible that other jobs were using too much memory on the same compute node, and yours might have run just fine if it had not been competing with them for resources.
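As a side note for anyone reading later, the exit code itself is consistent with the out-of-memory diagnosis. The sketch below assumes that 35072 is a raw 16-bit wait status of the kind Perl's `system()` returns (RepeatModeler is a Perl script), and that the command was launched through a shell, which by convention reports "killed by signal N" as exit code 128 + N. The function name `decode_wait_status` is only illustrative and is not part of RepeatModeler.

```python
import signal


def signal_name(number: int) -> str:
    """Return a readable name for a signal number, falling back to the number."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


def decode_wait_status(status: int) -> dict:
    """Split a 16-bit wait status into its conventional parts.

    Assumption: `status` follows the POSIX wait-status layout
    (exit code in the high byte, terminating signal in the low 7 bits).
    """
    exit_code = status >> 8            # high byte: exit code of the child
    term_signal = status & 0x7F        # low 7 bits: signal that killed the direct child
    core_dumped = bool(status & 0x80)  # bit 7: core dump flag
    # Shell convention: an exit code of 128 + N means "died from signal N".
    shell_signal = exit_code - 128 if exit_code > 128 else 0
    return {
        "exit_code": exit_code,
        "term_signal": term_signal,
        "core_dumped": core_dumped,
        "shell_signal": shell_signal,
    }


info = decode_wait_status(35072)
print(info["exit_code"])                   # 137
print(info["shell_signal"])                # 9
print(signal_name(info["shell_signal"]))   # SIGKILL on Linux
```

Under those assumptions, 35072 decodes to exit code 137, i.e. 128 + 9, and signal 9 is SIGKILL, which is the signal the cgroup out-of-memory handler uses to kill a process.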
If you haven't already, I would first try running RepeatModeler again, possibly on a different compute node with more memory, or at a less busy time of day, if that's an option. If that doesn't work, or if you have already tried a few times and every run ended in an out-of-memory error, I'll look into workarounds for editing the scripts so you can try that approach next.
Thank you for the reply! I requested an exclusive node with 250G and I think this solved the problem. Unfortunately, the run timed out, but the program got past the point where it failed before. I just relaunched with a longer wall time.
Good morning, I am getting an error message with version 2.0.1 (full out file below):
build_lmer_table failed. Exit code 35072
I am running the program with Singularity. This post (https://github.com/Dfam-consortium/RepeatModeler/issues/27) suggests editing the RepeatModeler script to lower the sample size for RepeatScout, but as I understand it, I can't access the scripts outside of the environment, since the script builds the environment on the fly and there is no static image being deployed or recalled. Therefore, I'm not sure how to access the RepeatModeler script for editing. Any advice is appreciated!