faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

phasing workflow: insufficient memory for Java Runtime Environment to continue #293

Closed jbernst closed 1 year ago

jbernst commented 1 year ago

Hi all,

I know this issue (or at least a similar one) has been discussed in issue #222, but I decided to make a separate post since my error messages are a bit different.

I am attempting to phase some UCEs (5k probe set) in Phyluce, following the Workflows tutorial. I have successfully performed the Mapping part of the workflow.

At the Phasing step, I am running into error messages related to available memory. I am using our cluster, with nodes that have 182 GB of RAM and 28 cores available. This is while trying to run only 5-10 samples at a time, and a single specimen can take hours to run. I didn't get the same error message as in issue #222; instead, I got this at the end of the hs_err_pid#.log file:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 9110028288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2877), pid=257889, tid=257901
#
# JRE version: OpenJDK Runtime Environment (11.0.8) (build 11.0.8-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.8-internal+0-adhoc..src, mixed mode, tiered, g1 gc, linux-amd64)
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

In the .err files from my SLURM outputs, I see (across a few different job submissions) Error in rule pilon_allele_0: or Error in rule pilon_allele_1:, along with OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f83d1000000, 6660554752, 0) failed; error='Not enough space' (errno=12).

I checked, and my default memory is set to default_jvm_mem_opts = ['-Xms512m', '-Xmx100g'] in the pilon file. This seems to have worked for others, but my jobs are still getting stuck an hour or two in. I ran a single individual and it took a really long time, but it worked. Based on the .log file, am I using too much memory or too many cores?
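
For reference, this is roughly how I located and checked that setting; on my system the pilon command is a conda wrapper script that contains the default_jvm_mem_opts line (paths and wrapper contents may differ on other installs):

# with the phyluce conda environment activated
which pilon
# show the memory-related lines in the wrapper
grep -n "jvm_mem_opts" "$(which pilon)"
# the line I edited now reads:
# default_jvm_mem_opts = ['-Xms512m', '-Xmx100g']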

One last question: if some samples ran to completion but others did not, can I consider the completed ones good and just rerun the ones that failed? I saw it mentioned in the other issue that pilon can pick up where it left off?

Let me know if there is any other information I should provide. Thank you!

Best, Justin

brantfaircloth commented 1 year ago

The problem is caused by too little RAM, as indicated in the error messages. You can increase the RAM for each individual you are trying to run, or you could consider downsampling the reads for each individual. Also be sure you are not running multiple individuals on a single node, which would greatly reduce the RAM available to each.
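
To sketch what I mean by downsampling (using seqtk here, which is just one option and is not part of phyluce; any read-subsampling tool works), something like:

# keep, e.g., 2 million read pairs per individual; use the same seed for R1 and R2
seqtk sample -s100 sample1-READ1.fastq.gz 2000000 | gzip > sample1-sub-READ1.fastq.gz
seqtk sample -s100 sample1-READ2.fastq.gz 2000000 | gzip > sample1-sub-READ2.fastq.gz

On the scheduler side, adding #SBATCH --exclusive to the job script is one way to guarantee a node to yourself, depending on how your HPC is configured.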

jbernst commented 1 year ago

Thank you! Would increasing the RAM for each individual be done by decreasing the number of individuals in my .conf file? Currently I have anywhere between 10 and 20 in each file, though some runs with as few as 4 samples have failed too.

For the sake of my own knowledge, are individuals run at the same time? I thought that each sample was run one after the other, so all of the memory would be allocated to each sample (i.e., once one finishes, the next sample is then run).

brantfaircloth commented 1 year ago

It should be running one sample at a time, but you could also have set up the SLURM script in such a way that multiple jobs are being run. You could potentially log in to the node the job is running on (once the job starts) and take a look at what's happening on that particular node; it may help to decipher what's going on. Also, ensure that your jobs are the only jobs running on any particular node (again, your HPC setup and SLURM script determine this).
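
Roughly something along these lines, although the exact commands depend on your site (some clusters allow ssh to compute nodes, others require attaching through srun):

# find which node the job landed on
squeue -u $USER -o "%i %j %T %N"
# then either ssh to that node (if your site allows it) ...
ssh <nodename>
# ... or start a shell inside the existing job allocation
srun --jobid=<jobid> --pty bash
# and watch memory/CPU use while pilon runs
free -h
top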

jbernst commented 1 year ago

I have been requesting entire nonpre nodes so that my job is the only one running. Even with only two samples, I am getting the same OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3a42000000, 6593445888, 0) failed; error='Not enough space' (errno=12) warning when allocating 182 GB of RAM and 28 cores to a single job. Does 'Not enough space' refer to storage space, or is this the RAM required to run the job? And would I need to change default_jvm_mem_opts = ['-Xms512m', '-Xmx100g'] to a higher number?
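
If it helps with the diagnosis, I can also pull what the failed jobs requested versus what they actually used; I think something like this works on our SLURM setup (field names taken from the sacct man page):

# compare requested memory with peak memory used by each job step
sacct -j <jobid> --format=JobID,JobName%20,ReqMem,MaxRSS,Elapsed,State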

I am currently trying to see how to log into our compute nodes using SLURM (if you have tips, feel free to let me know).

Thank you again for your quick responses and all of the help you always give!

jbernst commented 1 year ago

I am going to close this. I got it to run by running 1-2 samples at a time. Some samples, even when run one at a time, failed not because of memory but because of the 72-hour time limit on the cluster. I will subsample the reads for those.

Cheers!