desihub / desitest

Testing coordination for DESI code
BSD 3-Clause "New" or "Revised" License

scron desisim daily update OOM failures on perlmutter #43

Closed: marcelo-alvarez closed this issue 1 year ago

marcelo-alvarez commented 2 years ago

The scrontab job that runs the daily test fails at the desisim step with an OOM error, causing the entire job to crash before completing the subsequent updates. The likely reason is that memory is limited to 4 GB on the login nodes where scrontab jobs run. The CPU limit (and probably the corresponding memory limit) is set by the "cron" qos. Thanks @dmargala for helping investigate this.
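
For reference, a minimal sketch (not part of desitest) of one way to check the memory cap a job actually sees, by reading the cgroup limit from inside it; the cgroup paths below are the standard v2/v1 locations and may differ on Perlmutter:

from pathlib import Path

def cgroup_memory_limit_gib():
    """Return the cgroup memory limit in GiB, or None if unlimited/not found."""
    # Standard cgroup v2 and v1 locations; actual paths on a given system may differ.
    for path in ("/sys/fs/cgroup/memory.max",
                 "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        f = Path(path)
        if f.exists():
            value = f.read_text().strip()
            if value != "max":
                return int(value) / 2**30
    return None

print("cgroup memory limit (GiB):", cgroup_memory_limit_gib())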

Some possible solutions:

  1. don't do desisim unit tests on perlmutter
  2. modify desisim unit tests to use < 4 GB
  3. increase the scrontab memory limit, if possible

I have implemented (1) above already at

/global/common/software/desi/users/desi/perlmutter/code/desitest/main/py/desitest/nersc.py

i.e.:

-            #- desisim-testdata & redrock-templates: data only, no tests
-            if repo in ['desisim-testdata', 'redrock-templates']:
+            #- desisim-testdata, desisim & redrock-templates: data only, no tests
+            if repo in ['desisim-testdata', 'desisim', 'redrock-templates']:
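
For context, the kind of loop this change lives in, sketched with illustrative stand-ins; the repo list and helper names below are assumptions, not the actual desitest.nersc.update code:

def update_repo(repo):
    print(f"updating {repo}")                  # stand-in for git pull / reinstall

def run_unit_tests(repo):
    print(f"running unit tests for {repo}")    # stand-in for python setup.py test

repos = ['desiutil', 'specter', 'desisim-testdata', 'desisim', 'redrock-templates']
for repo in repos:
    update_repo(repo)
    #- desisim-testdata, desisim & redrock-templates: data only, no tests
    if repo in ['desisim-testdata', 'desisim', 'redrock-templates']:
        continue
    run_unit_tests(repo)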

@sbailey please have a look and confirm that this solution makes sense for daily updates going forward (it would have the added benefit of speeding up the updates). If so, I can merge this change into main.

dmargala commented 2 years ago

Another possible solution would be to submit a request for the "workflow" scrontab queue:

https://docs.nersc.gov/jobs/workflow/workflow-queue/

Jobs in the workflow queue may request a walltime of up to 90 days and up to one quarter of the resources (CPU and/or memory) of a Perlmutter login node.

That would get you up to 128 GB of memory (one quarter of a 512 GB login node).

sbailey commented 2 years ago

@marcelo-alvarez let's find a way to run these tests on Perlmutter instead of turning them off. For the short term, let's try the workflow queue with 8 GB, and then file a ticket with desisim to track down which test(s) need so much memory and update them to use less.
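
A hedged sketch of how such a desisim ticket could start narrowing things down, reporting the peak Python-level allocation of each test; this is illustrative only (the test directory path is an assumption), not how the tests were actually profiled:

import tracemalloc
import unittest

class MemoryTrackingResult(unittest.TextTestResult):
    """Report the peak Python heap allocation of each test as it finishes."""

    def startTest(self, test):
        super().startTest(test)
        tracemalloc.start()

    def stopTest(self, test):
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{test.id()}: peak {peak / 2**20:.1f} MiB (Python allocations only)")
        super().stopTest(test)

if __name__ == "__main__":
    # Test location is an assumption; adjust to wherever desisim keeps its tests.
    suite = unittest.TestLoader().discover("py/desisim/test")
    unittest.TextTestRunner(resultclass=MemoryTrackingResult, verbosity=0).run(suite)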

marcelo-alvarez commented 1 year ago

The desisim unit tests are still producing OOM crashes in scrontab jobs, even with the workflow qos specified. Details below.

I obtained access to the workflow scrontab qos on perlmutter and scheduled the following scrontab to run as a test:

#
#- Scron job to run daily integration tests on perlmutter
#
#SCRON -A desi
#SCRON -q workflow
#SCRON -t 00:60:00
#SCRON -o ___/output-%j.out
#SCRON --open-mode=append
#SCRON --mail-type=ALL
#SCRON --mail-user=___
#SCRON --cpus-per-task=4
25 21 * * * /bin/bash -lc "source /global/common/software/desi/users/desi/perlmutter/code/desitest/main/etc/cron_dailyupdate.sh"

I believe the combination of setting -q workflow and --cpus-per-task=4 should have ensured there were 8 GB of memory available. However, it still crashes at the desisim unit test step with OOM errors, as indicated in an email message from root@nersc.gov.

Running

source /global/common/software/desi/users/desi/perlmutter/code/desitest/main/etc/cron_dailyupdate.sh

on a perlmutter login node as the desi user succeeds, and profiling the individual unit test commands for each update with Arm Performance Reports (i.e. with modified commands

'perf-report -o $SCRATCH/dailyupdatetest/{}.txt python setup.py test'.format(repo)

in desitest.nersc.update for repo in desiutil, specter, etc.) indicates desisim uses only ~3 GB of memory (see the 'profiling summary' text copied below), which is less than the 8 GB that should have been available for the scrontab job with -q workflow and --cpus-per-task=4 set.
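
For comparison, a minimal perf-report-free sketch of recording the peak resident memory of a single repo's test run, using getrusage on the child processes; the path and command below are illustrative:

import resource
import subprocess

def peak_rss_gib(cmd, cwd):
    """Run cmd in cwd and return the peak RSS among its child processes, in GiB."""
    subprocess.run(cmd, shell=True, cwd=cwd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss / 2**20   # on Linux ru_maxrss is reported in KiB

print(f"desisim peak RSS: {peak_rss_gib('python setup.py test', 'desisim/main'):.2f} GiB")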

From these two tests above (the same script run either from a scrontab with -q workflow and --cpus-per-task=4 set, or sourced on a login node), it's not clear that this is just a memory limit issue. It is possible that something specific to running the desisim unit tests in a scrontab job would cause this even if there were no memory limit at all. I have also found that OOM errors occur even when the scrontab runs only cd .../desisim/main; python setup.py test.


profiling summary:

desiutil:     646   MiB
specter:      1.09  GiB
gpu_specter:  2.43  GiB
desimodel:    890   MiB
desitarget:   1.23  GiB
desispec:     1005  MiB
specsim:      131   MiB
desisim:      2.86  GiB
desisurvey:   790   MiB
surveysim:    565   MiB
redrock:      448   MiB

marcelo-alvarez commented 1 year ago

After an overhaul of how desitest is run nightly on Perlmutter, the desisim tests now run there without memory failures, so this issue has been resolved.