Another possible solution would be to submit a request for the "workflow" scrontab queue:
https://docs.nersc.gov/jobs/workflow/workflow-queue/
Jobs in the workflow queue may request a walltime of up to 90 days and up to one quarter of the resources (CPU and/or memory) of a Perlmutter login node.
That would get you up to 128 GB of memory.
@marcelo-alvarez let's find a way to run these tests on Perlmutter instead of turning them off. For the short term, let's try the workflow queue with 8 GB, and then file a ticket with desisim to track down what test(s) need so much memory and update them to use less memory.
The desisim unit tests are still producing OOM crashes in scrontab jobs, even with the workflow qos specified. Details below.
I obtained access to the workflow scrontab qos on Perlmutter and scheduled the following scrontab job to run as a test:
#
#- Scron job to run daily integration tests on perlmutter
#
#SCRON -A desi
#SCRON -q workflow
#SCRON -t 00:60:00
#SCRON -o ___/output-%j.out
#SCRON --open-mode=append
#SCRON --mail-type=ALL
#SCRON --mail-user=___
#SCRON --cpus-per-task=4
25 21 * * * /bin/bash -lc "source /global/common/software/desi/users/desi/perlmutter/code/desitest/main/etc/cron_dailyupdate.sh"
I believe the combination of setting -q workflow and --cpus-per-task=4 should have ensured there were 8 GB of memory available. However, it still crashes at the desisim unit test step with OOM errors, as indicated in an email message from root@nersc.gov.
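One way to remove any ambiguity about how much memory the scrontab job is actually granted would be to log the cgroup memory limit at the start of the job. The snippet below is a minimal sketch of that idea, assuming the standard cgroup v1/v2 filesystem layouts; it is a diagnostic suggestion, not part of the existing desitest code:

```python
import os

def cgroup_memory_limit_bytes():
    """Return the memory limit (bytes) of this process's cgroup, or None if unlimited/not found."""
    with open("/proc/self/cgroup") as f:
        lines = f.read().splitlines()
    candidates = []
    for line in lines:
        _, controllers, path = line.split(":", 2)
        if controllers == "":
            # cgroup v2 unified hierarchy
            candidates.append("/sys/fs/cgroup" + path + "/memory.max")
        elif "memory" in controllers.split(","):
            # cgroup v1 memory controller
            candidates.append("/sys/fs/cgroup/memory" + path + "/memory.limit_in_bytes")
    for candidate in candidates:
        if os.path.exists(candidate):
            value = open(candidate).read().strip()
            return None if value == "max" else int(value)
    return None

if __name__ == "__main__":
    limit = cgroup_memory_limit_bytes()
    if limit is None:
        print("no cgroup memory limit found")
    else:
        print("cgroup memory limit: {:.1f} GiB".format(limit / 2**30))
```

Printing this at the top of cron_dailyupdate.sh would show directly whether the workflow qos request translated into the expected ~8 GB limit.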
Running
source /global/common/software/desi/users/desi/perlmutter/code/desitest/main/etc/cron_dailyupdate.sh
on a Perlmutter login node as the desi user succeeds. Profiling the individual unit test command for each update with Arm (i.e. with the modified command 'perf-report -o $SCRATCH/dailyupdatetest/{}.txt python setup.py test'.format(repo) in desitest.nersc.update, for repo in desiutil, specter, etc.) indicates desisim uses only ~3 GB of memory (see the 'profiling summary' text copied below), which is less than the 8 GB that should have been available for the scrontab job with -q workflow and --cpus-per-task=4 set.
From the two tests above (the same script run either from a scrontab job with -q workflow and --cpus-per-task=4 set, or sourced from a login node), it is not clear that this is just a memory limit issue. It is possible that it is something specific to running the desisim unit tests in a scrontab job that would happen even if there were no memory limit at all. I have also found that OOM errors occur even when the scrontab runs only cd .../desisim/main; python setup.py test.
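To help separate a genuine memory-limit kill from something specific to the scrontab environment, one could log the peak resident set size reached by the desisim tests from inside the scrontab job itself and compare it against the OOM reports. The snippet below is a sketch of that idea (the desisim working directory is a placeholder); it is not something already in the daily update script:

```python
import resource
import subprocess
import sys

# Placeholder path; the real checkout lives under the desitest code area.
result = subprocess.run([sys.executable, "setup.py", "test"],
                        cwd="desisim", check=False)

# ru_maxrss is reported in kilobytes on Linux; RUSAGE_CHILDREN covers the
# waited-for subprocess launched above.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print("desisim tests exited with {}, peak RSS ~{:.2f} GiB".format(
    result.returncode, peak_kb / 2**20))
```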
profiling summary:
desiutil: 646 MiB
specter: 1.09 GiB
gpu_specter: 2.43 GiB
desimodel: 890 MiB
desitarget: 1.23 GiB
desispec: 1005 MiB
specsim: 131 MiB
desisim: 2.86 GiB
desisurvey: 790 MiB
surveysim: 565 MiB
redrock: 448 MiB
After an overhaul of how desitest is run nightly on Perlmutter, the desisim tests now run without memory failures, so this issue has been resolved.
The scrontab job that runs the daily test fails at the desisim step with an OOM error, causing the entire job to crash before the subsequent updates complete. The likely reason is that memory is limited to 4 GB on the login nodes where scrontab jobs run. The CPU limit (and probably the corresponding memory limit) is set by the "cron" qos. Thanks @dmargala for helping investigate this.
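If it helps to confirm what the "cron" and "workflow" qos definitions actually allow, the limits can be queried with Slurm's sacctmgr. The sketch below shows one way to do that; the exact format field names can vary between Slurm versions, so treat it as a starting point rather than a verified command:

```python
import subprocess

for qos in ("cron", "workflow"):
    # Parsable, headerless output keeps the result easy to log from a script.
    out = subprocess.run(
        ["sacctmgr", "--noheader", "--parsable2", "show", "qos", qos,
         "format=Name,MaxTRESPU,MaxWall"],
        capture_output=True, text=True, check=False)
    print(qos, "->", out.stdout.strip() or out.stderr.strip())
```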
Some possible solutions:
I have implemented (1) above already at
i.e.:
@sbailey please have a look and confirm that this solution makes sense for daily updates going forward (it would have the added benefit of speeding up the updates). If so, I can merge this change into main.