Closed willaguiar closed 1 year ago
What's your run directory again?
/g/data/v45/wf4500/panan0025/panana0025_ryf_26052023/mom6-panan/
mom6.err:
Finished initializing diag_manager at 20230605 105958.403
Starting to initialize tracer_manager at 20230605 105959.543
Finished initializing tracer_manager at 20230605 105959.694
Beginning to initialize component models at 20230605 105959.695
Starting to initialize atmospheric model at 20230605 105959.730
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
....
--------------------------------------------------------------------------
mpirun noticed that process rank 8940 with PID 703124 on node gadi-cpu-spr-0343 exited on signal 9 (Killed).
--------------------------------------------------------------------------
It doesn't seem to be giving any useful kind of error, but it's crashing repeatably during initialisation. Any ideas @angus-g?
@willaguiar's first theory was correct: it's hitting the memory limit. I'm not sure why it would suddenly do that without any external changes though?
I checked the logs for the latest successful runs, and they seem to have similar memory usage. From panan_0025.o86213121:
Memory Requested: 46.87TB Memory Used: 40.35TB
Walltime requested: 10:00:00 Walltime Used: 06:29:47
JobFS requested: 100.0GB JobFS used: 12.32MB
This might be useful info: could the memory limits of normalsr have been downgraded? I haven't changed any inputs or configs other than the number of ncpus used for collating the output (increased from 4 to 8).
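One quick way to check whether per-run memory use has actually crept up is to pull the PBS epilogue lines out of the old job outputs and compare them. A minimal sketch, assuming the panan_0025.o* outputs sit in the current directory and keep the "Memory Requested ... Memory Used ..." epilogue format quoted above (adjust the glob to wherever the job outputs actually live):

import glob
import re

# Matches the PBS epilogue line, e.g. "Memory Requested: 46.87TB Memory Used: 40.35TB"
pattern = re.compile(r"Memory Requested:\s*(\S+)\s+Memory Used:\s*(\S+)")

for path in sorted(glob.glob("panan_0025.o*")):
    with open(path, errors="ignore") as f:
        for line in f:
            match = pattern.search(line)
            if match:
                requested, used = match.groups()
                print(f"{path}: requested {requested}, used {used}")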
could the memory limits of normalsr have been downgraded
Not really, no. It's somewhat close to the limit, but it could be that one node is unbalanced vs. the others and it just bumps over...
So what's the solution here? Do we try harder to figure out what's causing the memory increase? Or do we need to come up with a new layout? Or?
~I'm not able to reproduce this, it might be a transient Gadi error, since~ Wilton is able to get past this stage (although with a different random crash).
Spoke too soon, I did reproduce the initial error, but with a traceback: https://gist.github.com/angus-g/bda8bd0f8e75dd642f233a7713da4e83
My second crash (code 135) seemed to be a random problem on Gadi: a lot of Gadi nodes were down for a few minutes and Gadi was offline. After it came back online, the run was killed without an error in mom6.out, but with code 135 in the PBS log.
I tried running again with an extra 5 GB of memory, and the model now finished initialization and a first timestep.
Ok, the simulation ran, but it was extremely slow and exceeded the walltime before finishing - ugh (panan_0025.o86502497). Trying to re-run gives the same 137 error, with memory usage reaching the limit (even with the extra 5 GB now).
Should we look for a way to request a higher memory limit, or should we try to use more cores for the run?
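Since the out-of-memory kill happens on whichever node fills up first, the trade-off between asking for more memory and using more cores comes down to per-node headroom rather than the job total. A rough back-of-the-envelope sketch; the normalsr node size/core count and the job's ncpus below are assumptions, so substitute the real values from the run config and the Gadi queue documentation:

# All numbers below are assumptions/illustrative, not taken from the run config.
NODE_MEM_GB = 512    # assumed memory per normalsr node
NODE_CORES = 104     # assumed cores per normalsr node

ncpus = 9984              # illustrative total core count for the job
mem_requested_tb = 46.87  # from the PBS summary above
mem_used_tb = 40.93       # from the failing run's PBS summary

nodes = ncpus / NODE_CORES
req_per_node_gb = mem_requested_tb * 1024 / nodes
used_per_node_gb = mem_used_tb * 1024 / nodes

print(f"~{nodes:.0f} nodes: ~{req_per_node_gb:.0f} GB requested, "
      f"~{used_per_node_gb:.0f} GB used per node on average "
      f"(assumed capacity {NODE_MEM_GB} GB)")

# If the average use per node is already near the node capacity, a single
# unbalanced node can tip over the limit even though the job total looks fine.
# More cores (more nodes) spreads the same total memory thinner; a bigger
# memory request per node is capped by the hardware.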
Update: I created a new simulation from the GitHub repo and tried to run it for 1 day from the initial conditions (dir = /home/156/wf4500/v45_wf4500/panan0025/panan_fresh_Start_test/mom6-panan). The model crashed with the same error. The error again appears right after starting the atmospheric model (mom6.err):
Beginning to initialize component models at 20230606 120108.960
Starting to initialize atmospheric model at 20230606 120109.010
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
So maybe there is something inconsistent/changed in the repo, or in the input files, or something different on the Sapphire Rapids nodes? @adele157 @angus-g
something inconsistent/changed in the repo
nope, looks like everything is the same between the successful and failing runs
input files
nope, no changes in the manifest.yaml
something different on sapphire nodes
Seems the most likely: something system-related. Might be worth raising a ticket with NCI and saying that you're seeing jobs killed due to exceeding memory usage that were otherwise fine before.
Hello everyone.
I was running the panan 1/40th fine for the first few months. Now the run starts but crashes at the beginning.
payu: Model exited with error code 137; aborting.
Checking the PBS output, it shows that memory usage is close to the limit:
Memory Requested: 46.87TB Memory Used: 40.93TB
This wasn't an issue in the first 5 months, and I have not changed anything in the configs since then. Any idea what it could be / how to fix it?
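As a side note on the error codes in this thread: when the launcher reports a status above 128, it usually means the process was killed by signal (status - 128), so 137 corresponds to SIGKILL (9), which is what memory-limit enforcement sends, and 135 corresponds to SIGBUS (7). A small sketch to decode these, assuming Linux signal numbering:

import signal

# Shell convention: an exit status above 128 means the process died from
# signal (status - 128).
def decode(status: int) -> str:
    if status > 128:
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            name = f"signal {status - 128}"
        return f"exit {status}: killed by {name}"
    return f"exit {status}: plain exit code"

for status in (135, 137):
    print(decode(status))
# exit 135: killed by SIGBUS
# exit 137: killed by SIGKILL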