COSIMA / mom6-panan

Pan-Antarctic regional configuration of MOM6

Panan 1/40th crashing during initialisation #47

Closed willaguiar closed 1 year ago

willaguiar commented 1 year ago

Hello everyone.

I was running the panan 1/40th fine for the first few months. Now the run starts but crashes right at the beginning.

payu: Model exited with error code 137; aborting.

Checking the PBS output, it shows that memory usage is close to the limit:

Memory Requested: 46.87TB Memory Used: 40.93TB

This wasn't an issue in the first 5 months, and I haven't changed anything in the configs since then. Any idea what it could be / how to fix it?
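
(For context: exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL, which on a PBS system like Gadi usually points at the job hitting its memory limit. Below is a minimal sketch of how one might confirm this from the PBS stdout files; the glob pattern is an assumption based on the job name that appears later in this thread, not something payu guarantees.)

```python
# Sketch: decode a payu/PBS exit code and pull the memory lines out of the
# PBS stdout files. The "panan_0025.o*" pattern is illustrative only.
import glob
import re

exit_code = 137
if exit_code > 128:
    # 137 - 128 = 9 -> SIGKILL, consistent with being killed for exceeding memory
    print(f"killed by signal {exit_code - 128}")

for path in sorted(glob.glob("panan_0025.o*")):
    text = open(path).read()
    req = re.search(r"Memory Requested:\s*(\S+)", text)
    used = re.search(r"Memory Used:\s*(\S+)", text)
    if req and used:
        print(f"{path}: used {used.group(1)} of {req.group(1)} requested")
```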

adele-morrison commented 1 year ago

What's your run directory again?

willaguiar commented 1 year ago

> What's your run directory again?

/g/data/v45/wf4500/panan0025/panana0025_ryf_26052023/mom6-panan/

willaguiar commented 1 year ago

mom6.err:

Finished initializing diag_manager at 20230605 105958.403
 Starting to initialize tracer_manager at 20230605 105959.543
 Finished initializing tracer_manager at 20230605 105959.694
 Beginning to initialize component models at 20230605 105959.695
 Starting to initialize atmospheric model at 20230605 105959.730
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
....

--------------------------------------------------------------------------
mpirun noticed that process rank 8940 with PID 703124 on node gadi-cpu-spr-0343 exited on signal 9 (Killed).
--------------------------------------------------------------------------
adele-morrison commented 1 year ago

It doesn't seem to be giving any useful kind of error, but it's crashing repeatably during initialisation. Any ideas @angus-g?

angus-g commented 1 year ago

@willaguiar's first theory was correct: it's hitting the memory limit. I'm not sure why it would suddenly do that without any external changes though?

willaguiar commented 1 year ago

I checked the logs for the latest successful runs, and they seem to have similar memory usage. For example, in panan_0025.o86213121:

   Memory Requested:   46.87TB               Memory Used: 40.35TB         
   Walltime requested: 10:00:00            Walltime Used: 06:29:47        
   JobFS requested:    100.0GB                JobFS used: 12.32MB 

This might be useful info: could the memory limits of normalsr have been downgraded? I haven't changed any inputs or configs other than the number of ncpus used for collating the output (increased from 4 to 8).
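
(Aside: the collation change mentioned above lives in payu's `config.yaml`. A hedged sketch of that block is below; the key names follow payu's collate options, but the values are placeholders rather than the actual panan settings. Since payu normally submits collation as its own PBS job, raising its ncpus shouldn't change the memory footprint of the model job itself.)

```yaml
# Illustrative payu config.yaml fragment (placeholder values).
# Collation runs as a separate PBS job, so these resources are independent
# of the main model job's memory request.
collate:
  ncpus: 8          # raised from 4 to speed up output collation
  mem: 16GB
  walltime: 02:00:00
```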

angus-g commented 1 year ago

> could the memory limits of normalsr have been downgraded?

Not really, no. It's somewhat close to the limit, but it could be that one node is unbalanced vs. the others and it just bumps over...

adele-morrison commented 1 year ago

So what's the solution here? Do we try harder to figure out what's causing the memory increase? Or do we need to come up with a new layout? Or?


angus-g commented 1 year ago

~I'm not able to reproduce this, it might be a transient Gadi error, since~ Wilton is able to get past this stage (although with a different random crash).

Spoke too soon, I did reproduce the initial error, but with a traceback: https://gist.github.com/angus-g/bda8bd0f8e75dd642f233a7713da4e83

willaguiar commented 1 year ago

> ~I'm not able to reproduce this, it might be a transient Gadi error, since~ Wilton is able to get past this stage (although with a different random crash).
>
> Spoke too soon, I did reproduce the initial error, but with a traceback: https://gist.github.com/angus-g/bda8bd0f8e75dd642f233a7713da4e83

My second crash (code 135) seemed to be a random problem on Gadi: a lot of Gadi nodes went down for a few minutes and Gadi was offline; after it came back online the run was killed without a mom6.out error, but with code 135 in the PBS log.

I tried running again with an extra 5 GB of memory, and the model has now finished initialization and a first timestep.
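
(For anyone retracing this: the memory request is set through payu's PBS resource keys in `config.yaml`. A minimal sketch is below; the values are placeholders rather than the actual panan 1/40th request, and the real numbers have to stay consistent with the model layout and the queue limits.)

```yaml
# Illustrative PBS resource request in payu's config.yaml (placeholder values).
queue: normalsr
ncpus: 10000        # must match the model's processor layout
mem: 48000GB        # total job memory; this is what "Memory Requested" reports
walltime: 10:00:00
```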

willaguiar commented 1 year ago

Ok, the simulation ran, but it was extremely slow and exceeded the walltime before finishing - ugh (panan_0025.o86502497). Trying to re-run gives the same 137 error, with memory usage reaching the limit (even with the extra 5 GB now).

Should we look for a way to request a higher memory limit, or should we try to use more cores for the run?
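
(If the more-cores route is taken: the processor decomposition is set in the MOM6 layout parameters and has to stay consistent with the ncpus payu requests. A rough sketch with placeholder numbers, not the real panan decomposition, is below.)

```
! Illustrative MOM_layout fragment (placeholder values). LAYOUT, IO_LAYOUT and
! MASKTABLE are standard MOM6 layout parameters; the counts here are not the
! panan ones.
LAYOUT = 120, 84                       ! NIPROC, NJPROC for the ocean decomposition
IO_LAYOUT = 1, 1                       ! I/O decomposition
MASKTABLE = "mask_table.XXX.120x84"    ! optional: eliminate land-only ranks
```

The ncpus in config.yaml would then need to equal the layout product minus any ranks removed by the mask table, and if the ice model's layout is set separately it has to be updated to match.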

willaguiar commented 1 year ago

Update: I created a new simulation from the GitHub repo and tried to run it for 1 day from the initial conditions (dir = /home/156/wf4500/v45_wf4500/panan0025/panan_fresh_Start_test/mom6-panan). The model crashed with the same error. The error again appears right after starting the atmospheric model (mom6.err):

Beginning to initialize component models at 20230606 120108.960
 Starting to initialize atmospheric model at 20230606 120109.010
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)

So maybe there is something inconsistent/changed in the repo, or input files, or something different on sapphire nodes? @adele157 @angus-g

angus-g commented 1 year ago

> something inconsistent/changed in the repo

nope, looks like everything is the same between the successful and failing runs

> input files

nope, no changes in the manifest.yaml

> something different on sapphire nodes

Seems the most likely: something system-related. It might be worth raising a ticket with NCI saying that you're seeing jobs killed for exceeding their memory limit that were otherwise fine before.