coecms / access-esm

Main Repository for ACCESS-ESM configurations

pre-industrial configuration fails with segfault in 12 ranks #10

Open penguian opened 8 months ago

penguian commented 8 months ago

After cloning this repository to /g/data/tm70/pcl851/src/coecms/access-esm I ran the following commands, with the following output:

[pcl851@gadi-login-01 access-esm]$ git checkout -b pre-industrial-retest
Switched to a new branch 'pre-industrial-retest'
[snipped cleanup of directory contents]
[pcl851@gadi-login-01 access-esm]$ git status
On branch pre-industrial-retest
nothing to commit, working tree clean
[pcl851@gadi-login-01 access-esm]$ module use /g/data/hh5/public/modules/
[pcl851@gadi-login-01 access-esm]$ module use /g/data/access/modules
[pcl851@gadi-login-01 access-esm]$ module load um
[pcl851@gadi-login-01 access-esm]$ module load conda/analysis3-23.07
[pcl851@gadi-login-01 access-esm]$ payu --version
payu 1.0.19
[pcl851@gadi-login-01 access-esm]$ git remote -v
origin  https://github.com/coecms/access-esm (fetch)
origin  https://github.com/coecms/access-esm (push)
[pcl851@gadi-login-01 access-esm]$ payu init
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
[pcl851@gadi-login-01 access-esm]$ gvim config.yaml
[pcl851@gadi-login-01 access-esm]$ payu setup
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up coupler
Checking exe and input manifests
Updating full hashes for 3 files in manifests/exe.yaml
Creating restart manifest
Updating full hashes for 30 files in manifests/restart.yaml
Writing manifests/restart.yaml
Writing manifests/exe.yaml
[pcl851@gadi-login-01 access-esm]$ payu run -f
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=11400 -l ncpus=384 -l mem=1536GB -N pre-industrial -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/access/modules:/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run
108246795.gadi-pbs
[pcl851@gadi-login-01 access-esm]$ git log
commit 06d654c29219e079c7e073bddcb405799725c7ac (HEAD -> pre-industrial-retest)
Author: Paul Leopardi <paul.leopardi@anu.edu.au>
Date:   Thu Feb 15 12:11:38 2024 +1100

    2024-02-15 12:11:38: Run 0

commit 129f7542798bc7fd714872cf2e8212b4a708661c (origin/pre-industrial, pre-industrial)
Merge: 62aa1bf 75dce3f
Author: Holger Wolff <holger.wolff@monash.edu>
Date:   Tue Sep 12 15:55:05 2023 +1000

    Merge branch 'main' of github.com:coecms/access-esm into pre-industrial
[...]

The submitted job 108246795 fails.

[pcl851@gadi-login-01 access-esm]$ qstat -wax 108246795

gadi-pbs: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
108246795.gadi-pbs             pcl851          normal-exec     pre-industrial    189702    8   384  1536g 03:10 F 00:00:45
[pcl851@gadi-login-01 access-esm]$ qstat -wfx 108246795
Job Id: 108246795.gadi-pbs
    Job_Name = pre-industrial
    Job_Owner = pcl851@gadi-login-01.gadi.nci.org.au
    resources_used.cpupercent = 10508
    resources_used.cput = 01:08:01
    resources_used.mem = 167220884kb
    resources_used.ncpus = 384
    resources_used.vmem = 167220884kb
    resources_used.walltime = 00:00:45
    job_state = F
    queue = normal-exec
    server = gadi-pbs-01.gadi.nci.org.au
[...]
    comment = Job run at Thu Feb 15 at 12:11 on (gadi-cpu-clx-1997:ncpus=48:mem=201326592kb:jobfs=102400kb)+(gadi-cpu-clx-1998:ncpus=48:mem=201326592kb:jobfs=102400kb)+(gadi-cpu-clx-1999:ncpus=48:mem=201326592kb:jobfs=102400kb)+(gadi-cpu-clx-2000:ncpus=48:mem=201326... and failed
[...]
    Submit_arguments = -q normal -P tm70 -l walltime=11400 -l ncpus=384 -l mem=1536GB -N pre-industrial -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/access/modules:/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run
[...]

Specifically, access.err indicates that a segfault occurred on 12 of the MPI ranks:

[...]
[gadi-cpu-clx-1997:190436:0:190436] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1997:190452:0:190452] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1997:190418:0:190418] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181694:0:181694] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126161:0:1126161] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181711:0:181711] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178142:0:178142] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126176:0:1126176] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181727:0:181727] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178155:0:178155] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126190:0:1126190] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178165:0:178165] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------
mpirun noticed that process rank 16 with PID 0 on node gadi-cpu-clx-1997 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
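
As a rough cross-check of that count, here is a minimal Python sketch (not part of payu; the function name `segfaults_by_node` is made up for illustration) that tallies the distinct host:PID pairs reporting signal 11. The sample text is the 12 lines quoted above; for a real run the text would presumably be read from access.err instead.

```python
import re
from collections import Counter

# The 12 "Caught signal 11" lines copied from the access.err excerpt above.
# For a real run, replace SAMPLE with the contents of access.err.
SAMPLE = """\
[gadi-cpu-clx-1997:190436:0:190436] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1997:190452:0:190452] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1997:190418:0:190418] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181694:0:181694] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126161:0:1126161] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181711:0:181711] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178142:0:178142] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126176:0:1126176] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-2000:181727:0:181727] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178155:0:178155] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1999:1126190:0:1126190] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1998:178165:0:178165] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
"""

# Matches "[host:pid:n:n] Caught signal 11" at the start of a line.
PATTERN = re.compile(r"^\[([^:\]]+):(\d+):\d+:\d+\] Caught signal 11", re.MULTILINE)


def segfaults_by_node(text):
    """Count distinct PIDs that reported signal 11, grouped by host name."""
    seen = set(PATTERN.findall(text))  # distinct (host, pid) pairs
    return Counter(host for host, _pid in seen)


counts = segfaults_by_node(SAMPLE)
for host in sorted(counts):
    print(host, counts[host])
print("total", sum(counts.values()))
```

For the excerpt above this reports three processes on each of the four nodes listed, 12 in total. It says nothing about nodes that do not appear in the excerpt.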

penguian commented 8 months ago

I then tried Martin Dix's fix as per https://github.com/coecms/access-esm/commit/0f769ae4338005f8ed3c6ce6478e004842d4a598

[pcl851@gadi-login-01 access-esm]$ cp -a ../../MartinDix/access-esm/config.yaml .
[pcl851@gadi-login-01 access-esm]$ payu sweep
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
Moving log pre-industrial.e108246795
Moving log pre-industrial.o108246795
Removing work path /scratch/tm70/pcl851/access-esm/work/access-esm
Removing symlink /g/data/tm70/pcl851/src/coecms/access-esm/work
[pcl851@gadi-login-01 access-esm]$ payu setup
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up coupler
Checking exe and input manifests
Updating full hashes for 3 files in manifests/exe.yaml
File no longer in input directory: work/atmosphere/INPUT/pre-industrial.astart removing from manifest
Creating restart manifest
Updating full hashes for 51 files in manifests/restart.yaml
Writing manifests/input.yaml
Writing manifests/restart.yaml
Writing manifests/exe.yaml
[...]
[pcl851@gadi-login-01 access-esm]$ git diff
diff --git a/config.yaml b/config.yaml
index 85b1fb8..6d3935f 100644
--- a/config.yaml
+++ b/config.yaml
@@ -13,7 +13,6 @@ submodels:
       exe: /g/data/access/payu/access-esm/bin/coe/um7.3x
       input:
         - /g/data/access/payu/access-esm/input/pre-industrial/atmosphere
-        - /g/data/access/payu/access-esm/input/pre-industrial/start_dump

     - name: ocean
       model: mom
@@ -41,7 +40,7 @@ collate:
    restart: true
    mem: 4GB

-restart: /g/data/access/payu/access-esm/restart/pre-industrial
+restart: /g/data/vk83/experiments/inputs/access-esm1p5/pre-industrial/restart/

 calendar:
     start:
diff --git a/manifests/input.yaml b/manifests/input.yaml
index aeb3cc5..83e15b7 100644
--- a/manifests/input.yaml
+++ b/manifests/input.yaml
[...]
diff --git a/manifests/restart.yaml b/manifests/restart.yaml
index 01dc344..bd9bc55 100644
--- a/manifests/restart.yaml
+++ b/manifests/restart.yaml
[...]
[pcl851@gadi-login-01 access-esm]$ diff -rqb ../../MartinDix/access-esm .|grep -v ".git"|grep differ
Files ../../MartinDix/access-esm/atmosphere/__pycache__/um_env.cpython-310.pyc and ./atmosphere/__pycache__/um_env.cpython-310.pyc differ
[pcl851@gadi-login-01 access-esm]$ payu run -f
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=11400 -l ncpus=384 -l mem=1536GB -N pre-industrial -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/access/modules:/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70+gdata/vk83 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run
108250574.gadi-pbs

This job runs to completion.
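
One visible difference between the two submissions is the `-l storage=` request: the second qsub line also lists gdata/vk83, which matches the new restart path under /g/data/vk83. A small Python sketch to compare the two requests follows. The helper name `storage_mounts` is hypothetical, and the qsub lines are abbreviated here to the storage option only. It only shows what changed on the command line, not which change cured the segfault.

```python
import re


def storage_mounts(qsub_line):
    """Return the set of entries in the '-l storage=' option of a qsub line."""
    match = re.search(r"-l storage=(\S+)", qsub_line)
    return set(match.group(1).split("+")) if match else set()


# Abbreviated copies of the two qsub lines from this issue; every option
# other than the storage request has been dropped for brevity.
first_run = "qsub -q normal -P tm70 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- payu-run"
second_run = "qsub -q normal -P tm70 -l storage=gdata/access+gdata/hh5+gdata/tm70+gdata/vk83 -- payu-run"

added = storage_mounts(second_run) - storage_mounts(first_run)
removed = storage_mounts(first_run) - storage_mounts(second_run)
print("added:", sorted(added))
print("removed:", sorted(removed))
```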

MartinDix commented 8 months ago

See #11