Leeds-MONC / monc

MONC (Leeds fork)
BSD 3-Clause "New" or "Revised" License

MONC on ARCHER2 #32

Closed leifdenby closed 3 years ago

leifdenby commented 3 years ago

This is work in progress to get MONC compiling and running on ARCHER2.

leifdenby commented 3 years ago

Debug commands I'm using on ARCHER2 (for my own reference):

Run MONC inside gdb4hpc:

$> gdb4hpc
gdb all> launch --args="--config=tests/straka_short.mcf --checkpoint_file=checkpoint_files/straka_dump.nc" --launcher-args="--partition=standard --qos=standard --tasks-per-node=2 --exclusive --export=all" $monc{2} ./build/bin/monc_driver.exe
leifdenby commented 3 years ago

Currently I'm stuck on an issue with a call to `MPI_Alltoallv`:

earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> fcm make -f fcm-make/monc-cray-cray.cfg
[init] make                # 2020-12-15T15:15:52Z
[info] FCM 2019.05.0 (/home2/home/ta009/ta009/earlcd/fcm-2019.09.0)
[init] make config-parse   # 2020-12-15T15:15:52Z
[info] config-file=/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-cray-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/comp-cray-2107.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-build.cfg
[done] make config-parse   # 0.0s
[init] make dest-init      # 2020-12-15T15:15:52Z
[info] dest=earlcd@uan01:/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info] mode=incremental
[done] make dest-init      # 0.0s
[init] make extract        # 2020-12-15T15:15:52Z
[info] location  monc: 0: /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info]   dest:  381 [U unchanged]
[info] source:  381 [U from base]
[done] make extract        # 0.4s
[init] make preprocess     # 2020-12-15T15:15:53Z
[info] sources: total=381, analysed=0, elapsed-time=0.2s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install   targets: modified=0, unchanged=8, failed=0, total-time=0.0s
[info] process   targets: modified=0, unchanged=172, failed=0, total-time=0.0s
[info] TOTAL     targets: modified=0, unchanged=180, failed=0, elapsed-time=0.2s
[done] make preprocess     # 0.8s
[init] make build          # 2020-12-15T15:15:54Z
[info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.1s
[info] compile   targets: modified=120, unchanged=3, failed=0, total-time=176.7s
[info] compile+  targets: modified=112, unchanged=7, failed=0, total-time=0.5s
[info] link      targets: modified=1, unchanged=0, failed=0, total-time=0.5s
[info] TOTAL     targets: modified=233, unchanged=10, failed=0, elapsed-time=178.1s
[done] make build          # 178.3s
[done] make                # 179.6s
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc>
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> sbatch utils/archer2/submonc.slurm
Submitted batch job 59769
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> cat slurm-59769.out
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Currently Loaded Modulefiles:
 1) cpe-cray
 2) cce/10.0.4(default)
 3) craype/2.7.2(default)
 4) craype-x86-rome
 5) libfabric/1.11.0.0.233(default)
 6) craype-network-ofi
 7) cray-dsmml/0.1.2(default)
 8) perftools-base/20.10.0(default)
 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)
10) cray-mpich/8.0.16(default)
11) cray-libsci/20.10.1.2(default)
12) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
13) epcc-job-env
14) cray-netcdf/4.7.4.2(default)
15) cray-fftw/3.3.8.8(default)
16) cray-hdf5/1.12.0.2(default)
MPICH ERROR [Rank 1] [job id 59769.0] [Tue Dec 15 15:22:06 2020] [unknown] [nid001139] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264
[INFO] MONC running with 1 processes, 1 IO server(s)
[WARN] No enabled configuration for component ideal_squall therefore disabling this
[WARN] No enabled configuration for component kid_testcase therefore disabling this
[WARN] Run order callback for component tank_experiments at stage initialisation not specified
[WARN] Run order callback for component tank_experiments at stage finalisation not specified
[WARN] Defaulting to one dimension decomposition due to solution size too small
[INFO] Decomposed 1 processes via 'OneDim' into z=1 y=1 x=1
[INFO] 3D system; z=65, y=512, x=2
srun: error: nid001139: task 1: Exited with exit code 255
srun: Terminating job step 59769.0
slurmstepd: error: *** STEP 59769.0 ON nid001139 CANCELLED AT 2020-12-15T15:22:06 ***
srun: error: nid001139: task 0: Terminated
srun: Force Terminated job step 59769.0
leifdenby commented 3 years ago

I've tried compiling with fcm-make/comp-cray-2107-debug.cfg and using gdb4hpc to identify the issue. Within gdb4hpc I'm stuck since I don't get any output when trying to print local variables:

dbg all> launch --args="--config=tests/straka_short.mcf --checkpoint_file=checkpoint_files/straka_dump.nc" --launcher-args="--partition=standard --qos=standard --tasks-per-node=2 --exclusive --export=all" $monc{2} ./build/bin/monc_driver.exe
Starting application, please wait...
Creating MRNet communication network...
Waiting for debug servers to attach to MRNet communications network...
Timeout in 400 seconds. Please wait for the attach to complete.
Number of dbgsrvs connected: [1];  Timeout Counter: [0]
Number of dbgsrvs connected: [1];  Timeout Counter: [1]
Number of dbgsrvs connected: [2];  Timeout Counter: [0]
Finalizing setup...
Launch complete.
monc{0..1}: Initial breakpoint, monc_driver at /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/preprocess/src/monc/monc_driver.F90:16
dbg all> break pencilfft.F90:360
...
dbg all> print source_data
monc{0}: *** The application is running
dbg all> print size(source_data)
sjboeing commented 3 years ago

Hi @leifdenby: since these are MPI issues, I thought the changes that Chris applied for ARC4 may be worth exploring, in case you have not done so yet.

leifdenby commented 3 years ago

> Hi @leifdenby: since these are MPI issues, I thought the changes that Chris applied for ARC4 may be worth exploring, in case you have not done so yet.

Great idea @sjboeing! I'll give this a try

leifdenby commented 3 years ago

Unfortunately the fixes introduced for ARC4 don't appear to have fixed the issue @sjboeing. But I have an idea of what the issue might be. I'll put my testing in separate comments below.

leifdenby commented 3 years ago

Compiling with the Cray Fortran compiler (optimised) and running

compiling ```bash earlcd@uan01:~/work/monc> module restore PrgEnv-cray Unloading cray-hdf5/1.12.0.2 Unloading cray-fftw/3.3.8.8 Unloading cray-netcdf/4.7.4.2 Unloading /usr/local/share/epcc-module/epcc-module-loader Warning: Unloading the epcc-setup-env module will stop many modules being available on the system. If you do this by accident, you can recover the situation with the command: module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Unloading bolt/0.7 Unloading cray-libsci/20.10.1.2 Unloading cray-mpich/8.0.16 Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta Unloading perftools-base/20.10.0 WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles Unloading cray-dsmml/0.1.2 Unloading craype-network-ofi Unloading libfabric/1.11.0.0.233 Unloading craype-x86-rome Unloading craype/2.7.2 Unloading gcc/10.1.0 Unloading cpe-gnu Loading cpe-cray Loading cce/10.0.4 Loading craype/2.7.2 Loading craype-x86-rome Loading libfabric/1.11.0.0.233 Loading craype-network-ofi Loading cray-dsmml/0.1.2 Loading perftools-base/20.10.0 Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta Loading cray-mpich/8.0.16 Loading cray-libsci/20.10.1.2 Loading bolt/0.7 Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Loading /usr/local/share/epcc-module/epcc-module-loader earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-cray.cfg [init] make # 2021-01-26T12:16:55Z [info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0) [init] make config-parse # 2021-01-26T12:16:55Z [info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-cray.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-cray-2107.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg [done] make config-parse # 0.1s [init] make dest-init # 2021-01-26T12:16:55Z [info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc [info] mode=incremental [done] make dest-init # 0.1s [init] make extract # 2021-01-26T12:16:55Z [info] location monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc [info] dest: 381 [U unchanged] [info] source: 381 [U from base] [done] make extract # 7.7s [init] make preprocess # 2021-01-26T12:17:03Z [info] sources: total=381, analysed=180, elapsed-time=0.2s, total-time=0.1s [info] target-tree-analysis: elapsed-time=0.0s [info] install targets: modified=8, unchanged=0, failed=0, total-time=0.1s [info] process targets: modified=172, unchanged=0, failed=0, total-time=14.1s [info] TOTAL targets: modified=180, unchanged=0, failed=0, elapsed-time=14.3s [done] make preprocess # 14.7s [init] make build # 2021-01-26T12:17:17Z [info] sources: total=381, analysed=381, elapsed-time=1.5s, total-time=1.4s [info] target-tree-analysis: elapsed-time=0.3s [info] compile targets: modified=123, unchanged=0, failed=0, total-time=209.8s [info] compile+ targets: modified=119, unchanged=0, failed=0, total-time=1.3s [info] link targets: modified=1, unchanged=0, failed=0, total-time=2.0s [info] TOTAL targets: modified=243, unchanged=0, failed=0, elapsed-time=213.7s [done] make build # 215.4s [done] make # 238.0s ```
running MONC ```bash earlcd@uan01:~/work/monc> sbatch utils/archer2/submonc.slurm Submitted batch job 77481 ```
output of SLURM log ```bash earlcd@uan01:~/work/monc> cat slurm-77481.out Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile Loading cpe-cray Loading cce/10.0.4 Loading craype/2.7.2 Loading craype-x86-rome Loading libfabric/1.11.0.0.233 Loading craype-network-ofi Loading cray-dsmml/0.1.2 Loading perftools-base/20.10.0 Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta Loading cray-mpich/8.0.16 Loading cray-libsci/20.10.1.2 Loading bolt/0.7 Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Loading epcc-job-env Loading requirement: bolt/0.7 Currently Loaded Modulefiles: 1) cpe-cray 2) cce/10.0.4(default) 3) craype/2.7.2(default) 4) craype-x86-rome 5) libfabric/1.11.0.0.233(default) 6) craype-network-ofi 7) cray-dsmml/0.1.2(default) 8) perftools-base/20.10.0(default) 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default) 10) cray-mpich/8.0.16(default) 11) cray-libsci/20.10.1.2(default) 12) bolt/0.7 13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env 14) epcc-job-env [INFO] MONC running with 4 processes, 1 IO server(s) [WARN] No enabled configuration for component ideal_squall therefore disabling this [WARN] No enabled configuration for component kid_testcase therefore disabling this [WARN] Run order callback for component tank_experiments at stage initialisation not specified [WARN] Run order callback for component tank_experiments at stage finalisation not specified [WARN] Defaulting to one dimension decomposition due to solution size too small [INFO] Decomposed 4 processes via 'OneDim' into z=1 y=4 x=1 [INFO] 3D system; z=65, y=512, x=2 MPICH ERROR [Rank 1] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -404335872 aborting job: Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -404335872 MPICH ERROR [Rank 2] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(622338) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1539026176 aborting job: Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1539026176 MPICH ERROR [Rank 4] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(336166658) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, 
rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1178053888 aborting job: Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1178053888 srun: error: nid001037: tasks 1-2,4: Exited with exit code 255 srun: Terminating job step 77481.0 slurmstepd: error: *** STEP 77481.0 ON nid001037 CANCELLED AT 2021-01-26T12:22:49 *** srun: error: nid001037: tasks 0,3: Terminated srun: Force Terminated job step 77481.0 ```

This run-time error suggests to me that the routine calculating the size of the buffer used to make the MPI communication is doing the calculation incorrectly.
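For context, the failing call boils down to the standard `MPI_Alltoallv` pattern below. This is a minimal, self-contained sketch with hypothetical names, not the actual MONC pencilfft code: every entry of the count arrays has to be a non-negative integer, so a single uninitialised (or overflowed) entry is enough to produce the "Invalid count ... Negative count" abort shown above.

```fortran
! Minimal sketch (hypothetical names, not MONC code) of the MPI_Alltoallv call
! pattern that is failing above. Every entry of send_counts/recv_counts must be
! >= 0; a garbage entry triggers the "Negative count" abort seen in the log.
program alltoallv_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i
  integer, allocatable :: send_counts(:), recv_counts(:), send_displs(:), recv_displs(:)
  double precision, allocatable :: send_buf(:), recv_buf(:)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(send_counts(nprocs), recv_counts(nprocs), send_displs(nprocs), recv_displs(nprocs))
  send_counts = 4                      ! number of elements exchanged with each rank
  recv_counts = 4
  do i = 1, nprocs
    send_displs(i) = (i - 1) * 4       ! displacements are element offsets into the buffers
    recv_displs(i) = (i - 1) * 4
  end do
  allocate(send_buf(sum(send_counts)), recv_buf(sum(recv_counts)))
  send_buf = real(rank, kind(send_buf))

  call mpi_alltoallv(send_buf, send_counts, send_displs, MPI_DOUBLE_PRECISION, &
       recv_buf, recv_counts, recv_displs, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)

  call mpi_finalize(ierr)
end program alltoallv_sketch
```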

I should also note that when compiling with debug flags (using fcm-make/monc-cray-cray-debug.cfg, which does compile), the SLURM job simply aborts (it is not clear from the log why).

leifdenby commented 3 years ago

Compiling with the GNU Fortran compiler and running

compiling ```bash earlcd@uan01:~/work/monc> module restore PrgEnv-gnu Unloading /usr/local/share/epcc-module/epcc-module-loader Warning: Unloading the epcc-setup-env module will stop many modules being available on the system. If you do this by accident, you can recover the situation with the command: module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Unloading bolt/0.7 Unloading cray-libsci/20.10.1.2 Unloading cray-mpich/8.0.16 Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta Unloading perftools-base/20.10.0 WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles Unloading cray-dsmml/0.1.2 Unloading craype-network-ofi Unloading libfabric/1.11.0.0.233 Unloading craype-x86-rome Unloading craype/2.7.2 Unloading cce/10.0.4 Unloading cpe-cray Loading cpe-gnu Loading gcc/10.1.0 Loading craype/2.7.2 Loading craype-x86-rome Loading libfabric/1.11.0.0.233 Loading craype-network-ofi Loading cray-dsmml/0.1.2 Loading perftools-base/20.10.0 Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta Loading cray-mpich/8.0.16 Loading cray-libsci/20.10.1.2 Loading bolt/0.7 Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env Loading /usr/local/share/epcc-module/epcc-module-loader earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw (reverse-i-search)`': ^C earlcd@uan01:~/work/monc> ftn --version GNU Fortran (GCC) 10.1.0 20200507 (Cray Inc.) Copyright (C) 2020 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-gnu.cfg [init] make # 2021-01-26T12:30:49Z [info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0) [init] make config-parse # 2021-01-26T12:30:49Z [info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-gnu.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-gnu-4.4.7.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg [info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg [done] make config-parse # 0.1s [init] make dest-init # 2021-01-26T12:30:49Z [info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc [info] mode=incremental [done] make dest-init # 0.1s [init] make extract # 2021-01-26T12:30:49Z [info] location monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc [info] dest: 381 [U unchanged] [info] source: 381 [U from base] [done] make extract # 0.8s [init] make preprocess # 2021-01-26T12:30:50Z [info] sources: total=381, analysed=0, elapsed-time=1.2s, total-time=0.0s [info] target-tree-analysis: elapsed-time=0.0s [info] install targets: modified=0, unchanged=8, failed=0, total-time=0.0s [info] process targets: modified=0, unchanged=172, failed=0, total-time=0.0s [info] TOTAL targets: modified=0, unchanged=180, failed=0, elapsed-time=2.2s [done] make preprocess # 5.6s [init] make build # 2021-01-26T12:30:56Z [info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s [info] target-tree-analysis: elapsed-time=0.2s [FAIL] ftn -oo/conditional_diagnostics_whole_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . 
/lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90 # rc=1 [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90:80:22: [FAIL] [FAIL] 77 | call mpi_reduce(MPI_IN_PLACE , CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), & [FAIL] | 2 [FAIL] ...... [FAIL] 80 | call mpi_reduce(CondDiags_tot, CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), & [FAIL] | 1 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(8)/INTEGER(4)). [FAIL] compile 0.1 ! conditional_diagnostics_whole_mod.o <- monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90 [FAIL] ftn -oo/iterativesolver_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90 # rc=1 [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90:540:23: [FAIL] [FAIL] 540 | call mpi_allreduce(local_sum, global_sum, 3, PRECISION_TYPE, MPI_SUM, current_state%parallel%monc_communicator, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 600 | call mpi_allreduce(current_state%local_divmax, current_state%global_divmax, 1, PRECISION_TYPE, MPI_MAX, & [FAIL] | 2 [FAIL] Error: Rank mismatch between actual argument at (1) and actual argument at (2) (scalar and rank-1) [FAIL] compile 0.1 ! iterativesolver_mod.o <- monc/components/iterativesolver/src/iterativesolver.F90 [FAIL] ftn -oo/monc_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90 # rc=1 [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90:207:60: [FAIL] [FAIL] 207 | call mpi_barrier(state%parallel%monc_communicator, ierr) [FAIL] | 1 [FAIL] Error: More actual than formal arguments in procedure call at (1) [FAIL] compile 0.1 ! monc_mod.o <- monc/model_core/src/monc.F90 [FAIL] ftn -oo/io_server_client_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90 # rc=1 [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:168:25: [FAIL] [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] ...... [FAIL] 168 | call mpi_get_address(basic_type%dimensions, num_addr, ierr) [FAIL] | 1 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:173:25: [FAIL] [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] ...... [FAIL] 173 | call mpi_get_address(basic_type%dim_sizes, num_addr, ierr) [FAIL] | 1 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). 
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:124:25: [FAIL] [FAIL] 124 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(field_description_type)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:129:25: [FAIL] [FAIL] 129 | call mpi_get_address(basic_type%field_name, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (CHARACTER(150)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:134:25: [FAIL] [FAIL] 134 | call mpi_get_address(basic_type%field_type, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:139:25: [FAIL] [FAIL] 139 | call mpi_get_address(basic_type%data_type, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:144:25: [FAIL] [FAIL] 144 | call mpi_get_address(basic_type%optional, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:92:25: [FAIL] [FAIL] 92 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(definition_description_type)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:97:25: [FAIL] [FAIL] 97 | call mpi_get_address(basic_type%send_on_terminate, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:102:25: [FAIL] [FAIL] 102 | call mpi_get_address(basic_type%number_fields, num_addr, ierr) [FAIL] | 1 [FAIL] ...... [FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). [FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:107:25: [FAIL] [FAIL] 107 | call mpi_get_address(basic_type%frequency, num_addr, ierr) [FAIL] | 1 [FAIL] ...... 
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr) [FAIL] | 2 [FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)). [FAIL] compile 0.1 ! io_server_client_mod.o <- monc/io/src/ioclient.F90 [info] compile targets: modified=0, unchanged=91, failed=4, total-time=0.3s [info] compile+ targets: modified=0, unchanged=88, failed=0, total-time=0.0s [info] TOTAL targets: modified=0, unchanged=179, failed=8, elapsed-time=2.0s [FAIL] ! conditional_diagnostics_whole_mod.mod: depends on failed target: conditional_diagnostics_whole_mod.o [FAIL] ! conditional_diagnostics_whole_mod.o: update task failed [FAIL] ! io_server_client_mod.mod: depends on failed target: io_server_client_mod.o [FAIL] ! io_server_client_mod.o: update task failed [FAIL] ! iterativesolver_mod.mod: depends on failed target: iterativesolver_mod.o [FAIL] ! iterativesolver_mod.o: update task failed [FAIL] ! monc_mod.mod : depends on failed target: monc_mod.o [FAIL] ! monc_mod.o : update task failed [FAIL] make build # 2.3s [FAIL] make # 8.9s ```

does not compile

With the GNU compiler MONC fails to compile. All of the errors relate to incorrect datatypes being passed to MPI subroutines (as far as I can see). I think these are bugs, and fixing them may resolve the issue we are having at runtime. It is possible that the GNU compiler is simply stricter here and catches these bugs at compile time.
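To make concrete what gfortran is objecting to, here is a minimal sketch (hypothetical code, not the MONC source) of the pattern behind the first error above: the same MPI routine is called once with `MPI_IN_PLACE` (an integer constant) and once with a `double precision` buffer as its first argument. With the legacy `mpif.h`/`use mpi` bindings these "choice" buffers typically have no explicit interface, and GCC 10 turned such argument mismatches from a warning into an error by default.

```fortran
! Hypothetical sketch of the pattern GCC 10 now rejects: the same external MPI
! routine is called with different argument types within one source file.
subroutine reduce_in_place(local_sums, n, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, comm
  double precision, intent(inout) :: local_sums(n)
  double precision :: recv_scratch(n)
  integer :: ierr, rank

  call mpi_comm_rank(comm, rank, ierr)
  if (rank == 0) then
    ! On the root, the send buffer is the integer constant MPI_IN_PLACE ...
    call mpi_reduce(MPI_IN_PLACE, local_sums, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  else
    ! ... on the other ranks it is a double precision array: gfortran 10 sees the
    ! same routine called with mismatched argument types and rejects the file.
    call mpi_reduce(local_sums, recv_scratch, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  end if
end subroutine reduce_in_place
```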

Thoughts @sjboeing ?

leifdenby commented 3 years ago

Digging a little further, I've added some print statements. It appears that the counts are calculated incorrectly (I'm compiling with the Cray Fortran compiler again here):

earlcd@uan01:~/work/monc> cat slurm-77539.out 
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env

Loading epcc-job-env
  Loading requirement: bolt/0.7
Currently Loaded Modulefiles:
 1) cpe-cray                                                         
 2) cce/10.0.4(default)                                              
 3) craype/2.7.2(default)                                            
 4) craype-x86-rome                                                  
 5) libfabric/1.11.0.0.233(default)                                  
 6) craype-network-ofi                                               
 7) cray-dsmml/0.1.2(default)                                        
 8) perftools-base/20.10.0(default)                                  
 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)               
10) cray-mpich/8.0.16(default)                                       
11) cray-libsci/20.10.1.2(default)                                   
12) bolt/0.7                                                         
13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env  
14) epcc-job-env                                                     
MPICH ERROR [Rank 2] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048
 debug send_sizes 4352,  3*4096
 debug recv_sizes 4*4096
 debug send_sizes 16448
 debug recv_sizes 16448
 debug send_sizes 32896
 debug recv_sizes -919842048
MPICH ERROR [Rank 1] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ca16c0, scnts=0x47f8200, sdispls=0x47f6d40, MPI_DOUBLE_PRECISION, rbuf=0x5ce5cc0, rcnts=0x47f1400, rdispls=0x47ee200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1300475136

I've gotten as far as identifying that `determine_offsets_from_size` (also in the compiled MONC docs) is in charge of computing these offsets. I think the next step will be to work out why this subroutine isn't producing positive values (as it should).
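For anyone following along, the `debug send_sizes` / `debug recv_sizes` lines above came from instrumentation of roughly the following form around the alltoallv call; all names here are hypothetical, the real arrays live in the pencil FFT transposition code in pencilfft.F90. Printing the count arrays and checking for negative entries localises which rank, and which call, hands garbage to MPI before the library aborts the whole job.

```fortran
! Sketch of the instrumentation used above (hypothetical names, not MONC code).
subroutine checked_alltoallv(source_data, send_sizes, send_offsets, &
     target_data, recv_sizes, recv_offsets, communicator)
  use mpi
  implicit none
  double precision, intent(in)  :: source_data(:)
  double precision, intent(out) :: target_data(:)
  integer, intent(in) :: send_sizes(:), send_offsets(:), recv_sizes(:), recv_offsets(:)
  integer, intent(in) :: communicator
  integer :: ierr, my_rank

  call mpi_comm_rank(communicator, my_rank, ierr)
  write(*, *) "debug send_sizes", send_sizes
  write(*, *) "debug recv_sizes", recv_sizes
  ! Flag invalid counts before MPI does, so the offending rank is obvious.
  if (any(send_sizes < 0) .or. any(recv_sizes < 0)) then
    write(*, *) "invalid alltoallv counts on rank", my_rank
  end if
  call mpi_alltoallv(source_data, send_sizes, send_offsets, MPI_DOUBLE_PRECISION, &
       target_data, recv_sizes, recv_offsets, MPI_DOUBLE_PRECISION, communicator, ierr)
end subroutine checked_alltoallv
```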

sjboeing commented 3 years ago

Hi @leifdenby, this looks like it is not really trivial. Two things you could try are: 1) disable the IO server component, just to see if the issue has anything to do with it (just use `enable_io_server=.false.`?); 2) run a BOMEX case. I think the Straka case is effectively 2D, and may not be as well accommodated as runs on a 3D domain.

leifdenby commented 3 years ago

Thanks for the suggestions @sjboeing, I've tried both and the issue persists.

with `enable_io_server=.false.` ```bash aborting job: Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570ab40, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573f040, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1112427776 debug send_sizes 5*2652 debug recv_sizes 2*2678, 3*2652 debug send_sizes 13364 debug recv_sizes 13364 debug send_sizes 26728 debug recv_sizes -1090014464 MPICH ERROR [Rank 4] [job id 78435.0] [Tue Jan 26 21:33:32 2021] [unknown] [nid001010] - Abort(940146434) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570a080, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573e580, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed PMPI_Alltoallv(351): Negative count, value is -1090014464 ```

and

using the `testcases/shallow_convection/bomex.mcf` configuration ```bash aborting job: Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9961800, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a03480, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0xc4000000) failed PMPI_Alltoallv(351): Negative count, value is -746590336 debug send_sizes 2*38912 debug recv_sizes 2*38912 debug send_sizes 2*40128 debug recv_sizes 2*40128 debug send_sizes 2*41382 debug recv_sizes -2011828352, 0 MPICH ERROR [Rank 4] [job id 78459.0] [Tue Jan 26 22:11:40 2021] [unknown] [nid001010] - Abort(134840066) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack: PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x99602c0, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a01f40, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0x84000007) failed PMPI_Alltoallv(351): Negative count, value is -2011828352 ```
MarkUoLeeds commented 3 years ago

Hi @leifdenby, just wondering what choice of `moncs_per_io` is set. Is this a version that still uses FFTW rather than the new MOSRS HEAD that uses FFTE? The message seemed to indicate `fft_pencil` in the earlier output. Note that ARCHER2 has 128 cores per node and 8 NUMA regions per node, so it is best to have one IO server per NUMA region: that is 15 MONCs per IO server. [Sorry if you already knew that; I did not look at your case.] Alternatively, use 63 MONCs per IO server so that there is one IO server per socket.

MarkUoLeeds commented 3 years ago

I just re-read some of your compile woes. Note that GCC 10 does not like the fact that MPI calls can pass any data type, so we have to apply `-fallow-argument-mismatch`; I found that back in November, and the ARCHER2 team added it to the documentation on building: https://docs.archer2.ac.uk/user-guide/dev-environment/

MarkUoLeeds commented 3 years ago

Did you solve this yet? I see the only reference to `mpi_alltoallv` appears to be in components/fftsolver/src/pencilfft.F90. I was (am) working with the MOSRS code at r8166, where Adrian seems to start most of his branches. Perhaps I should turn my attention to this repo.

leifdenby commented 3 years ago

> Did you solve this yet? I see the only reference to `mpi_alltoallv` appears to be in components/fftsolver/src/pencilfft.F90. I was (am) working with the MOSRS code at r8166, where Adrian seems to start most of his branches. Perhaps I should turn my attention to this repo.

I haven't, no :cry: The farthest I've gotten is producing a branch (see https://github.com/Leeds-MONC/monc/pull/38) which contains all the commits Adrian has made on MOSRS while working on ARCHER2 fixes. As you know, this branch includes a lot of changes and, as it stands, also reverses the changes Chris recently made for ARC4.

I am going to try and cherry-pick just the first four commits and see if that helps with running on ARCHER2.

> I just re-read some of your compile woes. Note that GCC 10 does not like the fact that MPI calls can pass any data type, so we have to apply `-fallow-argument-mismatch`; I found that back in November, and the ARCHER2 team added it to the documentation on building: https://docs.archer2.ac.uk/user-guide/dev-environment/

Thank you for suggesting this. I'll give it a try by adding that compilation flag.

> Just wondering what choice of `moncs_per_io` is set. Is this a version that still uses FFTW rather than the new MOSRS HEAD that uses FFTE? The message seemed to indicate `fft_pencil` in the earlier output. Note that ARCHER2 has 128 cores per node and 8 NUMA regions per node, so it is best to have one IO server per NUMA region: that is 15 MONCs per IO server. [Sorry if you already knew that; I did not look at your case.] Alternatively, use 63 MONCs per IO server so that there is one IO server per socket.

Thank you for suggesting this. I'm not quite sure how to work this out. If you check my run log above you'll see:

> [INFO] MONC running with 4 processes, 1 IO server(s)

My `moncs_per_io` is set to 3 (https://github.com/leifdenby/monc/blob/archer2-compilation/tests/straka_short.mcf#L38) and I think I'm requesting 5 cores in my job (https://github.com/leifdenby/monc/blob/archer2-compilation/utils/archer2/submonc.slurm#L7).

Does that sound reasonable or am I doing something obviously stupid?

MarkUoLeeds commented 3 years ago

The job looks poorly specified. If you want to run a total of 4 MPI tasks (i.e. 1 IO and 3 MONC) then tasks-per-node should also be 4, but then you might choose to spread them out, unless this is just a really basic job and you are happy for all tasks to sit in the same NUMA region. When doing a proper job, consider cpus-per-task if you have fewer than 128 tasks on one node.

cemac-ccs commented 3 years ago

Per Ralph's notes, and in agreement with @MarkUoLeeds, it seems that the Leeds branch works with GNU 9.3.0 but not the default GNU 10. Having my module load order as:

export PATH=$PATH:/work/y07/shared/umshared/bin
export PATH=$PATH:/work/y07/shared/umshared/software/bin
. mosrs-setup-gpg-agent

module restore PrgEnv-cray
module load cpe-gnu
module load gcc/9.3.0
module load cray-netcdf-hdf5parallel
module load cray-hdf5-parallel
module load cray-fftw/3.3.8.7
module load petsc/3.13.3

seems the best option for compiling with GNU using `fcm make -j4 -f fcm-make/monc-cray-gnu.cfg`. The PATH additions pick up the installed version of fcm and allow you to cache your MOSRS password with the `. mosrs-setup-gpg-agent` command (as needed when getting CASIM and SOCRATES from MOSRS).

I'm still trying out the Cray compiler so can't comment on that, and I'm waiting on the test job to run, but I thought I'd mention this.

leifdenby commented 3 years ago

Thanks for the help here. I think we can close this now that @cemac-ccs is working on a pull request for ARCHER2: https://github.com/Leeds-MONC/monc/pull/45