destination-earth-digital-twins / DE330_Training_2024


deode forecast runtime error #12

Open · iljamal opened this issue 5 months ago

iljamal commented 5 months ago

Starting the demo forecast:

deode case deode/data/config_files/configurations/cy48t3_arome -o cy48t3_arome.toml --start-suite

The workflow reached the forecast step, which then ends with a NetCDF symbol lookup error:

Snippet from the ecFlow output log:

[ECMWF-INFO -ecsbatch] - -------------------------------------------------------------------------------------
[ECMWF-INFO -ecsbatch] -  This is the ECMWF jobfilter
[ECMWF-INFO -ecsbatch] -  +++ Please report issues using the Support portal +++
[ECMWF-INFO -ecsbatch] -  +++ https://support.ecmwf.int                     +++
[ECMWF-INFO -ecsbatch] -  /usr/local/bin/ecsbatch: size: 49350, mtime: Thu Mar 14 09:29:45 2024
[ECMWF-INFO -ecsbatch] - -------------------------------------------------------------------------------------
[ECMWF-INFO -ecsbatch] - Time at submit: Wed Jun 19 07:29:51 2024 (1718782191.4708633) on ac6-209.bullx:/etc/ecmwf/nfs/dh1_home_b/eeim
[ECMWF-INFO -ecsbatch] - --- SLURM VARIABLES ---
[ECMWF-INFO -ecsbatch] - EC_CLUSTER=ac
[ECMWF-INFO -ecsbatch] - SLURM_EXPORT_ENV=ALL
[ECMWF-INFO -ecsbatch] - SBATCH_EXPORT=NONE
[ECMWF-INFO -ecsbatch] - -----------------------
[ECMWF-INFO -ecsbatch] - jobscript received on STDIN
[ECMWF-INFO -ecsbatch] - --- SCRIPT OPTIONS ---
[ECMWF-INFO -ecsbatch] - #SBATCH --output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1
[ECMWF-INFO -ecsbatch] - #SBATCH --job-name=Forecast
[ECMWF-INFO -ecsbatch] - #SBATCH --qos=np
[ECMWF-INFO -ecsbatch] - #SBATCH --signal=USR1@30
[ECMWF-INFO -ecsbatch] - #SBATCH --time=01:00:00
[ECMWF-INFO -ecsbatch] - #SBATCH --nodes=2
[ECMWF-INFO -ecsbatch] - #SBATCH --ntasks=32
[ECMWF-INFO -ecsbatch] - -----------------------
[ECMWF-INFO -ecsbatch] - --- POST-PROCESSED OPTIONS ---
[ECMWF-INFO -ecsbatch] - ARG --job_name=Forecast
[ECMWF-INFO -ecsbatch] - ARG --ntasks=32
[ECMWF-INFO -ecsbatch] - ARG --nodes=2
[ECMWF-INFO -ecsbatch] - ARG --output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1
[ECMWF-INFO -ecsbatch] - ARG --qos=np
[ECMWF-INFO -ecsbatch] - ARG --signal=USR1@30
[ECMWF-INFO -ecsbatch] - ARG --time=01:00:00
[ECMWF-INFO -ecsbatch] - ------------------------------
[ECMWF-INFO -ecsbatch] - jobtag: eeim-Forecast-2x512-/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast
[ECMWF-INFO -ecsbatch] - ['/usr/bin/sbatch', '--job-name=Forecast', '--ntasks=32', '--nodes=2', '--output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1', '--qos=np', '--signal=USR1@30', '--time=01:00:00', '--licenses=h2resw01', '--export=EC_user_time_limit=01:00:00']
[ECMWF-INFO -ecsbatch] - ecsbatch executed on ac
[ECMWF-INFO -ecsbatch] - Job queued on ac using method local
[ECMWF-INFO -ecsbatch] - Submitted batch job 38281261
[ECMWF-INFO -ecprofile] /usr/bin/bash NON_INTERACTIVE on ac1-2015 at 20240619_073000.871, PID: 2377574, JOBID: 38281261
[ECMWF-INFO -ecprofile] $SCRATCH=/ec/res4/scratch/eeim
[ECMWF-INFO -ecprofile] $PERM=/perm/eeim
[ECMWF-INFO -ecprofile] $HPCPERM=/ec/res4/hpcperm/eeim
[ECMWF-INFO -ecprofile] $TMPDIR=/dev/shm/_tmpdir_.eeim.38281261
[ECMWF-INFO -ecprofile] $SCRATCHDIR=/ec/res4/scratchdir/eeim/5/38281261

The following have been reloaded with a version change:
  1) ecmwf-toolbox/2024.04.0.0 => ecmwf-toolbox/2024.02.1.0

The following have been reloaded with a version change:
  1) hdf5/1.14.3 => hdf5/1.10.6

The following have been reloaded with a version change:
  1) netcdf4/4.9.2 => netcdf4/4.7.4

Lmod is automatically replacing "openmpi/4.1.5.4" with "hpcx-openmpi/2.9.0".

Due to MODULEPATH changes, the following have been reloaded:
  1) ecmwf-toolbox/2024.02.1.0     3) hdf5/1.10.6            5) netcdf4/4.7.4
  2) fftw/3.3.9                    4) hpcx-openmpi/2.9.0

The following have been reloaded with a version change:
  1) prgenv/gnu => prgenv/intel

2024-06-19 07:30:11 | INFO     |    Only wait 20 seconds, if the server cannot be contacted (note default is 24 hours) before failing
2024-06-19 07:30:11 | INFO     | Calling init at: 07:30:11
2024-06-19 07:30:12 | INFO     | Running task /CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast
2024-06-19 07:30:12 | INFO     | Task search path: ['/etc/ecmwf/nfs/dh1_home_b/eeim/Deode-Workflow/deode/tasks']
2024-06-19 07:30:12 | INFO     | Loading module deode.tasks.archive

< snip >

## EC_MEMINFO Detailed memory information for program /etc/ecmwf/nfs/dh1_perm_b/snh02/pack/bin/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB -- wall-time :      0.714s
## EC_MEMINFO Running on 2 nodes (4-numa) with 24 compute + 4 I/O-tasks and 1+1 threads at 07:30:18.049 on 19-Jun-2024
## EC_MEMINFO The Job Name is Forecast and the Job ID is 38281261
## EC_MEMINFO 
## EC_MEMINFO                           | TC    | MEMORY USED(MB) | MEMORY FREE(MB)  -------------    -------------    -------------   INCLUDING CACHED|  %USED %HUGE  | Energy  Power
## EC_MEMINFO                           | Malloc| Inc Heap        | Numa region  0 | Numa region  1 | Numa region  2 | Numa region  3 |                |               |    (J)    (W)
## EC_MEMINFO Node Name                 | Heap  | RSS(sum)        | Small  Huge or | Small  Huge or | Small  Huge or | Small  Huge or | Total          |
## EC_MEMINFO                           | (sum) | Small    Huge   |  Only   Small  |  Only   Small  |  Only   Small  |  Only   Small  | Memfree+Cached |
## EC_MEMINFO    0 ac1-2015               33226    2379       0      4364   23238     2923   28364     2100   28714     1885   29088    243073    1312      1.0   0.0         0      0  Sm/p:oops:ifs_init
## EC_MEMINFO    1 ac1-2021               24996    1785       0      6282   21102     2647   27930     2927   27926     2391   28480    243353    1185      0.7   0.0         0      0  Sm/p:master:comput
/home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB: symbol lookup error: /home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB: undefined symbol: netcdf_mp_nf90_open_
srun: error: ac1-2015: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=38281261.0
slurmstepd: error: *** STEP 38281261.0 ON ac1-2015 CANCELLED AT 2024-06-19T07:30:20 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   0000149E3ABE478C  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  0000149E11962CF0  Unknown               Unknown  Unknown
mca_coll_libnbc.s  0000149DF69F917A  ompi_coll_libnbc_     Unknown  Unknown
libopen-pal.so.40  0000149E0E3A9324  opal_progress         Unknown  Unknown
libopen-pal.so.40  0000149E0E3AFF9D  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.1  0000149E14FD22A8  ompi_request_defa     Unknown  Unknown
libmpi.so.40.30.1  0000149E1500C858  ompi_coll_base_bc     Unknown  Unknown
mca_coll_tuned.so  0000149DF63D3320  ompi_coll_tuned_b     Unknown  Unknown
libmpi.so.40.30.1  0000149E14FE6A74  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  0000149E152E1E44  pmpi_bcast            Unknown  Unknown
MASTERODB          0000000004FC6088  mpl_broadcast_mod         901  mpl_broadcast_mod.F90
MASTERODB          0000000004C0E648  easy_netcdf_read_         201  easy_netcdf_read_mpi.F90
MASTERODB          0000000002953592  yomclim_mp_read_g         100  yomclim.F90
MASTERODB          000000000217DFAB  suecrad_                 2472  suecrad.F90
MASTERODB          0000000002168CE2  suphec_                   259  suphec.F90
MASTERODB          0000000000CE4DBC  suphy_                     82  suphy.F90
MASTERODB          0000000000AD04A3  su0yomb_                  537  su0yomb.F90
MASTERODB          000000000041B08B  cnt0_                     188  cnt0.F90
MASTERODB          0000000000412A7F  MAIN__                    246  master.F90
MASTERODB          0000000000412422  Unknown               Unknown  Unknown
libc-2.28.so       0000149E115C5D85  __libc_start_main     Unknown  Unknown
MASTERODB          000000000041232E  Unknown               Unknown  Unknown

It seems that many modules are reloaded at the beginning of the job ... could that be interfering with the module environment set up in my .bash_profile?
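For context, `netcdf_mp_nf90_open_` looks like Intel-Fortran name mangling for the `nf90_open` procedure of the `netcdf` module, so the Intel-built MASTERODB is presumably resolving a NetCDF-Fortran library built with a different compiler (e.g. a gnu-built one pulled in via `.bash_profile`), which would not export that symbol. A possible way to check, a sketch only (`$LIBNETCDFF` is a placeholder, not a variable from the workflow):

```bash
# Which NetCDF libraries does the dynamic linker resolve for the binary?
ldd /home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB | grep -i netcdf

# Does that library actually export the symbol the loader could not find?
# $LIBNETCDFF stands for the libnetcdff.so path printed by ldd above.
nm -D "$LIBNETCDFF" | grep -i nf90_open
```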

iljamal commented 5 months ago

Removing all module load commands from .bash_profile seems to solve it, and the forecast step completed successfully :)

I was using:

module load prgenv/gnu cdo python3 nco ecmwf-toolbox
module load openmpi hdf5 netcdf4
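If those modules are still wanted for interactive work, one possible compromise (an untested sketch, not official guidance) is to guard the loads so they only run in interactive shells; the `[ECMWF-INFO -ecprofile]` line above shows the batch job shell is NON_INTERACTIVE, so it would then start from a clean module environment:

```bash
# ~/.bash_profile sketch: load personal modules only in interactive shells,
# so ecFlow/SLURM batch jobs are unaffected.
case $- in
  *i*)
    module load prgenv/gnu cdo python3 nco ecmwf-toolbox
    module load openmpi hdf5 netcdf4
    ;;
esac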

uandrae commented 5 months ago

These are very useful experiences, although painful for you. It suggests that we should perhaps make sure that batch jobs run under a cleaner environment with the correct SBATCH directives.
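For illustration, a hypothetical fragment of what a generated job script could do to insulate itself (not what Deode-Workflow currently emits; the jobfilter output above already shows SBATCH_EXPORT=NONE, so the module purge is the key part, and the versions are copied from the Lmod messages in the log):

```bash
#SBATCH --export=NONE   # do not inherit the submitting shell's environment
module purge            # drop whatever .bash_profile/.bashrc loaded
# Reload exactly the stack the binary was built against (adjust as needed)
module load prgenv/intel ecmwf-toolbox/2024.02.1.0 hpcx-openmpi/2.9.0 \
            hdf5/1.10.6 netcdf4/4.7.4
```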

Is it running now?

uandrae commented 5 months ago

In general, putting a lot in .bashrc/.bash_profile brings surprises when you are working on different projects with different needs.