cedadev / swallow

Swallow - a Birdhouse WPS for running the NAME Trajectory code.

general forward model is segfaulting #43

Closed: alaniwi closed this issue 1 year ago

alaniwi commented 2 years ago

The general forward model has started segfaulting. Previously it was okay; it is not clear whether we are using different inputs or whether something in the environment has changed. Need to try Andrew's example (/gws/nopw/j04/name/examples/WebInterface_v8_3/GeneralForwardRun/GeneralForwardRun.sbatch).
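
For reference, a minimal sketch of submitting that example as a batch job (assuming sbatch is available on the host, as on the JASMIN batch cluster):

# Sketch: submit Andrew's example (path from above) to the batch scheduler
sbatch /gws/nopw/j04/name/examples/WebInterface_v8_3/GeneralForwardRun/GeneralForwardRun.sbatch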

Steps to reproduce using a previously generated input file:

mkdir -m 777 /tmp/pywps_process_yf61z5l2
mkdir -m 777 /tmp/pywps_process_yf61z5l2/met_data

Then sudo to the service user (username withheld from public view) and run:

/gws/smf/j04/cedaproc/cedawps/swallow/files/20220520_1/model/run_name.sh ~/alan-scratch/input-file-for-failed-gen-fwd-run
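
Combined into a single script, a sketch of those reproduction steps (same paths as above; the final command must be run as the service user):

#!/bin/bash
# Sketch of the reproduction steps described in this issue.
WORKDIR=/tmp/pywps_process_yf61z5l2    # scratch dir used by the failed run
mkdir -m 777 "$WORKDIR"
mkdir -m 777 "$WORKDIR/met_data"
# Run the model wrapper against the previously generated input file
# (this step must be executed as the service user):
/gws/smf/j04/cedaproc/cedawps/swallow/files/20220520_1/model/run_name.sh \
    ~/alan-scratch/input-file-for-failed-gen-fwd-run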

Fails at:

Case 1 started

 Preparing to update met and flow modules at time 01/01/2022 00:00 UTC
 System command submitted:
 /gws/smf/j04/cedaproc/cedawps/swallow/files/20220406_1/utils/MetRestore_JASMIN.
 sh /tmp/pywps_process_yf61z5l2/met_data/ MO202201010000.UMG_Mk10_I_L59PT2.pp
metrestore: looking for MO202201010000.UMG_Mk10_I_L59PT2.pp
metrestore: linking to MO202201010000.UMG_Mk10_I_L59PT2.pp in met archive
 System command submitted:
 /gws/smf/j04/cedaproc/cedawps/swallow/files/20220406_1/utils/MetRestore_JASMIN.
 sh /tmp/pywps_process_yf61z5l2/met_data/ MO202201010000.UMG_Mk10_M_L59PT2.pp
metrestore: looking for MO202201010000.UMG_Mk10_M_L59PT2.pp
metrestore: linking to MO202201010000.UMG_Mk10_M_L59PT2.pp in met archive
 NWP met data for 01/01/2022 00:00 UTC read from file
 /tmp/pywps_process_yf61z5l2/met_data/MO202201010000.UMG_Mk10_I_L59PT2.pp
 NWP met data for 01/01/2022 00:00 UTC read from file
 /tmp/pywps_process_yf61z5l2/met_data/MO202201010000.UMG_Mk10_M_L59PT2.pp
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
nameiii_64bit_par  000000000086C2A1  tbk_trace_stack_i     Unknown  Unknown
nameiii_64bit_par  000000000086A3DB  tbk_string_stack_     Unknown  Unknown
nameiii_64bit_par  0000000000812844  Unknown               Unknown  Unknown
nameiii_64bit_par  0000000000812656  tbk_stack_trace       Unknown  Unknown
nameiii_64bit_par  00000000007A5709  for__issue_diagno     Unknown  Unknown
nameiii_64bit_par  00000000007ABAB6  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B45EBCAE630  Unknown               Unknown  Unknown
nameiii_64bit_par  000000000050B01D  Unknown               Unknown  Unknown
nameiii_64bit_par  0000000000508E57  Unknown               Unknown  Unknown
nameiii_64bit_par  000000000057EC07  Unknown               Unknown  Unknown
nameiii_64bit_par  000000000055F853  Unknown               Unknown  Unknown
nameiii_64bit_par  00000000005D4E7A  Unknown               Unknown  Unknown
nameiii_64bit_par  00000000005FB585  Unknown               Unknown  Unknown
nameiii_64bit_par  00000000005FF6D2  Unknown               Unknown  Unknown
nameiii_64bit_par  0000000000600C0A  Unknown               Unknown  Unknown
nameiii_64bit_par  000000000070144C  Unknown               Unknown  Unknown
nameiii_64bit_par  00000000006FC043  Unknown               Unknown  Unknown
libiomp5.so        00002B45EB9B02A3  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002B45EB980407  Unknown               Unknown  Unknown
libiomp5.so        00002B45EB97FA85  Unknown               Unknown  Unknown
libiomp5.so        00002B45EB9B0783  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B45EBCA6EA5  Unknown               Unknown  Unknown
libc-2.17.so       00002B45EBFB9B0D  clone                 Unknown  Unknown
alaniwi commented 2 years ago

A general yum update was recently run; this included glibc. Maybe try recompiling?
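
One way to confirm what that update changed (a sketch, assuming standard yum/rpm tooling on the host; <ID> is a placeholder for the transaction ID):

yum history list        # find the transaction ID of the recent update
yum history info <ID>   # list the packages it changed (should include glibc)
rpm -q glibc            # confirm the currently installed glibc version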

alaniwi commented 2 years ago

Recompilation did not fix it, although the segfault then happens slightly later in the run (and without a traceback). Temporarily undoing the yum update (now restored) also did not fix it, with either the previously compiled or the recompiled version of the code; again, this slightly shifted the point where it fell over.

alaniwi commented 2 years ago

It also segfaults on the sci server.

agstephens commented 2 years ago

Note: the General Forward run has OpenMP enabled in its template:

OpenMP Options:
Use OpenMP?, Threads, Particle Update Threads, Output Group Threads
        Yes,      16,                      16,                    1

We might want to scale back the number of threads, maybe to 4 instead of 16 (see the sketch below).
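
A sketch of the adjusted block, assuming both thread counts are reduced and the template layout stays the same:

OpenMP Options:
Use OpenMP?, Threads, Particle Update Threads, Output Group Threads
        Yes,       4,                       4,                    1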

And we need to add this after the ulimit line:

# TO ADD
export OMP_STACKSIZE="32m"

In this file:

$ cat /gws/smf/j04/cedaproc/cedawps/swallow/20220617_1/model/run_name.sh
#!/bin/bash
. /gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/load_modules.sh
export LD_LIBRARY_PATH=/gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/lib:$LD_LIBRARY_PATH
ulimit -s unlimited
/gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/nameiii_64bit_par.exe "$@"
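
For reference, a sketch of run_name.sh with that change applied. Note that ulimit -s unlimited only raises the stack limit of the main thread; the stacks of OpenMP worker threads are sized by OMP_STACKSIZE, and the traceback above shows the crash in an OpenMP worker (the libiomp5.so frames).

#!/bin/bash
. /gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/load_modules.sh
export LD_LIBRARY_PATH=/gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/lib:$LD_LIBRARY_PATH
ulimit -s unlimited
# Per-thread stack size for OpenMP worker threads, which "ulimit -s" does not cover
export OMP_STACKSIZE="32m"
/gws/smf/j04/cedaproc/cedawps/swallow/files/20220617_1/model/nameiii_64bit_par.exe "$@"
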
alaniwi commented 1 year ago

This is now fixed.