csiro-coasts / EMS

Environmental Modelling Suite
Other
15 stars 5 forks source link

Segfault in SHOC init step for a RECOM run with DO_ECOLOGY enabled #28

Closed sharon-tickell closed 4 months ago

sharon-tickell commented 5 months ago

I had originally thought that this must be the same issue previously reported in https://github.com/csiro-coasts/EMS/issues/26, but unfortunately it isn't :(.

I'm using the new v1.5.2 codebase for eReefs RECOM. It's working fine for a hydro-only run, but fails for any auto.prm file with DO_ECOLOGY on (i.e. BGC runs), even though those worked fine back with the v1.5.0 code. The failure happens right at the beginning of the init step (actual command = shoc -rg auto.prm). The runlog file is created but empty (there is no explanation in the log), and a core dump file (named core.<number> is created in the working directory.

If I use gdb to analyse the dump file, I discover that the stack trace is like so:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005595aa97df9a in print_trace () at control/main.c:379
379       hd_error("Segmentation violation detect (simulation time = %.4f days)\n", master->days);
(gdb) print master
$1 = (master_t *) 0x0
(gdb) backtrace -5
#9  0x00005595aab800c7 in ecology_pre_build (eco_vars=eco_vars@entry=0x7fc23267bea8 "", eco_defs=eco_defs@entry=0x7fc23267c6a8 "standard", fp=0x5595ac15b840) at ecology.c:599
#10 0x00005595aa9e8cb2 in read_ecology (params=0x7fc23264d010, fp=fp@entry=0x5595ac15b840, ntr=0x7fc232673d7c) at inputs/readparam.c:8805
#11 0x00005595aa9fa443 in auto_params (fp=fp@entry=0x5595ac15b840, autof=<optimized out>) at inputs/readparam.c:2638
#12 0x00005595aaa2024c in hd_init (prmfd=prmfd@entry=0x5595ac15b840) at master/hd_init.c:101
#13 0x00005595aa960085 in main (argc=<optimized out>, argv=0x7ffed9d4cbc8) at control/main.c:492

So the core dump actually happened inside the segfault handler, when the code tried to dereference the master variable which was NULL - which explains the lack of log message.

But the actual segfault trigger was the next step up the chain, in the ecology_pre_build function at ecology.c:599.

If I investigate the context and local variables at that point, I get the following info, but nothing is jumping out at me as a root cause of the problem:

(gdb) frame 9
#9  0x00005595aab800c7 in ecology_pre_build (eco_vars=eco_vars@entry=0x7fc23267bea8 "", eco_defs=eco_defs@entry=0x7fc23267c6a8 "standard", fp=0x5595ac15b840) at ecology.c:599
599       read_parameter_info(e->biofname, fp, NULL, quit, warn, quiet, &e->nprm, &e->pinfo);
(gdb) list
594         e->biofname = strdup(buf);
595       else
596         e_quit("biofname not found!\n");
597
598       /* Pass in null for process file name */
599       read_parameter_info(e->biofname, fp, NULL, quit, warn, quiet, &e->nprm, &e->pinfo);
600
601       // Set up _or_add functions so that we don't hit the host model
602       e->find_index = find_index_or_add;
603       e->try_index  = try_index_or_add;
(gdb) print e->biofname
$12 = 0x5595ac19e7d0 "BGC3p1"
(gdb) print fp
$13 = (FILE *) 0x5595ac15b840
(gdb) print *fp
$14 = {_flags = -72539000, _IO_read_ptr = 0x5595ac15d0ec "DO_ECOLOGY YES\nDO_SEDIMENTS YES\nECO_VARS_ATTS standard\nPROCESSFNAME BGC3p1\n\nSED_VARS Dust FineSed Gravel-carbonate Sand-carbonate Mud-carbonate Gravel-mineral Sand-mineral Mud-mineral\nSED_VARS_ATTS st"..., _IO_read_end = 0x5595ac15d343 "",
  _IO_read_base = 0x5595ac15cb10 "#\n# eReefs RECOM auto.prm template\n# ModelRun id:53\n#\n# GBR1_H2p0\ntimeunit seconds since 2010-01-01 00:00:00 +10\noutput_timeunit days since 2010-01-01 00:00:00 +10\n\n\nstart_time 4824.0 days\nstop_time 4"...,
  _IO_write_base = 0x5595ac15cb10 "#\n# eReefs RECOM auto.prm template\n# ModelRun id:53\n#\n# GBR1_H2p0\ntimeunit seconds since 2010-01-01 00:00:00 +10\noutput_timeunit days since 2010-01-01 00:00:00 +10\n\n\nstart_time 4824.0 days\nstop_time 4"...,
  _IO_write_ptr = 0x5595ac15cb10 "#\n# eReefs RECOM auto.prm template\n# ModelRun id:53\n#\n# GBR1_H2p0\ntimeunit seconds since 2010-01-01 00:00:00 +10\noutput_timeunit days since 2010-01-01 00:00:00 +10\n\n\nstart_time 4824.0 days\nstop_time 4"...,
  _IO_write_end = 0x5595ac15cb10 "#\n# eReefs RECOM auto.prm template\n# ModelRun id:53\n#\n# GBR1_H2p0\ntimeunit seconds since 2010-01-01 00:00:00 +10\noutput_timeunit days since 2010-01-01 00:00:00 +10\n\n\nstart_time 4824.0 days\nstop_time 4"...,
  _IO_buf_base = 0x5595ac15cb10 "#\n# eReefs RECOM auto.prm template\n# ModelRun id:53\n#\n# GBR1_H2p0\ntimeunit seconds since 2010-01-01 00:00:00 +10\noutput_timeunit days since 2010-01-01 00:00:00 +10\n\n\nstart_time 4824.0 days\nstop_time 4"..., _IO_buf_end = 0x5595ac15eb10 "", _IO_save_base = 0x0, _IO_backup_base = 0x0, _IO_save_end = 0x0,
  _markers = 0x0, _chain = 0x7fc235bfc5c0 <_IO_2_1_stderr_>, _fileno = 3, _flags2 = 0, _old_offset = 29541, _cur_column = 0, _vtable_offset = 0 '\000', _shortbuf = "", _lock = 0x5595ac15b920, _offset = 2099, _codecvt = 0x7165, _wide_data = 0x5595ac15b930, _freeres_list = 0x0, _freeres_buf = 0x616c6f7369080020, __pad5 = 6579572, _mode = -1,
  _unused2 = "\000\000\000\000b\034Ep\000\000\000\000 \000\bkerne"}
(gdb) print quit
$15 = (errfn) 0x5595aac37870 <errfn_quit_default>
(gdb) print warn
$16 = (errfn) 0x5595aac377b0 <errfn_warn_default>
(gdb) print quiet
$17 = (errfn) 0x5595aac377a0 <errfn_quiet>
(gdb) print e
$18 = (ecology *) 0x5595ac15c860
(gdb) print e->nprm
$19 = 143
(gdb) print e->pinfo
$20 = (parameter_info *) 0x5595ac1a0f80
(gdb) print *e->pinfo
$21 = {index = 0, name = 0x5595ac19e790 "Tref", desc = 0x5595ac19e770 "Reference temperature", sym = 0x5595ac19e810 "T_{ref}", units = 0x5595ac19e7b0 "Deg C", value = 0x5595ac19e830, num_values = 1, stderr = 0, stringvalue = 0x5595ac1a0920 "2.000000e+01 ", ref = 0x5595ac19e850 " "}
(gdb)
sharon-tickell commented 5 months ago

If I fix the print_trace function to not attempt to dereference a null pointer, then the attempt to run shoc -rg auto.prm spits out the following to both stdout and stderr:

[2024/03/23 03:16:27]-[ERROR ]() Segmentation violation detect (unknown simulation time)
[2024/03/23 03:16:27]-[ERROR ]() Stack trace:
[2024/03/23 03:16:27]-[ERROR ]()  [0] /lib/x86_64-linux-gnu/libc.so.6(+0x1677d8) [0x7f937b09e7d8]
[2024/03/23 03:16:27]-[ERROR ]()  [1] /lib/x86_64-linux-gnu/libc.so.6(__strdup+0xe) [0x7f937afd58de]
[2024/03/23 03:16:27]-[ERROR ]()  [2] shoc(+0x26bfe1) [0x55fd41a12fe1]
[2024/03/23 03:16:27]-[ERROR ]()  [3] shoc(read_parameter_info+0x649) [0x55fd41a12b79]
[2024/03/23 03:16:27]-[ERROR ]()  [4] shoc(ecology_pre_build+0xc6) [0x55fd41a10a26]
[2024/03/23 03:16:27]-[ERROR ]()  [5] shoc(read_ecology+0x172) [0x55fd4187bf32]
[2024/03/23 03:16:27]-[ERROR ]()  [6] shoc(auto_params+0x1791) [0x55fd4188d641]
[2024/03/23 03:16:27]-[ERROR ]()  [7] shoc(hd_init+0x2c) [0x55fd418b12dc]
[2024/03/23 03:16:27]-[ERROR ]()  [8] shoc(main+0x169) [0x55fd417f4099]
[2024/03/23 03:16:27]-[ERROR ]()  [9] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f937af5e24a]
[2024/03/23 03:16:27]-[ERROR ]()  [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f937af5e305]
[2024/03/23 03:16:27]-[ERROR ]()  [11] shoc(_start+0x21) [0x55fd417f43b1]
sharon-tickell commented 5 months ago

Experimenting with changing values in the auto.prm file: The issue is something to do with the BIOFNAME BGC3p1 setting. Including that causes the segfault, but switching it to another supported value (like porewater does NOT).

If I switch to using BGC2p0 or TASSE1p0 for both the BIOFNAME and the PROCESSFNAME, or to porewater and porewater_age respectively then I get NO segfault, but do get an error [FATAL ]() density_w: Tracers 'salt' and 'temp' must both be present. In both cases, files named bio_<BIOFNAME>.prm and processes_<PROCESSFNAME>.prm get created in the working directory.

sharon-tickell commented 5 months ago

Double checking that this isn't an issue with the subsetted forcing data: I re-ran the init step for an old successful RECOM BGC run with the new EMS code and got the same result.

sharon-tickell commented 5 months ago

Via the old-school method of injecting many print statements to find out where execution got up to, it appears that this line is the root cause: https://github.com/csiro-coasts/EMS/blob/37939b4c44705f20d74304ab8504f85154541414/model/lib/ecology/parameter_defaults.c#L2141

The defaults for the xco2_in_air parameter have been commented out at some point, but the counter is still incremented. That means the resulting parameters[3] entry is completely uninitialised, parameters[3]->value[0] is NaN, and we get a segfault as soon as the assign_string_values code attempts to access parameters[3]->stringvalue at https://github.com/csiro-coasts/EMS/blob/37939b4c44705f20d74304ab8504f85154541414/model/lib/ecology/parameter_defaults.c#L5584

sharon-tickell commented 5 months ago

Checking when this changed...

The xco2_in_air parameter was commented out along with the counter increment in this EMS v1.2.1 commit: https://github.com/csiro-coasts/EMS/commit/c2214d415f8bb91d45f91ed7a4fcf4002182e250#diff-774cde3e85b0d50efb092b03764ba543599d0926178b6cb5ed54957b57ec093c

The counter increments were moved to their own lines in this commit for the v1.5.0 release from SVN rev 7384 https://github.com/csiro-coasts/EMS/commit/ab85d5e695d56447635ec27a321d42b0463bead9#diff-774cde3e85b0d50efb092b03764ba543599d0926178b6cb5ed54957b57ec093c. The xco2_in_air counter-increment was NOT commented out, even though the rest of the initialization lines were.

Our successful RECOM BGC runs used EMS built from SVN at r7072 which is earlier than the v1.5.0 release, and did not have this problem.

sharon-tickell commented 5 months ago

Fixed by https://github.com/csiro-coasts/EMS/pull/29

frizwi commented 4 months ago

Fixed in release 1.5.3