sharon-tickell closed 4 months ago
If I fix the print_trace function to not attempt to dereference a null pointer, then the attempt to run shoc -rg auto.prm
spits out the following to both stdout and stderr:
[2024/03/23 03:16:27]-[ERROR ]() Segmentation violation detect (unknown simulation time)
[2024/03/23 03:16:27]-[ERROR ]() Stack trace:
[2024/03/23 03:16:27]-[ERROR ]() [0] /lib/x86_64-linux-gnu/libc.so.6(+0x1677d8) [0x7f937b09e7d8]
[2024/03/23 03:16:27]-[ERROR ]() [1] /lib/x86_64-linux-gnu/libc.so.6(__strdup+0xe) [0x7f937afd58de]
[2024/03/23 03:16:27]-[ERROR ]() [2] shoc(+0x26bfe1) [0x55fd41a12fe1]
[2024/03/23 03:16:27]-[ERROR ]() [3] shoc(read_parameter_info+0x649) [0x55fd41a12b79]
[2024/03/23 03:16:27]-[ERROR ]() [4] shoc(ecology_pre_build+0xc6) [0x55fd41a10a26]
[2024/03/23 03:16:27]-[ERROR ]() [5] shoc(read_ecology+0x172) [0x55fd4187bf32]
[2024/03/23 03:16:27]-[ERROR ]() [6] shoc(auto_params+0x1791) [0x55fd4188d641]
[2024/03/23 03:16:27]-[ERROR ]() [7] shoc(hd_init+0x2c) [0x55fd418b12dc]
[2024/03/23 03:16:27]-[ERROR ]() [8] shoc(main+0x169) [0x55fd417f4099]
[2024/03/23 03:16:27]-[ERROR ]() [9] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f937af5e24a]
[2024/03/23 03:16:27]-[ERROR ]() [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f937af5e305]
[2024/03/23 03:16:27]-[ERROR ]() [11] shoc(_start+0x21) [0x55fd417f43b1]
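For readers following along, the kind of null guard mentioned at the top of this comment can be sketched roughly as below. This is only an illustration, not the actual EMS patch: `master_t`, its `t` field, and `format_crash_header` are hypothetical names invented for this example, and the message text is only modelled on the log output above.

```c
#include <stdio.h>

/* Hypothetical stand-in for the model's master structure. */
typedef struct {
    double t; /* current simulation time */
} master_t;

/*
 * Format the first line of a crash report into buf.
 * If master is NULL (for example, the crash happened before the model
 * was fully initialised), report an unknown simulation time instead of
 * dereferencing the pointer.
 * Returns the value returned by snprintf.
 */
static int format_crash_header(char *buf, size_t len, const master_t *master)
{
    if (master == NULL)
        return snprintf(buf, len,
                        "Segmentation violation (unknown simulation time)");

    return snprintf(buf, len,
                    "Segmentation violation (simulation time = %.1f)",
                    master->t);
}
```

Calling `format_crash_header(buf, sizeof buf, NULL)` takes the fallback branch rather than crashing a second time inside the handler. Note that `snprintf` is not on the POSIX list of async-signal-safe functions, so a production handler would need more care than this sketch shows.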
Experimenting with changing values in the auto.prm file: the issue is something to do with the BIOFNAME BGC3p1 setting. Including that causes the segfault, but switching it to another supported value (like porewater) does NOT.
If I switch to using BGC2p0 or TASSE1p0 for both the BIOFNAME and the PROCESSFNAME, or to porewater and porewater_age respectively, then I get NO segfault, but do get an error: [FATAL ]() density_w: Tracers 'salt' and 'temp' must both be present. In both cases, files named bio_<BIOFNAME>.prm and processes_<PROCESSFNAME>.prm get created in the working directory.
Double checking that this isn't an issue with the subsetted forcing data: I re-ran the init step for an old successful RECOM BGC run with the new EMS code and got the same result.
Via the old-school method of injecting many print statements to find out where execution got up to, it appears that this line is the root cause: https://github.com/csiro-coasts/EMS/blob/37939b4c44705f20d74304ab8504f85154541414/model/lib/ecology/parameter_defaults.c#L2141
The defaults for the xco2_in_air parameter have been commented out at some point, but the counter is still incremented. That means the resulting parameters[3] entry is completely uninitialised, parameters[3]->value[0] is NaN, and we get a segfault as soon as the assign_string_values code attempts to access parameters[3]->stringvalue at https://github.com/csiro-coasts/EMS/blob/37939b4c44705f20d74304ab8504f85154541414/model/lib/ecology/parameter_defaults.c#L5584
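The pattern is easier to see in a stripped-down form. The sketch below is not the EMS code: `param_t`, `fill_defaults`, `first_unfilled`, and the `param_a`..`param_d` names and values are all made up for illustration, and the table is zero-initialised so that the skipped slot is detectably NULL (in the real code it would hold arbitrary garbage).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an ecology parameter record. */
typedef struct {
    const char *name;        /* NULL means the slot was never filled in */
    const char *stringvalue;
} param_t;

enum { MAX_PARAMS = 8 };

/*
 * Fill a zeroed table with defaults and return the slot count.
 * When keep_stray_increment is non-zero, the fourth parameter's
 * assignments are skipped but its counter increment still runs,
 * mimicking the pattern described above.
 */
static int fill_defaults(param_t *p, int keep_stray_increment)
{
    int n = 0;

    p[n].name = "param_a"; p[n].stringvalue = "1.0";
    n++;
    p[n].name = "param_b"; p[n].stringvalue = "2.0";
    n++;
    p[n].name = "param_c"; p[n].stringvalue = "3.0";
    n++;

    /* p[n].name = "xco2_in_air"; p[n].stringvalue = "(default)"; */
    if (keep_stray_increment)
        n++; /* the increment that should have been commented out too */

    p[n].name = "param_d"; p[n].stringvalue = "4.0";
    n++;

    return n;
}

/* Return the index of the first counted-but-unfilled slot, or -1 if none. */
static int first_unfilled(const param_t *p, int n)
{
    for (int i = 0; i < n; i++)
        if (p[i].name == NULL || p[i].stringvalue == NULL)
            return i;
    return -1;
}
```

With the stray increment kept, `fill_defaults` reports five slots and `first_unfilled` points at index 3, which is the slot that later code would try to copy a string from. Without it, four slots are reported and every one is populated.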
Checking when this changed...
The xco2_in_air parameter was commented out along with the counter increment in this EMS v1.2.1 commit: https://github.com/csiro-coasts/EMS/commit/c2214d415f8bb91d45f91ed7a4fcf4002182e250#diff-774cde3e85b0d50efb092b03764ba543599d0926178b6cb5ed54957b57ec093c
The counter increments were moved to their own lines in this commit for the v1.5.0 release from SVN rev 7384 https://github.com/csiro-coasts/EMS/commit/ab85d5e695d56447635ec27a321d42b0463bead9#diff-774cde3e85b0d50efb092b03764ba543599d0926178b6cb5ed54957b57ec093c. The xco2_in_air counter-increment was NOT commented out, even though the rest of the initialization lines were.
Our successful RECOM BGC runs used EMS built from SVN at r7072 which is earlier than the v1.5.0 release, and did not have this problem.
Fixed in release 1.5.3
I had originally thought that this must be the same issue previously reported in https://github.com/csiro-coasts/EMS/issues/26, but unfortunately it isn't :(.
I'm using the new v1.5.2 codebase for eReefs RECOM. It's working fine for a hydro-only run, but fails for any auto.prm file with DO_ECOLOGY on (i.e. BGC runs), even though those worked fine back with the v1.5.0 code. The failure happens right at the beginning of the init step (actual command = shoc -rg auto.prm). The runlog file is created but empty (there is no explanation in the log), and a core dump file (named core.<number>) is created in the working directory.
If I use gdb to analyse the dump file, I discover that the stack trace is like so:
So the core dump actually happened inside the segfault handler, when the code tried to dereference the master variable, which was NULL. That explains the lack of a log message.
But the actual segfault trigger was the next step up the chain, in the ecology_pre_build function at ecology.c:599.
If I investigate the context and local variables at that point, I get the following info, but nothing is jumping out at me as a root cause of the problem: