segfault in sediment test for v1.5.0

sharon-tickell commented 11 months ago

I'm attempting to build ems v1.5.0 from this repository for RECOM, running all the tests as part of the build.

In model/tests/hd/test7/run_test7, in the "Testing 'z' model..." section, the sediment test is generating a segfault. The test output looks like:

```
ems-dev  | Running sediment test, takes ~ 3 minutes....
ems-dev  |              SHOC: Sparse Hydrodynamic Ocean Code
ems-dev  | EMS Version: v1.5.0
ems-dev  | Run start:   Thu Nov  9 14:48:33 2023
ems-dev  | 
ems-dev  | sed_init.c: ncol = 1200, np = 36 
ems-dev  |              SHOC: Sparse Hydrodynamic Ocean Code
ems-dev  | EMS Version: v1.5.0
ems-dev  | Run start:   Thu Nov  9 14:48:33 2023
ems-dev  | 
ems-dev  | sed_init.c: ncol = 1200, np = 36 
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]() Segmentation violation detect (simulation time = 0.0139 days)
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]() Stack trace:
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [0] shoc(radiation_stress+0x49) [0x55d134e2a989]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [1] shoc(wave_interface_step+0x82) [0x55d134be9072]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [2] shoc(auxiliary_routines+0x3f8) [0x55d134d49948]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [3] shoc(tracer_step_3d+0x145) [0x55d134d4db45]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [4] shoc(tracer_step_window+0xa8) [0x55d134d4dd08]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [5] shoc(+0x6819b) [0x55d134ba019b]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [6] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x42) [0x7f9e95a704c2]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [7] shoc(dp_tracer_step+0x16) [0x55d134ba09f6]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [8] shoc(tracer_step+0x198) [0x55d134d47ce8]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [9] shoc(hd_step+0x320) [0x55d134c42350]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [10] shoc(main+0x321) [0x55d134b84231]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f9e9586fd0a]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [12] shoc(_start+0x2a) [0x55d134b843aa]
```
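
For anyone comparing traces between builds, the frames above can be pulled out of the log with a few lines of Python. This is only a sketch: it assumes the frame format shown in this output (`[N] binary(symbol+0xOFFSET) [address]`), and the names `FRAME_RE` and `parse_frames` are made up for illustration, not part of EMS.

```python
import re

# Matches frames such as "[0] shoc(radiation_stress+0x49) [0x55d134e2a989]".
# Timestamps like "[2023/10/09 14:48:34]" are skipped because they are not
# purely numeric between the brackets.
FRAME_RE = re.compile(
    r"\[(?P<index>\d+)\]\s+(?P<binary>\S+?)"
    r"\((?P<symbol>[^+()]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
)


def parse_frames(log_text):
    """Return (index, binary, symbol, offset) for every stack frame found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            # Frames without a symbol name, e.g. "shoc(+0x6819b)", are kept
            # and marked as unknown rather than dropped.
            symbol = match.group("symbol") or "<unknown>"
            frames.append(
                (
                    int(match.group("index")),
                    match.group("binary"),
                    symbol,
                    int(match.group("offset"), 16),
                )
            )
    return frames


# A few lines copied from the trace in this issue.
SAMPLE = """\
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]() Stack trace:
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [0] shoc(radiation_stress+0x49) [0x55d134e2a989]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [1] shoc(wave_interface_step+0x82) [0x55d134be9072]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [5] shoc(+0x6819b) [0x55d134ba019b]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [6] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x42) [0x7f9e95a704c2]
"""

frames = parse_frames(SAMPLE)
# Print the call chain from the outermost parsed frame to the crashing one.
print(" -> ".join(symbol for _, _, symbol, _ in reversed(frames)))
```

The innermost frame (index 0) is where the fault was raised, so in the full trace the crash is inside `radiation_stress`, reached via `wave_interface_step`.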

If I install the old v1.4.0 rev(7072) version of EMS from subversion in the exact same environment with the exact same build configuration, then all the tests pass with no errors or segfaults.

The environment is Debian 11 (the python:3.11-slim-bullseye base image) with:

- libopenmpi-dev, libdap-dev, proj-bin and gdal-bin installed from apt packages
- hdf5-1.12.2 built from source with --enable-parallel --enable-threadsafe --enable-unsupported
- netcdf-c-4.8.1 built from source with export CPATH=/usr/lib/x86_64-linux-gnu/openmpi/include

frizwi commented 8 months ago

This is now fixed in the dev branch - @sharon-tickell, is there already a pipeline in GitLab (or some other CI/CD server) that can be triggered to test?

sharon-tickell commented 8 months ago

@frizwi : Our internal CI pipeline is not yet set up to run EMS tests, as they haven't been succeeding: bit of a chicken-and-egg situation there :/

However: please see https://github.com/csiro-coasts/EMS/pull/25, for a suggested addition of containerised build-and-test support for EMS. This is derived from the test-scripts I used to discover this issue in the first place, and if you don't already have a run-everything test-harness for EMS, perhaps you might like this one?
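
To illustrate what a "run-everything" harness boils down to, here is a minimal Python sketch. It is not the code in the pull request above: the names `run_all` and `summarise` are hypothetical, and it assumes each test command signals failure through a non-zero exit status, which may not hold for every EMS test script.

```python
import subprocess
import sys


def run_all(commands, timeout=600):
    """Run each command in turn; return {name: exit status or "timeout"}."""
    results = {}
    for name, argv in commands.items():
        try:
            completed = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
            continue
        results[name] = completed.returncode
    return results


def summarise(results):
    """Turn raw exit statuses into one human-readable line per test."""
    lines = []
    for name, status in sorted(results.items()):
        if status == 0:
            verdict = "PASS"
        elif status == "timeout":
            verdict = "TIMEOUT"
        elif status < 0:
            # On POSIX, a negative return code means the process was killed
            # by that signal number (e.g. 11 for SIGSEGV).
            verdict = f"KILLED by signal {-status}"
        else:
            verdict = f"FAIL (exit {status})"
        lines.append(f"{name}: {verdict}")
    return lines


# Stand-in commands so the sketch runs anywhere; a real harness would list
# the actual test scripts (for example the run_test7 script) here instead.
demo_commands = {
    "always_passes": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "always_fails": [sys.executable, "-c", "import sys; sys.exit(3)"],
}

for line in summarise(run_all(demo_commands)):
    print(line)
```

Running every test and summarising at the end, rather than stopping at the first failure, is what makes it possible to see all the failures in one pass.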

As of now, if I run all the tests, the results show:

The segfault that was reported above in model/tests/hd/test7/run_test7 is indeed fixed: that test is now passing.
Unfortunately, model/tests/hd/test7/run_test3 is now failing with an error like "double free or corruption (out)" and aborts with a core dump :(
There are also failures in several of the non-hd tests, which I suspect may be due to errors in the test scripts. I'm personally less fussed about those (for now), but it would be nice if the tests released with the codebase were known to work, as it's difficult to tell if these represent actual problems or not:
- model/tests/hd-us/closed/run_closed fails with [FATAL ]() Can't map edge 265 to the surface. Check the runlog. The runlog says:
```
[2024/01/06 04:05:30]-[warn  ]() No bathymetry provided in parameter file.
[2024/01/06 04:05:30]-[warn  ]() Structured interpolation of bathy file closed_bathy.nc
[2024/01/06 04:05:30]-[warn  ]() Can't map edge between cells 1[1275000.000000 606217.782649] bathy = -9999.000000 and
[2024/01/06 04:05:30]-[warn  ]()                              0[0.000000 0.000000] bathy = 0.000000.
[2024/01/06 04:05:30]-[warn  ]() Check that bathy or try different BATHY_INTERP_RULE!
[2024/01/06 04:05:30]-[FATAL ]() Can't map edge 265 to the surface. Check the runlog.
```
- model/tests/hd-us/est/run_est fails with an error like [FATAL ]() Sponge zone for offshore ( 8.00 m) is less than mean grid size (500.00 m). Increase sponge zone.
- model/tests/hd-us/test2/open/run_open and model/tests/hd-us/test2/closed/run_closed both fail with errors like [FATAL ]() read_bathy: incorrect BATHY data : 7 (requires 0)
- model/tests/sediments/test1/run_all.sh, model/tests/sediments/test2/run_all.sh and model/tests/sediments/test3/run_all.sh all fail with errors like [FATAL ]() hd_ts_multifile_eval_sparse: The dump file 'trans.mnc(u1=u1mean)(u2=u2mean)(w=wmean)(Kz=Kzmean)(u1vm=u1vmean)(u2vm=u2vmean)' does not contain the time 865800.00.
- model/tests/tracerstats/basic/run_test fails with [FATAL ](sed:sed_init:sed_tracers_init) At least one volumetric particulate tracer must be specified

I've attached the test logfile in case that's of use: ems_test.log
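
For triaging a log like this, a short Python sketch can pull out just the FATAL and ERROR lines. It assumes the attached log uses the same `[timestamp]-[LEVEL ](context) message` layout as the excerpts quoted above; `LOG_RE` and `collect_messages` are illustrative names, not EMS tooling.

```python
import re
from collections import Counter

# The timestamp prefix is optional because some messages are quoted
# without one, e.g. "[FATAL ](sed:sed_init:sed_tracers_init) ...".
LOG_RE = re.compile(
    r"(?:\[(?P<stamp>[^\]]+)\]-)?"
    r"\[(?P<level>[A-Za-z]+)\s*\]"
    r"\((?P<context>[^)]*)\)\s*"
    r"(?P<message>.*)"
)


def collect_messages(log_text, levels=("FATAL", "ERROR")):
    """Return (level, context, message) for each line at the given levels."""
    found = []
    for line in log_text.splitlines():
        match = LOG_RE.search(line)
        if match and match.group("level").upper() in levels:
            found.append(
                (
                    match.group("level").upper(),
                    match.group("context"),
                    match.group("message").strip(),
                )
            )
    return found


# Lines copied from the excerpts in this comment.
SAMPLE_LOG = """\
[2024/01/06 04:05:30]-[warn  ]() No bathymetry provided in parameter file.
[2024/01/06 04:05:30]-[FATAL ]() Can't map edge 265 to the surface. Check the runlog.
[FATAL ](sed:sed_init:sed_tracers_init) At least one volumetric particulate tracer must be specified
"""

messages = collect_messages(SAMPLE_LOG)
counts = Counter(level for level, _, _ in messages)
for level, context, message in messages:
    print(f"{level} [{context or '-'}] {message}")
```

Warnings are skipped by default; passing `levels=("FATAL", "ERROR", "WARN")` would include them as well.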

sharon-tickell commented 8 months ago

@frizwi : after some more careful testing, the core dump I reported above is NOT a new regression: it also happens in v1.4 (r7072 from SVN). So that probably shouldn't stop your dev branch from being merged, since it's still a definite improvement, and enough of one that I am OK to try switching RECOM to the incipient v1.5.2.

sharon-tickell commented 6 months ago

Testing again with the new v1.5.2 release and the same OS and library versions as I was using when the ticket was raised:

All hd tests are now passing, including model/tests/hd/test7/run_test7 (which the original ticket was raised for) and model/tests/hd/test7/run_test3, which was an issue in v1.5.1.

I'm still seeing some test failures for some hd-us and sediments tests, but none of those are segfaulting and they are a separate issue regardless.

I'll call this one closed - thanks for getting those fixes in there!

csiro-coasts / EMS

segfault in sediment test for v1.5.0 #23