csiro-coasts / EMS

Environmental Modelling Suite

segfault in sediment test for v1.5.0 #23

Closed sharon-tickell closed 6 months ago

sharon-tickell commented 11 months ago

I'm attempting to build EMS v1.5.0 from this repository for RECOM, running all the tests as part of the build.

In model/tests/hd/test7/run_test7, in the "Testing 'z' model..." section, the sediment test is generating a segfault. The test output looks like:

ems-dev  | Running sediment test, takes ~ 3 minutes....
ems-dev  |              SHOC: Sparse Hydrodynamic Ocean Code
ems-dev  | EMS Version: v1.5.0
ems-dev  | Run start:   Thu Nov  9 14:48:33 2023
ems-dev  | 
ems-dev  | sed_init.c: ncol = 1200, np = 36 
ems-dev  |              SHOC: Sparse Hydrodynamic Ocean Code
ems-dev  | EMS Version: v1.5.0
ems-dev  | Run start:   Thu Nov  9 14:48:33 2023
ems-dev  | 
ems-dev  | sed_init.c: ncol = 1200, np = 36 
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]() Segmentation violation detect (simulation time = 0.0139 days)
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]() Stack trace:
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [0] shoc(radiation_stress+0x49) [0x55d134e2a989]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [1] shoc(wave_interface_step+0x82) [0x55d134be9072]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [2] shoc(auxiliary_routines+0x3f8) [0x55d134d49948]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [3] shoc(tracer_step_3d+0x145) [0x55d134d4db45]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [4] shoc(tracer_step_window+0xa8) [0x55d134d4dd08]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [5] shoc(+0x6819b) [0x55d134ba019b]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [6] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x42) [0x7f9e95a704c2]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [7] shoc(dp_tracer_step+0x16) [0x55d134ba09f6]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [8] shoc(tracer_step+0x198) [0x55d134d47ce8]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [9] shoc(hd_step+0x320) [0x55d134c42350]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [10] shoc(main+0x321) [0x55d134b84231]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f9e9586fd0a]
ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [12] shoc(_start+0x2a) [0x55d134b843aa]
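
For triaging logs like the one above, here is a minimal sketch (not part of EMS) that pulls the frame index, binary and symbol name out of each stack trace line. It assumes the `[n] binary(symbol+0xoffset) [address]` frame layout seen in this log; the names `FRAME_RE` and `parse_frames` are hypothetical helpers introduced for illustration only.

```python
import re

# Matches frames such as:
#   [0] shoc(radiation_stress+0x49) [0x55d134e2a989]
# Frames with no symbol name, e.g. "shoc(+0x6819b)", give an empty symbol.
# The frame layout is an assumption taken from the log in this ticket.
FRAME_RE = re.compile(
    r"\[(?P<index>\d+)\]\s+(?P<binary>\S+?)"
    r"\((?P<symbol>[^+)]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
)


def parse_frames(log_text):
    """Return a list of (index, binary, symbol) tuples, one per frame line.

    symbol is None when the trace line carries only an offset.
    Lines that do not look like stack frames are ignored.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (
                    int(match.group("index")),
                    match.group("binary"),
                    match.group("symbol") or None,
                )
            )
    return frames


if __name__ == "__main__":
    sample = (
        "ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [0] "
        "shoc(radiation_stress+0x49) [0x55d134e2a989]\n"
        "ems-dev  | [2023/10/09 14:48:34]-[ERROR ]()  [1] "
        "shoc(wave_interface_step+0x82) [0x55d134be9072]\n"
    )
    for index, binary, symbol in parse_frames(sample):
        print(index, binary, symbol)
```

Run against a saved test log, the first tuple returned identifies the innermost frame (here `radiation_stress`), which is usually the most useful thing to quote when reporting a crash.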

If I install the old v1.4.0 rev(7072) version of EMS from subversion in the exact same environment with the exact same build configuration, then all the tests pass with no errors or segfaults.

The environment is Debian 11 (the python:3.11-slim-bullseye base image) with:

frizwi commented 8 months ago

This is now fixed in the dev branch - @sharon-tickell, is there already a pipeline in GitLab (or some other CI/CD server) that can be triggered to test?

sharon-tickell commented 8 months ago

@frizwi : Our internal CI pipeline is not yet set up to run EMS tests, as they haven't been succeeding: bit of a chicken-and-egg situation there :/

However: please see https://github.com/csiro-coasts/EMS/pull/25, for a suggested addition of containerised build-and-test support for EMS. This is derived from the test-scripts I used to discover this issue in the first place, and if you don't already have a run-everything test-harness for EMS, perhaps you might like this one?

As of now, if I run all the tests, the results show:

I've attached the test logfile in case that's of use: ems_test.log

sharon-tickell commented 8 months ago

@frizwi : after some more careful testing, that core dump I reported above is NOT a new regression: it also happens in v1.4 (r7072 from SVN). So that probably shouldn't stop your dev branch from being merged, since it's still a definite improvement, and enough that I am OK to try switching RECOM to the incipient v1.5.2

sharon-tickell commented 6 months ago

Testing again with the new v1.5.2 release and the same OS and library versions as I was using when the ticket was raised:

All hd tests are now passing, including model/tests/hd/test7/run_test7 (which the original ticket was raised for) and model/tests/hd/test7/run_test3, which was an issue in v1.5.1.

I'm still seeing some test failures for some hd-us and sediments tests, but none of those are segfaulting and they are a separate issue regardless.

I'll call this one closed - thanks for getting those fixes in there!