FluidityProject / fluidity

Fluidity
http://fluidity-project.org
Other
365 stars 115 forks source link

Segfaults in flredecomp from longtests #315

Closed tmbgreaves closed 3 years ago

tmbgreaves commented 3 years ago

Longtests are showing reproduceable segfaults in flredecomp, so far exhibiting fairly quickly in:

Example from lagrangian_detectors_3d_2e5 :

[4b33659a299d:00363] *** Process received signal ***
[4b33659a299d:00363] Signal: Segmentation fault (11)
[4b33659a299d:00363] Signal code: Address not mapped (1)
[4b33659a299d:00363] Failing at address: 0x65
[4b33659a299d:00363] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fb4089ae3c0]
[4b33659a299d:00363] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e7a3)[0x7fb4075f77a3]
[4b33659a299d:00363] [ 2] /lib/x86_64-linux-gnu/libgfortran.so.5(+0x268295)[0x7fb408c3f295]
[4b33659a299d:00363] [ 3] /lib/x86_64-linux-gnu/libgfortran.so.5(+0x268574)[0x7fb408c3f574]
[4b33659a299d:00363] [ 4] flredecomp(__checkpoint_MOD_checkpoint_detectors+0x297c)[0x561d5660ca8c]
[4b33659a299d:00363] [ 5] flredecomp(__checkpoint_MOD_checkpoint_simulation+0x3cc6)[0x561d56610866]
[4b33659a299d:00363] [ 6] flredecomp(flredecomp+0x811)[0x561d566067e1]
[4b33659a299d:00363] [ 7] flredecomp(main+0x66f)[0x561d566049ef]
[4b33659a299d:00363] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb4074900b3]
[4b33659a299d:00363] [ 9] flredecomp(_start+0x2e)[0x561d56605f0e]
[4b33659a299d:00363] *** End of error message ***

Also seen in Actions, eg: https://github.com/FluidityProject/fluidity/runs/2495251930?check_suite_focus=true

Reproduce with:

docker run -it fluidity/actions:focal-686bb4ef5cd8e079edfff77598b7e55d7d117547 bash -il
git clone https://github.com/fluidityproject/longtests
./bin/testharness -f lagrangian_detectors_3d_2e5.xml
./bin/testharness -f Stokes_subduction_zone_vanKeken2008_TwoB.xml
drhodrid commented 3 years ago

Hi Tim,

There is no such thing as lagrangian_detectors these days - particles are typically lagrangian, with detectors being static. I suspect that’s why the lagrangian_detectors_3d_2e5 test is failing. I will take a look and try to update in the coming days. Where would you like me to push my changes to?

On the subduction zone case (another of mine, which I’v inherited!) - it seems to be failing when initialising fields in sub-state (i.e. over specific regions). I’ll need to tag @stephankramer here, as I think he’s come across a similar issue to this in another of our branches previously, where region ID’s are getting clobbered for some reason. Maybe I’m wrong and it’s completely unrelated, but worth asking!

R

On 4 May 2021, at 6:57 pm, Tim Greaves @.**@.>> wrote:

Longtests are showing reproduceable segfaults in flredecomp, so far exhibiting fairly quickly in:

Example from lagrangian_detectors_3d_2e5 :

[4b33659a299d:00363] Process received signal [4b33659a299d:00363] Signal: Segmentation fault (11) [4b33659a299d:00363] Signal code: Address not mapped (1) [4b33659a299d:00363] Failing at address: 0x65 [4b33659a299d:00363] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fb4089ae3c0] [4b33659a299d:00363] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e7a3)[0x7fb4075f77a3] [4b33659a299d:00363] [ 2] /lib/x86_64-linux-gnu/libgfortran.so.5(+0x268295)[0x7fb408c3f295] [4b33659a299d:00363] [ 3] /lib/x86_64-linux-gnu/libgfortran.so.5(+0x268574)[0x7fb408c3f574] [4b33659a299d:00363] [ 4] flredecomp(__checkpoint_MOD_checkpoint_detectors+0x297c)[0x561d5660ca8c] [4b33659a299d:00363] [ 5] flredecomp(checkpoint_MOD_checkpoint_simulation+0x3cc6)[0x561d56610866] [4b33659a299d:00363] [ 6] flredecomp(flredecomp+0x811)[0x561d566067e1] [4b33659a299d:00363] [ 7] flredecomp(main+0x66f)[0x561d566049ef] [4b33659a299d:00363] [ 8] /lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xf3)[0x7fb4074900b3] [4b33659a299d:00363] [ 9] flredecomp(_start+0x2e)[0x561d56605f0e] [4b33659a299d:00363] End of error message

Also seen in Actions, eg: https://github.com/FluidityProject/fluidity/runs/2495251930?check_suite_focus=true

Reproduce with:

docker run -it fluidity/actions:focal-686bb4ef5cd8e079edfff77598b7e55d7d117547 bash -il git clone https://github.com/fluidityproject/longtests ./bin/testharness -f lagrangian_detectors_3d_2e5.xml ./bin/testharness -f Stokes_subduction_zone_vanKeken2008_TwoB.xml

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/FluidityProject/fluidity/issues/315, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB25UKX3AZQSVUBF7IK5ZU3TL6ZINANCNFSM44CJROTA.

tmbgreaves commented 3 years ago

Thanks @drhodrid - feel free to push changes directly to the longtests repo and give me a nudge when ready to re-run. I'll open another issue with the full set of failing longtests/examples documented to give a central place for keeping track of them.

stephankramer commented 3 years ago

The segfault in flredecomp for Stokes_subduction_zone_vanKeken2008_TwoB was caused by fluidity's mishandling of python errors (see #321). The python error itself was complaining about mixed tab and space indentation which I fixed. It now complains in the final test because it tries to open a detectors file with stat_parser which is now written in hdf5, so that needs updating. @tmbgreaves : you seem to have figure out how to do that, would you mind doing the same here? I believe the same applies to lagrangian_detectors_3d_2e5

tmbgreaves commented 3 years ago

@stephankramer I fear I may not be that skilled, alas - I've been attempting fixes for the longtests which didn't get updated in the grand HDF5 changeover, but whilst I can read the files and get somewhere with the extraction of data I've not had much luck with figuring out how to pull data out of the HDF5 datasets.

I've got some way towards fixing Stokes_subduction_zone_vanKeken2008_OneA and wetting_and_drying_balzano1_dg_parallel as they had very similar tests in tests/ which were switched, but haven't managed to get either fully passing.

@angus-g - I think this is the tail end of the h5 changeover merge from a year or so ago, where all the tests in tests/ got fixed, but with the longtests not being run at the time they got missed. Might you be able to help here in switching the remaining tests across?

The list I'm aware of is one example:

hokkaido-nansei-oki_tsunami

and six longtests:

lagrangian_detectors_3d_2e5 Stokes_subduction_zone_vanKeken2008_TwoB square-convection-1e6 wetting_and_drying_balzano1_dg_parallel (partially hacked over to h5) lock_exchange_2d_Lagrangian_paths Stokes_subduction_zone_vanKeken2008_OneA (partially hacked over to h5)

stephankramer commented 3 years ago

Ah, yes sounds like @angus-g would be the more appropriate person

tmbgreaves commented 3 years ago

Alas, I am still seeing a segfault in lagrangian_detectors_3d_2e5 , see: https://github.com/FluidityProject/fluidity/runs/2633277347?check_suite_focus=true

stephankramer commented 3 years ago

This is a different segfault which appears to be caused by the fact that it uses an invalid schema (and so ends up in an unexpected situation inside flredecomp): the test still uses lagrangian particles (hint was in the name I guess) which no longer exists. These should be changed to particles, but then the test also needs updating with regards to how the detector/particle output is read in as you noticed above, so I'm hoping @angus-g can have a look at this.

angus-g commented 3 years ago

With https://github.com/FluidityProject/longtests/pull/2, I think the remaining failure is the hokkaido-nansei-oki_tsunami example. I kind of don't understand the point of the tests in it though -- the gage_error_integral function that's checked just seems to integrate the measurements in the raw_data/WaveGages.csv file, and doesn't even use the detectors (except for the timesteps, I guess?). I could do the bare minimum to get it to replicate that with the h5part file, but it seems a little pointless anyway?

stephankramer commented 3 years ago

Yeah I agree - that's a little unfortunate; it should really be testing the value of FreeSurface against those measurements - with some offset looking at what's in the plotting script - but it's kind of hard to justify spending much time on, so alternatively we could just retire it.

tmbgreaves commented 3 years ago

I'd be happy with the retiring option - if anyone wants it they can always look back in git history.

tmbgreaves commented 3 years ago

Closing this as the initial bug is fixed; discussion on dropping the tsunami example can be directed to PR #325