Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.
Apache License 2.0

netcdf writer: segmentation fault #258

Closed jdries closed 6 months ago

jdries commented 6 months ago

The webapp sometimes crashes with a segmentation fault in the HDF5 library. The hypothesis is that this happens when multiple netCDF files are written concurrently.

@bossie can you add the line from the segfault?

bossie commented 6 months ago

Last 4 crashes:

1: epod-openeo-1: FINISHED at Mon Feb 5 14:32:28 +0100 2024 = openeo-driver-0ff12a46d046.stdout

2: epod-openeo-2: FINISHED at Mon Feb 5 15:53:10 +0100 2024 = openeo-driver-f7eae1715bea.stdout

3 + 4:
epod-openeo-1: FINISHED at Mon Feb 5 16:40:09 +0100 2024 = openeo-driver-ec10ec7d12e7.stdout
epod-openeo-2: FINISHED at Mon Feb 5 16:52:34 +0100 2024 = openeo-driver-f21b6b99b520.stdout

1)

[thread 15259 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f583d3ade26, pid=60, tid=692844
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.14+9) (build 11.0.14+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.14+9-LTS, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# C  [libhdf5.so.103+0x21ae26]  H5SL_search+0x736
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /opt/core.60)
#
# An error report file with more information is saved as:
# /opt/hs_err_pid60.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%208&component=java-11-openjdk
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

2)

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fabf541671a, pid=60, tid=2154
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.14+9) (build 11.0.14+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.14+9-LTS, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# C  [libhdf5.so.103+0x21a71a]  H5SL_search+0x2a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /opt/core.60)
#
# An error report file with more information is saved as:
# /opt/hs_err_pid60.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%208&component=java-11-openjdk
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

3)

[thread 956 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fa0a5119e60, pid=60, tid=952
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.14+9) (build 11.0.14+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.14+9-LTS, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# C  [libhdf5.so.103+0x11de60]  H5FL_reg_malloc+0x40
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /opt/core.60)
#
# An error report file with more information is saved as:
# /opt/hs_err_pid60.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%208&component=java-11-openjdk
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

4)

[thread 518 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe1e421f508, pid=60, tid=641
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.14+9) (build 11.0.14+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.14+9-LTS, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# C  [libhdf5.so.103+0x9c508]  H5CX_pop+0x58
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /opt/core.60)
#
# An error report file with more information is saved as:
# /opt/hs_err_pid60.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%208&component=java-11-openjdk
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
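
For pulling the relevant line out of a driver stdout log like the ones listed above, a minimal sketch could look like the following. It assumes the crash banner has the shape shown in these four dumps, i.e. a `# Problematic frame:` line followed directly by the frame itself. The class name `ProblematicFrame` and its `find` method are hypothetical and not part of this repository.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: extracts the "Problematic frame" entry from JVM crash output. */
final class ProblematicFrame {
    private static final String MARKER = "# Problematic frame:";

    private ProblematicFrame() {}

    /**
     * Returns the line that follows the marker, with the leading "# " stripped,
     * or an empty Optional when the log contains no crash banner.
     */
    static Optional<String> find(List<String> lines) {
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (lines.get(i).trim().equals(MARKER)) {
                // The frame is printed on the line right after the marker.
                return Optional.of(lines.get(i + 1).replaceFirst("^#\\s*", "").trim());
            }
        }
        return Optional.empty();
    }
}
```

The input could come from `Files.readAllLines(Paths.get(...))` on one of the `.stdout` files. Only the first crash banner in the log is reported.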

jdries commented 6 months ago

The HDF5 library has an 'enable_threadsafe' flag, which is not set in the rpm. Enabling it would be a good first option.

jdries commented 6 months ago

I built a new HDF5 rpm with that flag set and triggered a new build of our docker image, so this should get picked up.

jdries commented 6 months ago

netcdf4 requires hdf5-hl (high level), which in turn is incompatible with the 'threadsafe' flag. The HDF5 people still recommend using this combination rather than doing multithreaded HDF5 writing: https://forum.hdfgroup.org/t/high-level-thread-safe/902/2
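
If multithreaded HDF5 writing is to be avoided regardless of how the library was built, one option on the JVM side is to funnel every netCDF write through a single process-wide lock. The sketch below only illustrates that idea under stated assumptions: `SerializedNativeWriter` and `withNativeLock` are hypothetical names, not existing code in this repository, and the actual write call would have to be passed in by the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical helper: runs calls into a non-thread-safe native library
 * one at a time, using a single lock shared by the whole JVM.
 */
final class SerializedNativeWriter {
    // One lock for the whole process, since the native library state is global.
    private static final ReentrantLock NATIVE_LOCK = new ReentrantLock();

    private SerializedNativeWriter() {}

    /** Executes the action while holding the lock and returns its result. */
    static <T> T withNativeLock(Callable<T> action) throws Exception {
        NATIVE_LOCK.lock();
        try {
            return action.call();
        } finally {
            // Always release, even if the write throws.
            NATIVE_LOCK.unlock();
        }
    }
}
```

A caller would wrap each write, for example `SerializedNativeWriter.withNativeLock(() -> writeFile(path))`, where `writeFile` stands in for whatever performs the netCDF write. This only helps if every code path that reaches the native library within the same process goes through the lock, and it trades write throughput for safety.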

jdries commented 6 months ago

A more threadsafe build is integrated in the docker image. Closing for now, but we'll have to see if it occurs again.