SIGSEGV errors while ingesting realtime data #12376

Open donatelloOo opened 5 months ago

donatelloOo commented 5 months ago

While ingesting real-time data into a single table, we periodically get fatal SIGSEGV errors on all servers.


Table is composed of:

Below is a short extract of the core dump log (full one is attached).

# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0x00007f9ba3e89056, pid=1, tid=247
# JRE version: OpenJDK Runtime Environment Corretto- ( (build
# Java VM: OpenJDK 64-Bit Server VM Corretto- (, mixed mode, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 3816 c2 org.apache.pinot.segment.local.segment.readers.PinotSegmentColumnReader.getValue(I)Ljava/lang/Object; (648 bytes) @ 0x00007f9ba3e89056 [0x00007f9ba3e88f20+0x0000000000000136]
# Core dump will be written. Default location: /opt/pinot/core.1
# If you would like to submit a bug report, please visit:
---------------  S U M M A R Y ------------

Command Line: -Xms512M -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ErrorFileToStdout -XX:-TieredCompilation -Xlog:gc*:file=/opt/pinot/gc-pinot-server.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml -Dlog4j2.configurationFile=/opt/pinot/etc/conf/pinot-server-log4j2.xml -Dplugins.dir=/opt/pinot/plugins -Dplugins.dir=/opt/pinot/plugins -Dapp.name=pinot-admin -Dapp.pid=1 -Dapp.repo=/opt/pinot/lib -Dapp.home=/opt/pinot -Dbasedir=/opt/pinot org.apache.pinot.tools.admin.PinotAdministrator StartServer -clusterName pinot -zkAddress pinot-zookeeper:2181 -configFileName /var/pinot/server/config/pinot-server.conf

Host: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz, 16 cores, 32G, Amazon Linux release 2 (Karoo)
Time: Tue Feb  6 12:41:51 2024 UTC elapsed time: 2970.773692 seconds (0d 0h 49m 30s)

---------------  T H R E A D  ---------------

Current thread (0x00007f9a04062800):  JavaThread "obf_50d04beb9d1d306c5a5e45167656a565abc036da0763183eb9f074402fbd4f2c__0__3__20240206T1214Z" daemon [_thread_in_Java, id=247, stack(0x00007f9a264f6000,0x00007f9a265f7000)]

Stack: [0x00007f9a264f6000,0x00007f9a265f7000],  sp=0x00007f9a265f55e0,  free space=1021k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 3816 c2 org.apache.pinot.segment.local.segment.readers.PinotSegmentColumnReader.getValue(I)Ljava/lang/Object; (648 bytes) @ 0x00007f9ba3e89056 [0x00007f9ba3e88f20+0x0000000000000136]
J 3819 c2 org.apache.pinot.segment.local.segment.readers.PinotSegmentRecordReader.getRecord(ILorg/apache/pinot/spi/data/readers/GenericRow;)V (106 bytes) @ 0x00007f9ba3e726ac [0x00007f9ba3e72080+0x000000000000062c]
J 4599 c2 org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build()V (376 bytes) @ 0x00007f9ba3dc04c4 [0x00007f9ba3dbfc60+0x0000000000000864]
j  org.apache.pinot.segment.local.realtime.converter.RealtimeSegmentConverter.build(Lorg/apache/pinot/segment/spi/creator/SegmentVersion;Lorg/apache/pinot/common/metrics/ServerMetrics;)V+282
j  org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.buildSegmentInternal(Z)Lorg/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager$SegmentBuildDescriptor;+206
j  org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.buildSegmentAndReplace()Z+2
j  org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager$PartitionConsumer.run()V+561
j  java.lang.Thread.run()V+11 java.base@
v  ~StubRoutines::call_stub
V  [libjvm.so+0x8e13bb]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x39b
V  [libjvm.so+0x8df37d]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x1ed
V  [libjvm.so+0x98ae7c]  thread_entry(JavaThread*, Thread*)+0x6c
V  [libjvm.so+0xedf730]  JavaThread::run()+0x280
V  [libjvm.so+0xedc0ff]  Thread::call_run()+0x14f
V  [libjvm.so+0xc78ea6]  thread_native_entry(Thread*)+0xe6

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f95f0137ac8
  • Tiered compilation was disabled (-XX:-TieredCompilation) in an attempt to fix this kind of SIGSEGV error when invoking MutableForwardIndex / OffHeapMutableDictionary

See full core dump here: SIGSEGV-obf.log

How can we investigate this further?

snleee commented 4 months ago

@donatelloOo This looks like a low-level bug in how we write and read data from the segment. Could you give us some extra information on how to reproduce the issue?

Also, if you can share the table config & schema, it would be really helpful for further investigation.

gortiz commented 4 months ago

This may be related to #12286. I would suggest the same thing I suggested there: could you try again, running Pinot with Java 17 or 21? Alternatively, could you change pinot.offheap.buffer.factory to org.apache.pinot.segment.spi.memory.unsafe.UnsafePinotBufferFactory? That change should be applied in the pinot-server.conf file.

This change would probably not fix the underlying issue, but it may prevent the SIGSEGV.
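
For reference, a minimal sketch of what that setting might look like in pinot-server.conf. The property name and class come from the suggestion above; it assumes the rest of the file stays unchanged and that the server is restarted to pick it up.

```properties
# pinot-server.conf
# Switch the off-heap buffer implementation to the Unsafe-based factory
# (suggested as a workaround, not as a fix for the root cause).
pinot.offheap.buffer.factory=org.apache.pinot.segment.spi.memory.unsafe.UnsafePinotBufferFactory
```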

donatelloOo commented 4 months ago

Hi @gortiz, thanks for your answer. We already tried Java 17 (Amazon Corretto), but we got the same issue. I will try the UnsafePinotBufferFactory and provide feedback soon.

@snleee I can work on a reproducer with shareable data and schema. I will let you know when it's available.

donatelloOo commented 4 months ago

The only way to limit such errors seems to be adding JIT compiler exclusions like the ones below:
