apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
SIGSEGV error in docker (Java client) #14534

Closed lbenc135 closed 2 years ago

lbenc135 commented 2 years ago

Describe the bug The Pulsar Java client crashes with the message below when trying to create a Pulsar client. I reproduced the crash with versions 2.9.1, 2.8.2 and 2.7.4, but the same code works on 2.7.1. The crash also doesn't happen when running on a local machine, only when running in a Docker container (openjdk:14-alpine).

Logs:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment (14.0+33) (build 14-ea+33)
# Java VM: OpenJDK 64-Bit Server VM (14-ea+33, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  0x0000000000003fd6
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /src/services/rule_engine/core.1)
#
# An error report file with more information is saved as:
# /src/services/rule_engine/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Full log: hs_err_pid1.log
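
For anyone triaging a similar crash, the header above already shows the signal and the problematic frame. Below is a minimal sketch that pulls those two facts out of an hs_err file, assuming the header layout shown in this issue. `HsErrHeader` is a hypothetical helper written for illustration; it is not part of Pulsar, Netty, or the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: summarizes the header of a HotSpot hs_err_pid*.log file.
final class HsErrHeader {
    // Matches e.g. "#  SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=1, tid=7"
    private static final Pattern SIGNAL =
            Pattern.compile("^#\\s+(SIG[A-Z0-9]+)\\s+\\(0x[0-9a-fA-F]+\\)\\s+at pc=(0x[0-9a-fA-F]+)");
    // Matches the line following "# Problematic frame:", e.g. "# C  0x0000000000003fd6"
    private static final Pattern FRAME = Pattern.compile("^#\\s+(\\S)\\s+(\\S.*)$");

    final String signal;
    final String pc;
    final String frameType; // "C" marks a native frame in the header shown above
    final String frame;

    private HsErrHeader(String signal, String pc, String frameType, String frame) {
        this.signal = signal;
        this.pc = pc;
        this.frameType = frameType;
        this.frame = frame;
    }

    // Returns the signal and problematic frame, or empty if the header is not recognized.
    static Optional<HsErrHeader> parse(List<String> lines) {
        String signal = null;
        String pc = null;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            Matcher s = SIGNAL.matcher(line);
            if (s.find()) {
                signal = s.group(1);
                pc = s.group(2);
            } else if (signal != null && line.startsWith("# Problematic frame:")
                    && i + 1 < lines.size()) {
                Matcher f = FRAME.matcher(lines.get(i + 1));
                if (f.find()) {
                    return Optional.of(new HsErrHeader(signal, pc, f.group(1), f.group(2).trim()));
                }
            }
        }
        return Optional.empty();
    }

    boolean crashedInNativeCode() {
        return "C".equals(frameType);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java HsErrHeader.java <path to hs_err_pid*.log>");
            return;
        }
        Optional<HsErrHeader> header = parse(Files.readAllLines(Paths.get(args[0])));
        if (header.isPresent()) {
            HsErrHeader h = header.get();
            System.out.println(h.signal + " at " + h.pc + ", frame: " + h.frameType + " " + h.frame
                    + (h.crashedInNativeCode() ? " (native code)" : ""));
        } else {
            System.out.println("No crash header recognized");
        }
    }
}
```

A native frame only says that the crash happened outside the JVM; finding the library involved still requires the full log.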

Code:

client = PulsarClient.builder()
        .serviceUrl(BaseSettings.PULSAR_URL)   // "pulsar://localhost:6650"
        .loadConf(clientSettings)              // empty HashMap
        .build();

To Reproduce Not sure. The description should hopefully provide enough info.

Expected behavior Doesn't crash.

Desktop (please complete the following information): Full info available in the full log file (under S Y S T E M near the end).

lhotari commented 2 years ago

@lbenc135 Can you check whether the same problem reproduces when you don't use an Alpine-based OpenJDK base image? Please test with openjdk:14.

lhotari commented 2 years ago

Possibly related to #11415 #11224 #10798

lhotari commented 2 years ago

https://github.com/netty/netty-tcnative/issues/649#issuecomment-905524734

We switched to adoptopenjdk/openjdk15:alpine-slim (instead of openjdk:15-alpine) and the problem disappeared
 
Since adoptopenjdk is deprecated, can you try using the eclipse-temurin:17-alpine base image to see if that works for you? Eclipse Temurin images are maintained by Adoptium, which provides pre-built OpenJDK binaries.

lbenc135 commented 2 years ago

@lhotari openjdk:14 works, but eclipse-temurin:17-alpine does not. In any case, changing the base image is a bit tricky for us. Is there any plan to fix this for alpine images?
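
Since the results so far differ between Alpine and non-Alpine images, it can help to log at startup which kind of image the process is actually running in. A minimal sketch, assuming that the presence of /etc/alpine-release is a good enough signal for an Alpine base image; `BaseImageCheck` is a hypothetical name, and the check only reports the environment without working around the crash.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic: reports whether the filesystem looks like Alpine Linux.
final class BaseImageCheck {
    // Alpine ships a release marker at /etc/alpine-release; root is a parameter
    // so the check can be exercised against a temporary directory.
    static boolean looksLikeAlpine(Path root) {
        return Files.isRegularFile(root.resolve("etc").resolve("alpine-release"));
    }

    public static void main(String[] args) {
        boolean alpine = looksLikeAlpine(Paths.get("/"));
        System.out.println("Alpine base image: " + alpine
                + ", java.version=" + System.getProperty("java.version")
                + ", os.arch=" + System.getProperty("os.arch"));
    }
}
```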

lhotari commented 2 years ago

Since this is an open source project, it will depend on someone contributing a fix for this problem. One form of contributing is contributing a simple repro case. That could be a separate GitHub repository which contains the repro and instructions.

There might be workarounds. Some issues might be caused by shaded library versions conflicting with the application. Here's one issue about this in netty: https://github.com/netty/netty/issues/11879

For Pulsar, it's possible to use the unshaded client. The coordinates are here: https://search.maven.org/artifact/org.apache.pulsar/pulsar-client-original/2.8.2/jar

For Maven:

<dependency>
  <groupId>org.apache.pulsar</groupId>
  <artifactId>pulsar-client-original</artifactId>
  <version>2.8.2</version>
</dependency>

For Gradle:

implementation 'org.apache.pulsar:pulsar-client-original:2.8.2'

Does your application use Netty or contain shaded Netty?

lhotari commented 2 years ago

When using pulsar-client-original, you might also need to use dependencyManagement in Maven or Gradle's version alignment features to ensure that there aren't mixed versions of the Netty and netty-tcnative-boringssl-static libraries.

For Maven, something like this:

  <properties>
    <netty.version>4.1.74.Final</netty.version>
    <netty-tc-native.version>2.0.48.Final</netty-tc-native.version>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-bom</artifactId>
        <version>${netty.version}</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
      <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-tcnative-boringssl-static</artifactId>
        <version>${netty-tc-native.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.pulsar</groupId>
      <artifactId>pulsar-client-original</artifactId>
      <version>2.8.2</version>
    </dependency>
  </dependencies>

@lbenc135 Are you using Maven or Gradle?
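
To check whether mixed versions actually end up on the runtime classpath, the jar file names can be inspected. The sketch below assumes that Netty jars sit on the classpath as separate files with conventional names such as netty-buffer-4.1.74.Final.jar; it cannot see Netty classes relocated inside a shaded or fat jar. `NettyVersionCheck` and its methods are hypothetical names for illustration, not part of Netty or Pulsar.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic: groups Netty jars on the classpath by version.
final class NettyVersionCheck {
    // artifact name, version (e.g. 4.1.74.Final), optional classifier (e.g. -linux-x86_64)
    private static final Pattern NETTY_JAR = Pattern.compile(
            "^(netty-[a-z0-9-]+?)-(\\d+\\.\\d+\\.\\d+\\.[A-Za-z0-9]+)(?:-[A-Za-z0-9_.-]+)?\\.jar$");

    // netty-tcnative is versioned separately from the rest of Netty, so it is its own family.
    static Map<String, Set<String>> versionsByFamily(List<String> classPathEntries) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String entry : classPathEntries) {
            Matcher m = NETTY_JAR.matcher(new File(entry).getName());
            if (m.matches()) {
                String family = m.group(1).startsWith("netty-tcnative") ? "netty-tcnative" : "netty";
                versions.computeIfAbsent(family, k -> new TreeSet<>()).add(m.group(2));
            }
        }
        return versions;
    }

    // A family with more than one version means mixed jars are present.
    static boolean hasMixedVersions(Map<String, Set<String>> versions) {
        return versions.values().stream().anyMatch(v -> v.size() > 1);
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path", "");
        Map<String, Set<String>> versions =
                versionsByFamily(Arrays.asList(classPath.split(File.pathSeparator)));
        versions.forEach((family, v) -> System.out.println(family + ": " + v));
        System.out.println("Mixed versions: " + hasMixedVersions(versions));
    }
}
```

With Maven, running `mvn dependency:tree -Dincludes=io.netty` gives the same information at build time.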

lbenc135 commented 2 years ago

@lhotari Sorry for the delay. We're using Maven and the solution with pulsar-client-original and Netty dependency management worked with openjdk:14-alpine. Thanks for the tip!

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

ehenoma commented 2 years ago

I got the same issue using openjdk:17 in Docker:

# A fatal error has been detected by the Java Runtime Environment:
Wed, Jun 29 2022 5:53:04 pm | #
Wed, Jun 29 2022 5:53:04 pm | # SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=7, tid=8
Wed, Jun 29 2022 5:53:04 pm | #
Wed, Jun 29 2022 5:53:04 pm | # JRE version: OpenJDK Runtime Environment (17.0+14) (build 17-ea+14)
Wed, Jun 29 2022 5:53:04 pm | # Java VM: OpenJDK 64-Bit Server VM (17-ea+14, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
Wed, Jun 29 2022 5:53:04 pm | # Problematic frame:
Wed, Jun 29 2022 5:53:04 pm | # C 0x0000000000003fd6
Wed, Jun 29 2022 5:53:04 pm | #
Wed, Jun 29 2022 5:53:04 pm | # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /usr/src/build/core.7)
Wed, Jun 29 2022 5:53:04 pm | #
Wed, Jun 29 2022 5:53:04 pm | # An error report file with more information is saved as:
Wed, Jun 29 2022 5:53:04 pm | # /usr/src/build/hs_err_pid7.log
Wed, Jun 29 2022 5:53:04 pm | #
Wed, Jun 29 2022 5:53:04 pm | # If you would like to submit a bug report, please visit:
Wed, Jun 29 2022 5:53:04 pm | # https://bugreport.java.com/bugreport/crash.jsp
Wed, Jun 29 2022 5:53:04 pm | # The crash happened outside the Java Virtual Machine in native code.
Wed, Jun 29 2022 5:53:04 pm | # See problematic frame for where to report the bug.
Wed, Jun 29 2022 5:53:04 pm | #

ypzhuang commented 2 years ago

After upgrading from pulsar-client-all:2.10.0 to 2.10.1, the issue is gone.

tisonkun commented 2 years ago

@ypzhuang Thanks for your report. Closed as fixed.