apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] Zookeeper OutOfMemoryError After Upgrading to Pulsar 3.3.0 with Zookeeper 3.9.2 #23348

Open jamesvsshark opened 2 weeks ago

jamesvsshark commented 2 weeks ago

Version

Pulsar version: 3.3.0
Zookeeper version: 3.9.2
Kubernetes environment: Helm chart deployment
Zookeeper resource configuration:
  Request/limit: 6GB memory, 2 CPU
  Heap settings: -Xms5632m -Xmx5632m
  GC settings:

-XX:+UseG1GC
-XX:MaxGCPauseMillis=10
-XX:+ParallelRefProcEnabled
-XX:+UnlockExperimentalVMOptions
-XX:+DoEscapeAnalysis
-XX:+DisableExplicitGC
-XX:+ExitOnOutOfMemoryError
-XX:+PerfDisableSharedMem
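
For context on these settings, a rough check of how much container memory is left outside the Java heap may help, since native thread stacks are allocated outside the heap. The sketch below assumes that "6GB" means 6 GiB (6144 MiB) and that the JVM uses a 1 MiB thread stack (the usual 64-bit Linux default when -Xss is not set). Both are assumptions, and the function names are illustrative only.

```python
def non_heap_headroom_mib(limit_mib: int, heap_mib: int) -> int:
    """Memory left in the container once the fixed-size heap is accounted for."""
    return limit_mib - heap_mib


def max_fully_committed_stacks(headroom_mib: int, stack_kib: int = 1024) -> int:
    """How many thread stacks would fit if each one were fully committed.

    stack_kib=1024 is an assumed default; the real value depends on -Xss.
    """
    return (headroom_mib * 1024) // stack_kib


limit_mib = 6 * 1024  # assumption: "6GB" in the report means 6 GiB
heap_mib = 5632       # from -Xms5632m -Xmx5632m

headroom = non_heap_headroom_mib(limit_mib, heap_mib)
print(headroom)                              # 512
print(max_fully_committed_stacks(headroom))  # 512
```

This is a pessimistic per-thread figure, because stacks are usually only partly committed. On the other hand, metaspace, GC data structures, and direct buffers draw on the same 512 MiB, so the headroom under these settings is small either way.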

Minimal reproduce step

  1. Upgrade Pulsar to version 3.3.0 and Zookeeper to 3.9.2.
  2. Deploy the Zookeeper quorum using default GC and memory settings from the Helm chart.
  3. Observe memory consumption and monitor for crashes after a few days of running.
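
One way to do the monitoring in step 3 is to sample the thread count and resident memory of the Zookeeper JVM from /proc/<pid>/status, which exposes `Threads` and `VmRSS` fields on Linux. This is a minimal sketch: the helper names are hypothetical, and how the PID is found inside the pod is left open.

```python
from pathlib import Path


def parse_status(text: str) -> dict:
    """Turn the 'Key:\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def threads_and_rss_kib(text: str) -> tuple:
    """Return (thread count, resident set size in KiB) from status text."""
    fields = parse_status(text)
    threads = int(fields["Threads"])
    rss_kib = int(fields["VmRSS"].split()[0])  # value looks like '123456 kB'
    return threads, rss_kib


def sample(pid: int) -> tuple:
    """Read the live values for a process (Linux only)."""
    return threads_and_rss_kib(Path(f"/proc/{pid}/status").read_text())
```

Calling `sample(pid)` periodically and logging the result would show whether the thread count keeps growing before the crash, which would point at a thread leak or a process limit rather than at heap usage.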

What did you expect to see?

Zookeeper should run without constantly increasing memory usage or exhausting resources.

What did you see instead?

Error observed:

java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

Anything else?

Additional Context:

After downgrading Zookeeper to version 3.2.2, the OOM issue stopped, and no pod restarts occurred. Autorecovery pods running 3.3.0 are also encountering Java heap memory issues. Reviewing the Pulsar 3.3.0 release notes and PIP-324 (#22054), I suspect changes to the Alpine base image could be affecting thread creation and memory management.

Possible Solution:

It may be necessary to modify the Dockerfile to increase the stack size by setting the PTHREAD_STACK_MIN environment variable:

ENV PTHREAD_STACK_MIN=2097152

lhotari commented 4 days ago

@jamesvsshark do you have a chance to test with 3.3.2 version of Pulsar to see if this reproduces?

lhotari commented 4 days ago

> Deploy the Zookeeper quorum using default GC and memory settings from the Helm chart.

@jamesvsshark Is this a Helm chart issue instead? Support for 3.3.x requires the Helm chart release that is still pending: https://lists.apache.org/thread/p2fzmj31r5or65hr0yy4qgkfvnqlwzwk. Can you reproduce with that Helm chart? The defaults in the Helm chart aren't great. Contributions to improve them are welcome.

jamesvsshark commented 19 hours ago

> @jamesvsshark do you have a chance to test with 3.3.2 version of Pulsar to see if this reproduces?

Yes, I can do this. I will test and report back asap.