Open wolfstudy opened 2 years ago
Unfortunately, this exception stack does not show where exactly the call caused the bookie to run out of direct memory. In a production environment this happens suddenly on a running bookie. The ledgerStorageClass is as follows:
ledgerStorageClass=org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage
DirectMemory is configured as follows:
Most likely the problem is related to backpressure (or the lack of it). Under heavy load the bookie cannot process data fast enough, and data accumulates in memory.

There are two sides to the backpressure configuration:
- client: Issue #1086 (@bug W-4146427@) Client-side backpressure in Netty (fixes io.netty.util.internal.OutOfDirectMemoryError under continuous heavy load) #1088
- server: Issue #1409: Added server-side backpressure (@bug W-3651831@) #1410 / Apply the backpressure changes on the V2 requests #3324

Ideally, both sides should be configured. By default, backpressure is disabled.

The bookie server needs:
maxAddsInProgressLimit = ..
maxReadsInProgressLimit = ..
closeChannelOnResponseTimeout = true
waitTimeoutOnResponseBackpressureMs = ..
and the client:
waitTimeoutOnBackpressureMillis = ..

Pulsar's configuration may need special prefixes (bookkeeper_).

Other things to consider: run Autorecovery as a separate service (not as part of the bookie).
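As a rough sketch of what those settings could look like in practice (the parameter names are from the comment above; the numeric values here are illustrative placeholders, not tuned recommendations, and the Pulsar-prefixed line assumes the `bookkeeper_` prefix convention applies to these keys):

```
# bookie server (bookkeeper.conf) -- illustrative values only
maxAddsInProgressLimit=10000
maxReadsInProgressLimit=10000
closeChannelOnResponseTimeout=true
waitTimeoutOnResponseBackpressureMs=1000

# BookKeeper client configuration -- illustrative value only
waitTimeoutOnBackpressureMillis=1000

# When the client is embedded in a Pulsar broker (broker.conf), the same
# client key would presumably be prefixed, e.g.:
bookkeeper_waitTimeoutOnBackpressureMillis=1000
```

Appropriate limits depend on entry size and available direct memory; the intent is to bound in-flight adds/reads so memory stops growing without bound under load.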
Thanks for @dlg99's suggestion. Looking at the JVM DirectMemory monitoring, it appears that direct memory is leaking. We monitored it for 30 days and found that it grows exponentially and is never released, until it reaches the configured maximum DirectMemory size and an OOM occurs. So I would like to confirm whether there is a memory leak somewhere. Since the exception stack does not contain any information about the code call path, I am not sure whether there are better ways to pinpoint it.
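One lightweight way to watch direct-memory growth from inside the JVM (independent of any external monitoring) is the standard `BufferPoolMXBean`. A minimal sketch, assuming plain NIO direct buffers; note that Netty's pooled allocator can allocate via `Unsafe` and bypass this accounting (e.g. when `io.netty.maxDirectMemory` is set), so these numbers may under-report what Netty actually holds:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // The platform exposes one MXBean per NIO buffer pool,
        // typically named "direct" and "mapped".
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: used=%d bytes, capacity=%d bytes, count=%d%n",
                    pool.getName(), pool.getMemoryUsed(),
                    pool.getTotalCapacity(), pool.getCount());
        }
    }
}
```

Sampling this periodically (or via JMX under `java.nio:type=BufferPool`) gives a trend line to compare against the configured `-XX:MaxDirectMemorySize`. For Netty's own arenas, `PooledByteBufAllocator.DEFAULT.metric()` is the more accurate source.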
We plan to enable Netty's memory-leak detection to see whether we can find where the leak occurs.
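For reference, Netty's leak detector is controlled by the `io.netty.leakDetection.level` system property; at `paranoid` it tracks every buffer and logs a `LEAK:` message with allocation/access stack traces when a buffer is garbage-collected without `release()` being called. A sketch of the JVM flag (how you pass extra JVM options depends on your deployment scripts, e.g. a bookie env file):

```
# Illustrative: add to the bookie's JVM options
-Dio.netty.leakDetection.level=paranoid
```

`paranoid` has significant CPU overhead, so it is usually enabled temporarily or on one instance; `advanced` is a cheaper sampled alternative.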
BUG REPORT
Describe the bug
Bookie Version: 4.14.4