apache / bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
https://bookkeeper.apache.org/
Apache License 2.0
1.91k stars 903 forks source link

AutoRecovery caused DirectMemory OOM error. #3408

Open gaozhangmin opened 2 years ago

gaozhangmin commented 2 years ago

Bookkeeper Version: 4.14.4

Our prod environment went wrong last week, all bookies were killed because of direct memory OOM, this happened after one bookie's disk was broken, we tried to offline this bookie. After auditBookie triggered, all the bookies Direct Memory keep increase, it seem that, there is memory leak problem.

The ReplicateWorker log: x.x.x.x is the ip of lost bookie

022-07-14 21:50:41.721 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Could not connect to bookie: null/x.x.x.x:3181, current state CONNECTING : 
2022-07-14 21:50:41.723 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1502 from bookie: x.x.x.x:3181
2022-07-14 21:50:41.724 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:41.724 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1506 from bookie: x.x.x.x:3181
2022-07-14 21:50:41.724 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:41.724 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1510 from bookie: x.x.x.x:3181
2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:41.725 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1514 from bookie: x.x.x.x:3181
2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:41.725 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1518 from bookie: x.x.x.x:3181
2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running

2022-07-14 21:50:42.403 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1974 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.440 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1978 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.525 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1982 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.593 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1986 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.665 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1990 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.706 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1994 from bookie: x.x.x.x:3181
2022-07-14 21:50:42.776 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1998 from bookie: x.x.x.x:3181
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running

2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L60558419 E359 from bookie: 10.71.168.13:3181
2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 434
2022-07-14 22:10:19.831 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -23

is (-1, rc = null)
2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L60558419 E378 from bookie: 1.1.1.1:3181
2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L60558419 E378-E378, Sent to [x.x.x.x:3181, 1.1.1.1:3181], Heard from [] : bitset = {}, Error = 'Bookie operation timeout'. First unread entry is (-1, rc = null)

And there are bookies quarantined by brokers continuous, all bookies are crashed at last.

Note: autoRecoveryDaemonEnabled=true, we run autoRecovery service inside the bookie service. After the bookie was lost because of disk failure, we also run a decommission command to offline this bookie. (I thinks it s a duplicate trigger of auditing bookie)

image

NUM_ENTRIES_UNABLE_TO_READ_FOR_REPLICATION metrics image

dlg99 commented 2 years ago

Do you have rereplicationEntryBatchSize set to a large value? try setting it back to default (10).

gaozhangmin commented 2 years ago

@dlg99 it's 500, it's because of this setting?

StevenLuMT commented 2 years ago

can you provide a bookie configuration? I help you analyze it @gaozhangmin

horizonzy commented 2 years ago

After research for a long time. We found that is bookie problem, the request from ReplicationWorker is too many.

The shutdown bookie holds many ledgers, when it shutdown, the Auditor mark many ledgers to underreplication. And there are many ReplicationWorker to replicate ledger, the config rereplicationEntryBatchSize is 500, so every ReplicationWorker will send 500 read request to bookie servers, so the bookie server receives lots of reuqest, it will allocate direct memory for reuqest.

The release operation is not catching up allocate operation, so the PoolChunk is more and more until it reach maxDirectMemory.

@gaozhangmin supply two heap dumps file, the less is dumpped when replicate operation start, The more file is dumpped when the replicate for a while.

less.hprof.zip more.hprof.zip

I found that PoolChunk is 244 in more, 120 in less. The PoolChunk direct memory is 4M in bookie, so it increase 124 * 4M direct memory than less.

And there is another issue we found, if user config DbLedgerStorage, when it start, it will occupy 1/2 direct memory for readCache and writeCache, it's unpooled but cuupy direct memory.

In the Direct memory pool, it only has 1/2 direct memory to allocate, it will cause oom easier.

horizonzy commented 2 years ago

And we found the io.netty.buffer.PooledByteBufAllocator#directArenas array size is 80, it also cause oom easier.

image
horizonzy commented 2 years ago

So this is a normal case, you can reduce rereplicationEntryBatchSize value to decrease bookie server reque frequency. If no more confusion, can be close.

yapxue commented 2 years ago

@horizonzy I also suffered OutOfDirectMemoryError when doing performance test. I send very huge writes and bookie will OOM and restart all the time. But in my case, replica recovery is disabled. If it is caused by allocation fatser than release, why the direct memory is still not released after a long time even if I stopped prodcuer and entries rate is 0. Here is a snapshot of direct memory usage.

Screen_Shot_2022-07-31_at_4_30_24_PM
horizonzy commented 2 years ago

BytBuf.release maybe not free PoolChunk. If The ByteBuf is cached by PoolThreadCache. The direct memory can't be release.

StevenLuMT commented 2 years ago

@horizonzy I also suffered OutOfDirectMemoryError when doing performance test. I send very huge writes and bookie will OOM and restart all the time. But in my case, replica recovery is disabled. If it is caused by allocation fatser than release, why the direct memory is still not released after a long time even if I stopped prodcuer and entries rate is 0. Here is a snapshot of direct memory usage. Screen_Shot_2022-07-31_at_4_30_24_PM

Your scene is different,I will discuss the details with you offline @yapxue

jimmycxm commented 2 years ago

@horizonzy I also suffered OutOfDirectMemoryError when doing performance test. I send very huge writes and bookie will OOM and restart all the time. But in my case, replica recovery is disabled. If it is caused by allocation fatser than release, why the direct memory is still not released after a long time even if I stopped prodcuer and entries rate is 0. Here is a snapshot of direct memory usage. Screen_Shot_2022-07-31_at_4_30_24_PM

Got same issue. Is the problem resolved?