apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] Broker became unresponsive due to too many open files error #22894

Open · rdhabalia opened this issue 5 months ago

rdhabalia commented 5 months ago

Search before asking

Read release policy

Version

>= 2.10

Minimal reproduce step

Suddenly the broker log shows the errors below, and connected producers started seeing timeouts for published messages:

04:05:33.877 [pulsar-stats-updater-24-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/us-east1/broker1-use1:4080/persistent/topic1-1] Closing inactive ledger, last-add entry 0
04:06:33.877 [pulsar-stats-updater-24-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/us-east1/broker1-use1:4080/persistent/topic1-0] Closing inactive ledger, last-add entry 0
04:06:33.877 [pulsar-stats-updater-24-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/us-east1/broker1-use1:4080/persistent/topic1-2] Closing inactive ledger, last-add entry 0
04:06:33.877 [pulsar-stats-updater-24-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/us-east1/broker1-use1:4080/persistent/topic1-1] Closing inactive ledger, last-add entry 0
04:07:26.759 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:27.760 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:28.760 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:29.760 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:30.761 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:31.761 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:32.761 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
04:07:33.761 [pulsar-acceptor-23-1] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files

Listing open files shows that a large number of connections are in the CLOSE_WAIT state, but we don't see any other additional information when the broker goes into that state.
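For anyone trying to confirm the same symptom from inside the broker JVM, here is a minimal sketch (assuming a Unix-like platform where the OS MXBean implements com.sun.management.UnixOperatingSystemMXBean; the class name FdUsageProbe is purely illustrative) that compares the open file descriptor count against the process limit:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsageProbe {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Current number of open file descriptors held by this process.
            long open = unix.getOpenFileDescriptorCount();
            // Hard limit (ulimit -n) for this process.
            long max = unix.getMaxFileDescriptorCount();
            System.out.printf("open fds: %d / %d (%.1f%% of the limit)%n",
                    open, max, 100.0 * open / max);
        } else {
            System.out.println("Open fd count is only exposed on Unix-like JVMs.");
        }
    }
}
```

When the open count approaches the limit, accept() starts failing with EMFILE, which is exactly the "Too many open files" error the Netty acceptor thread logs above.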

What did you expect to see?

The broker should not go into such an unresponsive state.

What did you see instead?

Clients started seeing publish timeouts.

Anything else?

No response

Are you willing to submit a PR?

lhotari commented 5 months ago

This comment explains one source of the problems: https://github.com/apache/pulsar/pull/22541#issuecomment-2071568113 . The problem hasn't been resolved. Namespace deletion is especially problematic, see comment https://github.com/apache/pulsar/pull/22541#issuecomment-2071621213 . /cc @mattisonchao @codelipenghui

rdhabalia commented 5 months ago

This issue was not due to Pulsar-admin. We saw the broker suddenly crash due to a JDK error and come back immediately. Clients kept retrying connections to the same broker, assuming it was still the owner, while the broker kept rejecting the requests because it no longer owned the bundle. The list-open-files (lsof) output shows all connections in the CLOSE_WAIT state, and the broker runs out of file descriptors. There are ways to reproduce this issue and address it. But after what happened with this PR: https://github.com/apache/pulsar/pull/22841 , where I had provided an explanation of the root cause and another PR with a similar approach was merged, this time I would like to avoid making any extra effort to create a PR, and I hope the community will put in the effort to address this issue.
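As background on why sockets pile up in CLOSE_WAIT: a socket sits in that state when the remote side has closed its end but the local application has not yet called close(), so the file descriptor is never released. The following is a minimal, hypothetical Netty sketch, not Pulsar's actual ServerCnx code and with purely illustrative names, of the defensive pattern of writing the rejection and then closing the channel instead of relying on the client to tear the connection down:

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import java.nio.charset.StandardCharsets;

// Hypothetical handler illustrating the CLOSE_WAIT mechanism described above.
public class RejectingHandler extends ChannelInboundHandlerAdapter {

    private final boolean ownsBundle;

    public RejectingHandler(boolean ownsBundle) {
        this.ownsBundle = ownsBundle;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (!ownsBundle) {
            // Writing the rejection alone is not enough: if the broker side never
            // closes the connection, the peer's close leaves this socket (and its
            // file descriptor) stuck in CLOSE_WAIT until the application closes it.
            ctx.writeAndFlush(Unpooled.copiedBuffer(
                            "bundle not owned by this broker", StandardCharsets.UTF_8))
                    // Close once the rejection is flushed, releasing the fd promptly.
                    .addListener(ChannelFutureListener.CLOSE);
            return;
        }
        ctx.fireChannelRead(msg);
    }
}
```

Whether this pattern maps onto the actual leak in the broker is for whoever picks this up to verify; the sketch only illustrates why unreleased server-side sockets show up as CLOSE_WAIT in the lsof output and eventually exhaust the fd limit.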