Open rdhabalia opened 5 months ago
This comment explains one source of the problems: https://github.com/apache/pulsar/pull/22541#issuecomment-2071568113 . The problem hasn't been resolved. Namespace deletion is especially problematic, see comment https://github.com/apache/pulsar/pull/22541#issuecomment-2071621213 . /cc @mattisonchao @codelipenghui
This issue was not caused by pulsar-admin. We saw the broker suddenly crash due to a JDK error and come back immediately. The client kept retrying to connect to the same broker, assuming it was still the owner, while the broker kept rejecting the requests because it no longer owned the bundle. The list open files (lsof) output shows all connections in CLOSE_WAIT state, and the broker runs out of file descriptors.
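For anyone trying to confirm the same symptom without lsof, here is a minimal sketch that counts sockets per TCP state by reading `/proc/net/tcp` on Linux. It assumes the broker listens on the default binary port 6650 and that the script runs in the same network namespace as the broker. The function name `count_tcp_states` is made up for illustration and is not part of Pulsar.

```python
from collections import Counter
from pathlib import Path

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text, port=None):
    """Count sockets per TCP state, optionally filtered by local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like HEXADDR:HEXPORT
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if port is not None and local_port != port:
            continue
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1
    return counts


if __name__ == "__main__":
    # Assumption: broker uses the default binary protocol port 6650.
    total = Counter()
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_tcp_states(path.read_text(), port=6650)
    print(total)
```

A CLOSE_WAIT count that keeps growing on the broker port would match what the lsof output showed here: the peer has closed the connection but the broker has not yet closed its side of the socket.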
There are ways to reproduce and address this issue. However, after what happened with this PR: https://github.com/apache/pulsar/pull/22841 , where I had provided an explanation of the root cause and another PR with a similar approach was merged, this time I would like to avoid spending extra effort on creating a PR. I hope the community will be able to put in the effort to address this issue.
Search before asking
Read release policy
Version
>= 2.10
Minimal reproduce step
The broker log suddenly showed the error below, and connected producers started seeing timeouts for published messages.
Listing open files shows that a large number of connections are in CLOSE_WAIT state, but we don't see any additional information when the broker goes into that state.
What did you expect to see?
The broker should not go into such an unresponsive state.
What did you see instead?
The client started seeing publish timeouts.
Anything else?
No response
Are you willing to submit a PR?