fsprojects / pulsar-client-dotnet

Apache Pulsar native client for .NET (C#/F#/VB)
MIT License

Producer can't recover after throwing "tmsHandler-xx ConnectionHandler not connected" #248

Closed Genuineh closed 8 months ago

Genuineh commented 8 months ago

After the exception is thrown, keepalive requests still seem to succeed, but the producer throws the error every time until it is recreated. Currently we check "IsConnected" as the flag for restarting the producer, and its value still appears to be true. The error only occurs with a certain probability after the server has been running for a while, and we have not yet found a reliable way to reproduce it. I think the value of "IsConnected" should reflect the actual connection state.

Lanayx commented 8 months ago

@Genuineh I don't think there is an issue with the producer; you are most probably getting the exception when calling client.NewTransaction().BuildAsync(). TransactionMetaStoreHandler has its own ConnectionHandler (similar to the producer and consumer), but it's not exposed, so you can't check IsConnected on it. I would expect you to have some disconnection error logs for TransactionMetaStoreHandler and its ConnectionHandler. I wonder how this is handled in the Java client.

Lanayx commented 8 months ago

I think it would be useful to investigate why the ConnectionHandler inside TransactionMetaStoreHandler is unable to reconnect.

Genuineh commented 8 months ago

@Lanayx If we can work out why the reconnection is failing, then of course that resolves this problem. Nevertheless, even if reconnection works, it could still be beneficial to let upper-layer users trigger a restart themselves to restore service, because we cannot fully guarantee that no other issues will emerge over the long course of iteration. Keeping a channel open for users to self-recover would at least handle the problems that disappear after a restart, reducing the impact of certain bugs. Since I am not fully aware of the interaction strategy between the server and the client, I am also uncertain whether opening such an entry point might introduce new bugs.

Genuineh commented 8 months ago

It will take some time to reproduce more detailed logs. Were there any relevant log outputs when the disconnection occurred?

Lanayx commented 8 months ago

You should check the logs for the ConnectionHandler and ClientCnx associated with the particular TransactionMetaStoreHandler and TransactionCoordinatorClient.

Genuineh commented 8 months ago

@Lanayx It does recover after the error, but slowly. I found that the problem is with the high availability of our cluster: after it crashes under high load, service recovery is slow, which results in continuous errors for a long period of time. So there should not be a significant issue with the client itself. This was our own mistake. Thank you for your patience.
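
For readers who hit the same symptom while a cluster is slowly recovering, one option is to retry the failing call with a growing delay rather than failing on every attempt. The following is only a minimal sketch under stated assumptions: RetryHelper and RetryWithBackoffAsync are hypothetical names and are not part of pulsar-client-dotnet, and the attempt count and delays are illustrative values. The thread does not state the exact exception type, so the sketch retries on any exception except cancellation; real code would likely narrow that filter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of pulsar-client-dotnet.
public static class RetryHelper
{
    // Runs `operation`, retrying with exponential backoff when it throws.
    // After `maxAttempts` failed attempts the last exception propagates.
    public static async Task<T> RetryWithBackoffAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts = 5,
        TimeSpan? initialDelay = null,
        CancellationToken cancellationToken = default)
    {
        if (operation == null) throw new ArgumentNullException(nameof(operation));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        // Assumed starting delay; tune for the cluster's real recovery time.
        var delay = initialDelay ?? TimeSpan.FromMilliseconds(200);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxAttempts
                                       && !(ex is OperationCanceledException))
            {
                // Wait, then double the delay (capped at 10 seconds).
                await Task.Delay(delay, cancellationToken).ConfigureAwait(false);
                delay = TimeSpan.FromMilliseconds(
                    Math.Min(delay.TotalMilliseconds * 2, 10_000));
            }
        }
    }
}
```

Assuming BuildAsync returns a Task<T>, the call mentioned earlier in the thread could be wrapped as RetryHelper.RetryWithBackoffAsync(() => client.NewTransaction().BuildAsync()). This only masks a slow broker recovery; it does not change how the client's internal ConnectionHandler reconnects.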