Genuineh closed 8 months ago
@Genuineh I don't think there is an issue with the producer; most probably you are getting an exception when calling client.NewTransaction().BuildAsync(). TransactionMetaStoreHandler has its own ConnectionHandler (similar to the producer and consumer), but it is not exposed, so you can't check IsConnected on it. I would expect you to have some disconnection error logs for TransactionMetaStoreHandler and its ConnectionHandler. I wonder how this is handled in the Java client.
I think it would be useful to investigate why the ConnectionHandler inside TransactionMetaStoreHandler is unable to reconnect.
@Lanayx If we can work out why the reconnection is failing, then of course that resolves this problem. Nevertheless, even if reconnection works, it could be beneficial to let upper-layer users restart the component themselves to restore service, because we cannot fully guarantee that no other issues will emerge over a long period of iteration. Keeping a channel open for users to self-recover would at least cover the problems that seem to disappear on restart, and so reduce the impact of certain bugs. Since I am not fully aware of the interaction strategy between the server and the client, I am also uncertain whether opening such an entry point might introduce new bugs.
It will take some time to reproduce this with more detailed logs. Were there any relevant log outputs when the disconnection occurred?
You should check the logs for the ConnectionHandler and ClientCnx associated with the particular TransactionMetaStoreHandler and TransactionCoordinatorClient.
@Lanayx It does recover after the error, but slowly. I found that the problem is with the high availability of our cluster: after it crashes under high load, service recovery is slow, which results in continuous errors for a long period of time. So there should not be a significant issue with the client itself. This was our own mistake. Thank you for your patience.
After the exception is thrown, keep-alive requests still seem to be sent successfully, but the producer throws on every send until it is recreated. Currently we check "IsConnected" as the flag for restarting the producer, and its value still appears to be true. This error only occurs with a certain probability after the server has been running for a while, and we have not yet found a reliable way to reproduce it. I think the value of "IsConnected" should match the actual state of the connection.
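For illustration only, here is a minimal, language-neutral sketch (written in Python) of the kind of user-side self-recovery discussed in this thread: instead of relying solely on a connection flag, the wrapper counts consecutive send failures and recreates the producer once a threshold is reached. Everything here is hypothetical, including the `Producer` interface, `SelfHealingProducer`, the `factory` callable, and `max_failures`; none of it is part of the Pulsar client API, and whether recreating a producer is safe with respect to the broker is exactly the open question raised above.

```python
from typing import Callable, Protocol


class Producer(Protocol):
    """Hypothetical minimal producer interface (not the real client API)."""

    def send(self, message: bytes) -> None: ...
    def close(self) -> None: ...


class SelfHealingProducer:
    """Recreates the wrapped producer after too many consecutive send failures."""

    def __init__(self, factory: Callable[[], Producer], max_failures: int = 3):
        self._factory = factory            # assumed: builds a fresh producer
        self._max_failures = max_failures  # assumed threshold, tune per workload
        self._failures = 0
        self._producer = factory()
        self.restarts = 0

    def send(self, message: bytes) -> bool:
        """Returns True on success, False if the send failed."""
        try:
            self._producer.send(message)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._restart()
            return False
        self._failures = 0  # any success resets the failure streak
        return True

    def _restart(self) -> None:
        try:
            self._producer.close()
        except Exception:
            pass  # the old producer may already be unusable
        self._producer = self._factory()
        self._failures = 0
        self.restarts += 1
```

The design choice is that the restart decision is driven by observed send outcomes, so it still triggers when a flag such as "IsConnected" reports true while every send throws. The failed message is not retried here; the caller decides what to do with a False result.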