apache / pulsar-dotpulsar

The official .NET client library for Apache Pulsar
https://pulsar.apache.org/
Apache License 2.0

ConsumerFaultedException: Timeout while inspecting metadata; this may indicate a deadlock #228

Open htbmw opened 2 months ago

htbmw commented 2 months ago

Description

I am getting a ConsumerFaultedException when my application starts up and tries to create a consumer. The full message and stack trace can be seen in the attached screenshot. This happens when calling GetLastMessageIds on the consumer.

I have seen this on several occasions in production since we updated the DotPulsar package to 3.3.1. I cannot recall seeing it on 3.2.1 or earlier.

[screenshot: exception]

The application runs in a pod in K8s. When errors like this persist after a number of retries, I stop the application. After many restarts (controlled by the K8s deployment), the application eventually starts without running into this exception and then continues normally, but only after several restart attempts and CrashLoopBackOffs.

Reproduction Steps

I am not sure how this can be reproduced. I have not seen it in a local environment, only in K8s clusters in production and test environments. I suspect it could be related to DotPulsar version 3.3.1, but I cannot confirm this 100%.

Expected behavior

Since I am not explicitly in control of any serializers under the hood of DotPulsar, I expect the package to not run into the reported deadlock situation if that is the case.

Actual behavior

Low level exception with details about a potential deadlock issue that I cannot see myself being responsible for.

Regression?

Not sure, but I suspect it has been happening since version 3.3.1 of the DotPulsar package.

Known Workarounds

None that I am aware of.

Configuration

No response

Other information

No response

htbmw commented 2 months ago

I am seeing this on DotPulsar 3.2.1 as well, so not specific to 3.3.1 as initially reported. This seems to be related to protobuf-net and some more information can be found here:

https://stackoverflow.com/a/17096460

Can someone please check what can be done inside DotPulsar to make it thread safe?

entvex commented 1 month ago

Hi @htbmw

Could you please provide more information on your .NET configuration:

htbmw commented 1 month ago

Hi @entvex, thanks for your request for further details.

blankensteiner commented 1 month ago

Hi @htbmw,

We have never seen this issue before but would like to help. As stated in the StackOverflow post, using 'PrepareSerializer' might bring about another issue. This seems to be an old issue, so I guess no solution is coming from protobuf-net. We could protect the 'ProtoBuf.Serializer.Serialize' call with a lock, but I think that will hurt performance.

If you can, could you create your own DotPulsar.dll after adding: static Serializer() => Serialize(new BaseCommand()); to 'DotPulsar.Internal.Serializer'? I hope this call will force protobuf-net to create the stuff needed for serializing the base command so that we don't see this issue. It's a long shot, but worth a try.
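For anyone following along, the two options mentioned above (a warm-up call in a static constructor, or a lock around every serialize call) can be sketched roughly as below. This is only an illustration of the shape of each approach: 'BaseCommand', 'WarmedSerializer' and 'LockedSerializer' here are hypothetical stand-ins, not the real DotPulsar or protobuf-net types, and the placeholder 'Serialize' body does no real serialization.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the real command type; NOT DotPulsar's BaseCommand.
public sealed class BaseCommand { }

// Option 1: warm up once in a static constructor.
public static class WarmedSerializer
{
    private static int _calls;

    // The CLR runs a static constructor at most once and blocks other threads
    // that touch the type until it has finished, so this first Serialize call
    // completes on a single thread before any concurrent caller gets in.
    static WarmedSerializer() => Serialize(new BaseCommand());

    // Number of Serialize calls so far, including the warm-up call.
    public static int Calls => Volatile.Read(ref _calls);

    // Placeholder body (assumption): a real implementation would hand the
    // command to the underlying serializer here.
    public static byte[] Serialize(BaseCommand command)
    {
        Interlocked.Increment(ref _calls);
        return Array.Empty<byte>();
    }
}

// Option 2: serialize under a lock. Simple, but every caller now waits its
// turn, which is the performance concern raised above.
public static class LockedSerializer
{
    private static readonly object Gate = new object();

    public static byte[] Serialize(BaseCommand command)
    {
        lock (Gate)
        {
            return WarmedSerializer.Serialize(command);
        }
    }
}
```

With the first option, callers keep calling 'Serialize' from as many threads as they like; the only difference is that the very first use of the type pays for one extra warm-up call. Whether that warm-up is enough to avoid the metadata timeout depends on what protobuf-net does internally, which this sketch does not model.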

entvex commented 1 month ago

Hi @htbmw.

Can you please try and see if https://www.nuget.org/packages/DotPulsar/3.3.2-rc.1 fixes the issue?

htbmw commented 1 month ago

Hi @entvex, thanks, I will give it a go and report back sometime this week.

Hi @blankensteiner, sorry for not replying sooner. I will try your suggestion if it is different from the fix that @entvex posted and asked me to test.

Appreciate everyone's help and suggestions so far!

blankensteiner commented 1 month ago

Hi @htbmw, it's the same fix :-)