confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0
2.78k stars, 847 forks

Intermittent System.IO.IOException for Schema Registry #2195

Open betmix-matt opened 4 months ago

betmix-matt commented 4 months ago

Description

We have a persistent problem with Schema Registry: requests to it do not succeed reliably. This happens both with Schema Registry in Confluent Cloud and with Schema Registry deployed through Confluent for Kubernetes.

This happens very infrequently (possibly less than 1% of the time), so it is hard to reproduce consistently, but it causes our integration test runs to fail nearly 100% of the time because one test out of hundreds fails.

At random, a publish request fails with the following error and stack trace:

System.Threading.Tasks.TaskCanceledException : The request was canceled due to the configured HttpClient.Timeout of 30 seconds elapsing.
---- System.TimeoutException : The operation was canceled.
-------- System.Threading.Tasks.TaskCanceledException : The operation was canceled.
------------ System.IO.IOException : Unable to read data from the transport connection: Operation canceled.
---------------- System.Net.Sockets.SocketException : Operation canceled
   at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Confluent.SchemaRegistry.RestService.ExecuteOnOneInstanceAsync(Func`1 createRequest)
   at Confluent.SchemaRegistry.RestService.RequestAsync[T](String endPoint, HttpMethod method, Object[] jsonBody)
   at Confluent.SchemaRegistry.RestService.LookupSchemaAsync(String subject, Schema schema, Boolean ignoreDeletedSchemas, Boolean normalize)
   at Confluent.SchemaRegistry.Serdes.ProtobufSerializer`1.<>c__DisplayClass16_0.<<RegisterOrGetReferences>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at Confluent.SchemaRegistry.Serdes.ProtobufSerializer`1.RegisterOrGetReferences(FileDescriptor fd, SerializationContext context, Boolean autoRegisterSchema, Boolean skipKnownTypes)
   at Confluent.SchemaRegistry.Serdes.ProtobufSerializer`1.<>c__DisplayClass16_0.<<RegisterOrGetReferences>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at Confluent.SchemaRegistry.Serdes.ProtobufSerializer`1.RegisterOrGetReferences(FileDescriptor fd, SerializationContext context, Boolean autoRegisterSchema, Boolean skipKnownTypes)
   at Confluent.SchemaRegistry.Serdes.ProtobufSerializer`1.SerializeAsync(T value, SerializationContext context)

How to reproduce

Publish any message that requires a request to Schema Registry. We have even tried duplicating the URLs in the Schema Registry config so that a failed request is retried against the next URL, but this doesn't seem to have helped.

I would love it if the Schema Registry Client had some kind of retry semantics built in so that it could handle intermittent network failures like this.
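For illustration, the kind of retry semantics meant here can be approximated on the caller side today. The following is only a minimal sketch under stated assumptions: `RetryHelper`, `WithRetryAsync`, and `IsTransient` are hypothetical names that are not part of Confluent.SchemaRegistry, the attempt count and backoff values are arbitrary, and the set of exception types treated as transient is taken from the stack trace above rather than from any documented contract.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper (not part of Confluent.SchemaRegistry): retries an
// async operation when it fails with an exception that looks transient.
public static class RetryHelper
{
    public static async Task<T> WithRetryAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts = 3,          // assumed value, tune as needed
        TimeSpan? baseDelay = null)   // assumed default of 200 ms below
    {
        if (operation == null) throw new ArgumentNullException(nameof(operation));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = baseDelay ?? TimeSpan.FromMilliseconds(200);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // The filter lets the exception propagate unchanged on the last
            // attempt or when the failure does not look transient.
            catch (Exception ex) when (attempt < maxAttempts && IsTransient(ex))
            {
                // Exponential backoff: delay, 2 * delay, 4 * delay, ...
                double waitMs = delay.TotalMilliseconds * Math.Pow(2, attempt - 1);
                await Task.Delay(TimeSpan.FromMilliseconds(waitMs)).ConfigureAwait(false);
            }
        }
    }

    // Assumption: these are the exception types seen in the trace above.
    // TaskCanceledException is also thrown for caller-requested cancellation,
    // which a real implementation would want to tell apart from a timeout.
    private static bool IsTransient(Exception ex) =>
        ex is TaskCanceledException ||
        ex is TimeoutException ||
        ex is IOException ||
        ex is HttpRequestException;
}
```

A call site might then look like `byte[] payload = await RetryHelper.WithRetryAsync(() => serializer.SerializeAsync(value, context));`, where `serializer`, `value`, and `context` are placeholders for an existing `ProtobufSerializer<T>`, message, and `SerializationContext`. Because each attempt can still run into the 30-second HttpClient timeout shown above, the worst-case latency grows with the attempt count.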

Checklist

Please provide the following information: