confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0

Consume throwing "unknown topic or partition" for topics that already exist #1766

Open michael-c-spx opened 2 years ago

michael-c-spx commented 2 years ago

Description

We've been running into an occasional issue where, during application start, the consumer may throw "unknown topic or partition" errors from Consume for topics that definitely exist, and have existed for a long time (months).

I've seen a few questions in this area that relate to the 'auto.create.topics.enable' setting; this is a different issue.

We use the Admin API to check that all the topics we will subscribe to exist before subscribing, and this confirms that all the topics are available (as expected).
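For illustration, a pre-subscribe existence check along these lines can be done with the client's Admin API (a sketch, not the reporter's actual code; broker address and topic names are placeholders):

```csharp
using System;
using System.Linq;
using Confluent.Kafka;

class TopicCheck
{
    static void Main()
    {
        var requiredTopics = new[] { "topic-a", "topic-b" };

        using var admin = new AdminClientBuilder(
            new AdminClientConfig { BootstrapServers = "localhost:9092" }).Build();

        // GetMetadata returns cluster metadata, including the topics the broker knows about.
        var metadata = admin.GetMetadata(TimeSpan.FromSeconds(10));
        var known = metadata.Topics
            .Where(t => t.Error.Code == ErrorCode.NoError)
            .Select(t => t.Topic)
            .ToHashSet();

        var missing = requiredTopics.Where(t => !known.Contains(t)).ToList();
        if (missing.Any())
            throw new InvalidOperationException(
                $"Missing topics: {string.Join(", ", missing)}");
    }
}
```

Note that a check like this passing does not guarantee the consumer's own metadata cache is warm at the moment Subscribe/Consume are called, which may be relevant to the race described below.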

However, shortly after subscribing we see a number of the following errors (which stop after a short while by themselves):

Confluent.Kafka.ConsumeException: Broker: Unknown topic or partition
at Confluent.Kafka.Consumer`2.Consume(Int32 millisecondsTimeout)

I wonder if this may be a race condition (or perhaps a misunderstanding on our part about safety guarantees) between calls to the consumer's 'Subscribe' and 'Consume'. Just a guess, but the reason is that we can call 'Subscribe' multiple times in quick succession with a growing list of topics on start-up. I wonder if we receive a message (via Consume) while 'Subscribe' is being called, and whether this can cause problems.

Another relevant point in our setup is that this particular app uses a custom 'SetPartitionsAssignedHandler'. Interestingly, this is invoked AFTER we start receiving messages (should that even be the case?).

For context, the handler itself is fairly straightforward. In this app, we do not store offsets from the consumer (as we don't require messages sent while the app was down to be processed). When we start the app, we may miss a few messages between application start and when the partitions are assigned (around 3 seconds). The handler sets the 'TopicPartitionTimestamp' value to the time of application start, rather than the time the partitions were assigned, to avoid missing those messages.
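The kind of handler described above could look like this (a sketch under assumptions: `appStartTime` is captured at startup, `config` is a pre-built `ConsumerConfig`, and key/value types are placeholders):

```csharp
using System;
using System.Linq;
using Confluent.Kafka;

var appStartTime = DateTime.UtcNow;  // captured at application start

var consumer = new ConsumerBuilder<string, string>(config)
    .SetPartitionsAssignedHandler((c, partitions) =>
    {
        // Ask the brokers for the first offset at or after the app-start timestamp
        // on each newly assigned partition.
        var timestamps = partitions
            .Select(tp => new TopicPartitionTimestamp(tp, new Timestamp(appStartTime)))
            .ToList();

        var offsets = c.OffsetsForTimes(timestamps, TimeSpan.FromSeconds(10));

        // Returning the offsets tells the consumer where to start each partition,
        // overriding the default (committed/auto.offset.reset) position.
        return offsets;
    })
    .Build();
```

One thing to watch with `OffsetsForTimes`: for partitions with no message at or after the given timestamp, the returned offset may be unset, so a production handler may need to fall back to a default position for those partitions.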

How to reproduce

Apologies: providing a full code reproduction of the issue is tricky, as it manifests within a complex app. Additionally, the error does not occur reliably, so a reproduction is difficult regardless.

Below is a description of what happens in order to cause the issue (list of topics obviously simplified)

The reason for the above pattern is that our apps may wish to subscribe to different topics at different times. As far as I am aware, the 'Subscribe' API on consumer only allows us to 'add' a subscription by replacing the list of subscribed topics with the new (larger) list.

Any input in understanding why this issue is occurring and what we can do about it would be greatly appreciated!


mhowlett commented 2 years ago

Just a guess, but the reason is that we can call 'Subscribe' multiple times in quick succession with a growing list of topics on start-up.

This is unusual, and I suspect you're hitting a bug on a little used/tested usage pattern.

Suggest collecting all topics to be subscribed to, then calling Subscribe only once, as a workaround.
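In other words (a minimal sketch; topic names are placeholders and `consumer` is an already-built consumer):

```csharp
using System.Collections.Generic;
using Confluent.Kafka;

// Gather the full set of topics up front...
var allTopics = new List<string> { "topic-a", "topic-b", "topic-c" };

// ...then issue a single Subscribe with the complete list, instead of
// repeated calls with a growing list.
consumer.Subscribe(allTopics);
```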

We should add some integration test cases related to this.

michael-c-spx commented 2 years ago

Thanks for the update. For our use-case we need flexibility in how (and when) apps subscribe to and unsubscribe from individual topics, which is why we can't really subscribe to everything at once.

In the spirit of the suggestion, though, we've added some debounce/batching logic around the subscription calls to reduce how often we can hit 'Subscribe' in quick succession, which seems to work around the issue for now.
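One way such debouncing could look (a sketch, not the reporter's actual implementation; the quiet-period length is an arbitrary choice): requests are accumulated and flushed once after a quiet period, so Subscribe fires at most once per burst.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Confluent.Kafka;

class DebouncedSubscriber
{
    private readonly IConsumer<string, string> consumer;
    private readonly HashSet<string> topics = new();
    private readonly Timer timer;
    private readonly object gate = new();

    public DebouncedSubscriber(IConsumer<string, string> consumer)
    {
        this.consumer = consumer;
        // Timer stays disarmed until the first request arrives.
        timer = new Timer(_ => Flush(), null, Timeout.Infinite, Timeout.Infinite);
    }

    public void RequestSubscribe(string topic)
    {
        lock (gate)
        {
            topics.Add(topic);
            // Restart the quiet-period window on every request, so a burst
            // of requests results in a single Subscribe call at the end.
            timer.Change(TimeSpan.FromMilliseconds(500), Timeout.InfiniteTimeSpan);
        }
    }

    private void Flush()
    {
        lock (gate)
        {
            // Single Subscribe call with the full accumulated topic set.
            consumer.Subscribe(topics);
        }
    }
}
```

A design note: calling Subscribe from the timer callback means it runs on a thread-pool thread; if the app also calls Consume from a dedicated thread, the `gate` lock only serializes the subscription requests, not Subscribe versus Consume, which may still matter given the suspected race.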

nhaq-confluent commented 8 months ago

@spx-michael-c can this be closed?

michael-c-spx commented 7 months ago

We have a workaround but as far as I am aware this is still a bug in the underlying library, up to your triage process in terms of whether or not to close this.

If you have a specific version/changeset where you believe the issue was resolved please let me know and I can review further.