Farfetch / kafkaflow

Apache Kafka .NET Framework to create applications simple to use and extend.
https://farfetch.github.io/kafkaflow/
MIT License

[Bug Report]: Consumer worker stalls in ASP .NET Framework #563

Closed: gnjack closed this issue 4 months ago

gnjack commented 5 months ago


Description

When using KafkaFlow in an IIS-hosted ASP.NET (.NET Framework) application, the worker loop can hang / deadlock, stopping message processing on that worker. Message consumption continues on the other workers, but all message processing eventually stops because the deadlocked worker cannot commit any offsets and its buffer fills up.

We've spent a long time looking at dumps of the stalled processes and haven't managed to pin down the exact cause.

We suspect we're hitting async deadlocks on the ASP.NET SynchronizationContext somewhere within KafkaFlow. After adding ConfigureAwait(false) to every await within KafkaFlow, we no longer see any stalls / deadlocks in the consumer workers.
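
For anyone unfamiliar with the failure mode, below is a minimal sketch of the class of deadlock we suspect (illustrative only; the method names are hypothetical and this is not KafkaFlow code). An await without ConfigureAwait(false) posts its continuation back to the captured ASP.NET SynchronizationContext, and if that context is blocked waiting on the same task, neither side can make progress:

```csharp
// Illustrative only: the classic SynchronizationContext deadlock. The method names
// are hypothetical stand-ins, not KafkaFlow code.
using System.Threading.Tasks;

public static class DeadlockSketch
{
    public static async Task LibraryMethodAsync()
    {
        // Without ConfigureAwait(false), the continuation is posted back to the
        // captured SynchronizationContext (the ASP.NET request context).
        await Task.Delay(100);
    }

    public static async Task SafeLibraryMethodAsync()
    {
        // ConfigureAwait(false) resumes on a thread-pool thread instead, so the
        // continuation never needs the captured context to be free.
        await Task.Delay(100).ConfigureAwait(false);
    }

    public static void BlockingCaller()
    {
        // On a thread that owns an exclusive SynchronizationContext (e.g. a legacy
        // ASP.NET request thread), blocking on LibraryMethodAsync() deadlocks: the
        // context is busy here while the continuation is queued to run on it.
        // The same blocking call on SafeLibraryMethodAsync() completes normally.
        SafeLibraryMethodAsync().GetAwaiter().GetResult();
    }
}
```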

Steps to reproduce

We've found this very difficult to reliably reproduce, sometimes taking several hours for a worker to stall. We've never managed to reproduce a stall while debugging.

Assuming our SynchronizationContext theory is correct, this would only affect legacy ASP.NET Framework applications (or UI apps), not modern .NET / ASP.NET Core applications, which typically have no SynchronizationContext.
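
As a quick environment check, the hypothetical helper below (not part of KafkaFlow) reports whether the current thread has a SynchronizationContext installed; on a legacy ASP.NET request thread this is typically an AspNetSynchronizationContext, while ASP.NET Core and console applications normally report none:

```csharp
// Hypothetical diagnostic helper, not part of KafkaFlow.
using System;
using System.Threading;

public static class ContextProbe
{
    public static void Report()
    {
        var context = SynchronizationContext.Current;
        Console.WriteLine(context == null
            ? "No SynchronizationContext - awaits resume on the thread pool."
            : $"SynchronizationContext present: {context.GetType().FullName}");
    }
}
```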

Expected behavior

KafkaFlow should follow best practice for general-purpose libraries and use ConfigureAwait(false) on all awaits. This should ensure the library is safe to use wherever a thread-bound SynchronizationContext is present, and it appears to fix the deadlocks / stalls we are experiencing.

See here for more info on why libraries should use ConfigureAwait(false).

Actual behavior

One worker in the pool can stall / deadlock inside its background task processing loop, stopping message processing on that worker. As described above, consumption continues on the other workers until the stalled worker's uncommitted offsets and filling buffer eventually bring all message processing to a halt.

The stalled worker fires the MessageConsumeStarted and MessageConsumeCompleted global events for the last successfully committed message, but never fires MessageConsumeStarted for the next message, even when the dump shows messages waiting in the buffer. So the stall does not appear to be inside our middleware / handlers: the worker completes the previous message and simply never starts the next one.

With additional debug logging added, the last log line printed is just before reader.WaitToReadAsync, and no log is ever written for reader.TryRead. Again, this is while the dumps show messages in the worker's buffer; strangely, the channel reader has no waiting readers despite the worker appearing to be stalled on reader.WaitToReadAsync.
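
For reference, here is a minimal sketch of the WaitToReadAsync / TryRead consumption pattern described above (an assumption about the general shape of the loop, not KafkaFlow's actual worker source), with ConfigureAwait(false) applied as proposed and comments marking where our debug logging sat:

```csharp
// Sketch of a channel-based worker loop, assuming System.Threading.Channels; this is
// not the actual KafkaFlow implementation.
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public class WorkerLoopSketch
{
    private readonly ChannelReader<object> reader;

    public WorkerLoopSketch(ChannelReader<object> reader) => this.reader = reader;

    public async Task RunAsync(Func<object, Task> process, CancellationToken stopToken)
    {
        while (true)
        {
            // Debug log here: the last line we ever see from the stalled worker.
            if (!await reader.WaitToReadAsync(stopToken).ConfigureAwait(false))
            {
                break; // channel completed, no more messages will arrive
            }

            // Debug log here: never reached on the stalled worker, even with
            // messages visibly buffered in the dump.
            while (reader.TryRead(out var message))
            {
                await process(message).ConfigureAwait(false);
            }
        }
    }
}
```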

KafkaFlow version

3.0.7