ConduitIO / conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
https://conduit.io
Apache License 2.0

Bug: error cause for degraded pipeline might not be correct #1659

Open lovromazgon opened 1 week ago

lovromazgon commented 1 week ago

Bug description

Under certain conditions, the error reported as the cause of a degraded pipeline is not the error that actually caused the pipeline to stop.

Imagine a running pipeline that is continuously processing records. Suddenly the source connector (plugin) hits a problem and returns an error. Returning an error closes the bidirectional stream between the connector and Conduit, so records can no longer be passed to Conduit and acknowledgments can no longer be passed back to the connector. If there are still unprocessed records in the pipeline, we have a race condition: either the source node sees the closed stream first when it tries to read the next record, or the acker node gets an error first when it tries to send an acknowledgment to the source connector. If the acker node is the first to get the error, it stops running and returns that error, which is then stored as the error that caused the stop. While that is technically correct, the error contains only io.EOF, which is not useful to the user: it signals that the stream stopped, not why it stopped. The actual reason for the stop is only received when reading from the stream in the source node. That error is logged, but it is not surfaced anywhere else (e.g. in API responses or the UI).

Steps to reproduce

I have a failing test that consistently reproduces this error. I will link it here once I push the code.

Version

v0.10.1