WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Python Connector Protocol client is missing retry when NotifyAck notify_success is false #3106

Closed slfritchie closed 4 years ago

slfritchie commented 4 years ago

Is this a bug, feature request, or feedback?

Bug/feature

What is the current behavior?

The Connector Protocol implementation in the Python client library, when it gets a NotifyAck message with notify_success=false, does not resend a Notify at a later time. The visible effect is that the client gets "stuck" with an existing TCP connection. The client will remain stuck until the TCP connection is closed by Wallaroo, typically by a crash.

Intermittent CI test failures such as https://circleci.com/gh/WallarooLabs/wallaroo/28539 are caused by a race between this behavior of the client library and a conformance test. The test causes the Python client to re-connect and send its first and only Notify for the Stream IDs under test. Wallaroo always sends NotifyAck with notify_success=false while it is in the middle of a rollback procedure. Because the test's disconnect and re-connect are themselves triggered by a Wallaroo rollback, the test hangs if the rollback has not finished before the client's Notify arrives.

What is the expected behavior?

Periodic retries of the Notify message are needed in this case. More generally, retries are also needed for the Python client library to be useful.
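
A rough sketch of what such a retry loop could look like, not the actual client library code. The names `notify_with_retry`, `send_notify`, and `RollbackSimulator` are hypothetical and exist only for illustration. `send_notify` is assumed to send one Notify for a Stream ID and return the `notify_success` value from the matching NotifyAck. The capped exponential backoff and the attempt limit are also assumptions, since the issue asks only for periodic retries.

```python
import time


def notify_with_retry(send_notify, stream_id, max_attempts=5,
                      base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Resend Notify until NotifyAck reports notify_success=true.

    send_notify is a hypothetical callable: it sends one Notify for
    stream_id and returns the notify_success flag of the NotifyAck.
    Returns the number of attempts used, or raises TimeoutError once
    max_attempts have all been rejected.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if send_notify(stream_id):
            return attempt
        if attempt < max_attempts:
            # Rejected, e.g. because Wallaroo is still rolling back.
            # Wait, then try again with a longer (capped) delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(
        "Notify for stream %r rejected %d times" % (stream_id, max_attempts))


class RollbackSimulator:
    """Stand-in for Wallaroo (illustration only): rejects the first
    `rejections` Notify messages, as it would during a rollback."""

    def __init__(self, rejections):
        self.remaining = rejections

    def __call__(self, stream_id):
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True
```

For example, `notify_with_retry(RollbackSimulator(3), stream_id=7)` is rejected three times and succeeds on the fourth attempt instead of leaving the client stuck. The `sleep` parameter is injectable so that a test can record the delays rather than actually wait.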