apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python SDK data plane can become stuck on instructions that had an exception while calling BundleProcessor setup #31571

Open scwhittle opened 4 months ago

scwhittle commented 4 months ago

What happened?

When receiving elements over the gRPC stream, the data plane dispatches them to per-bundle queues, each of which has a maximum size. If nothing consumes from one of these queues, it fills up and the data plane can block forever.
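The blocking behavior can be reproduced in isolation with a bounded queue. This is a minimal sketch, not Beam code: `dispatch` is a hypothetical stand-in for the loop that reads the gRPC stream, and the timeout exists only so the example terminates instead of hanging.

```python
import queue


def dispatch(elements, per_bundle_queue, timeout_s=0.1):
    """Hypothetical stand-in for the data plane reader loop.

    Returns (number_delivered, stalled). A put() without a timeout
    would block forever once the queue is full and has no consumer.
    """
    delivered = 0
    for element in elements:
        try:
            per_bundle_queue.put(element, timeout=timeout_s)
        except queue.Full:
            # Nothing is draining the queue, so the reader is stuck here.
            return delivered, True
        delivered += 1
    return delivered, False


# A bounded per-bundle queue with no consumer attached.
bundle_queue = queue.Queue(maxsize=2)
delivered, stalled = dispatch(range(5), bundle_queue)
```

Because a single reader typically serves every instruction on the stream, a stall like this on one queue would also hold up data destined for other bundles.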

https://github.com/apache/beam/commit/216f0d9f80c3f2a169e139a0818a1b6b059f3219 fixed this issue for the case where an exception is raised while processing data for an instruction, by removing the channel and remembering that it had been cleaned up. https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L578

However, if an exception occurs during setup of the bundle processor, _clean_receiving_queue is not invoked. https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py#L687 appears to only propagate the failure to the response on the control stream: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py#L312

This hang was observed with the Beam Python SDK 2.38, but it looks like it could still be triggered by exceptions during bundle processor setup (which currently includes the user DoFn setup method).
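To make the gap concrete, here is a toy model of the cleanup idea applied to the setup path. It is only a sketch under assumptions: `ToyDataChannel`, `deliver`, `clean_receiving_queue` and `process_bundle` are hypothetical names that mimic the behavior described above, not the actual classes in data_plane.py or sdk_worker.py.

```python
import queue


class ToyDataChannel:
    """Hypothetical, simplified model of a data channel with bounded queues."""

    def __init__(self, max_queue_size=2):
        self._max_queue_size = max_queue_size
        self._queues = {}
        self._cleaned_instruction_ids = set()

    def deliver(self, instruction_id, element):
        """Returns False if the element was dropped for a cleaned-up instruction."""
        if instruction_id in self._cleaned_instruction_ids:
            return False
        q = self._queues.setdefault(
            instruction_id, queue.Queue(self._max_queue_size))
        # Timeout only so this sketch raises queue.Full instead of hanging.
        q.put(element, timeout=0.1)
        return True

    def clean_receiving_queue(self, instruction_id):
        """Drops the queue and remembers that the instruction was cleaned up."""
        self._queues.pop(instruction_id, None)
        self._cleaned_instruction_ids.add(instruction_id)


def process_bundle(channel, instruction_id, setup):
    """Runs setup; on failure, cleans up the queue before re-raising."""
    try:
        setup()
    except Exception:
        channel.clean_receiving_queue(instruction_id)
        raise
    return "processed"


def failing_setup():
    raise RuntimeError("DoFn setup failed")


channel = ToyDataChannel(max_queue_size=1)
try:
    process_bundle(channel, "inst-1", failing_setup)
    setup_raised = False
except RuntimeError:
    setup_raised = True

# Data that still arrives for the failed instruction is dropped, not queued.
results = [channel.deliver("inst-1", i) for i in range(10)]
```

In this model the failure still propagates to the caller, so it can be reported on the control stream as before; the only change is that late data for the failed instruction no longer fills a queue nobody reads.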

Some possible fixes:

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

scwhittle commented 1 month ago

@tvalentyn what are your thoughts on how to best fix this?