Open jd185367 opened 2 months ago
https://github.com/apache/beam/issues/24209 solved this for RunInference.
@liferoad while that's helpful as a pattern to look at, I don't think that solves the general issue of catching exceptions in transforms for 2 reasons:
- That error-wrapping only applies to the `RunInference` transform, specifically.
- The error wrapping was possible for `RunInference` since most of its work was just calling a single `DoFn`, so it could use the existing `DoFn.with_exception_handling()` method. For transforms that call other transforms, this pattern isn't possible unless every single called transform implements it (which'd require all those transforms to catch their errors this way, etc.) - which basically boils down to forcing every transform to implement its own exception handling. That doesn't give the option of just adding a top-level error handler (like my suggestion), which'd be more maintainable IMO.
I agree with what you said. I just wanted to list the current implementations for error handling. Also, https://github.com/apache/beam/pull/29164 introduces `withBadRecordHandler` for Java to handle errors in IO transforms.
What would you like to happen?
Add a way to handle uncaught runtime exceptions thrown within a transform to the Python SDK, e.g. something like a transform-level `with_exception_handling()` method. This already exists for DoFns in `DoFn.with_exception_handling`, and the Java SDK appears to offer something similar for PTransforms: https://beam.apache.org/releases/javadoc/2.15.0/index.html?org/apache/beam/sdk/transforms/WithFailures.html
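To make the request concrete, the pattern `DoFn.with_exception_handling` provides (and which this issue asks to generalize to whole transforms) can be sketched in plain Python. This is a minimal illustration with no Beam dependency; `safe_apply` is a hypothetical helper, not a Beam API:

```python
def safe_apply(fn, elements):
    """Apply fn to each element, partitioning results into (good, bad).

    Mirrors the dead-letter shape of Beam's DoFn.with_exception_handling():
    failures become (element, exception) pairs instead of failing the bundle.
    """
    good, bad = [], []
    for element in elements:
        try:
            good.append(fn(element))
        except Exception as exc:  # a top-level handler would catch everything
            bad.append((element, exc))
    return good, bad


good, bad = safe_apply(int, ['1', '2', 'oops', '3'])
# good == [1, 2, 3]; bad holds ('oops', ValueError(...))
```

The feature request is essentially this partition-into-good/bad behavior, but attached once at the top of a pipeline (or a composite transform) rather than re-implemented inside every step.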
Motivation
Google Cloud Dataflow automatically retries failed messages in streaming jobs; however, messages that cause runtime errors due to bad data can be retried infinitely and block other messages from being processed. The only fix we've found is to drain and restart the pipeline to flush the bad messages, which is manual and risks losing data. There's no way to set a maximum number of retries per message. While we try to parse and validate messages up-front as much as possible, bugs have slipped through to production and caused runtime errors (and obviously, we can't prevent 100% of bugs).
Being able to add a top-level error handler to the pipeline (or a root transform) would solve this, since in a worst-case scenario we could catch any failed messages/collections, log them, and not block the rest of the pipeline.
Right now, though, adding a top-level exception handler isn't possible. For instance, this example will not catch the raised error in Apache Beam 2.56.0, which is very unintuitive:
Output:
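Part of why the try/except doesn't fire is Beam's deferred execution model: applying a transform only records it in the pipeline graph, so code wrapped around pipeline construction runs before any element is processed. A pure-Python sketch of that timing gap (the `DeferredStep` class is hypothetical, standing in for graph construction; it is not a Beam API):

```python
class DeferredStep:
    """Records a function now; executes it only when run() is called later."""

    def __init__(self, fn):
        self.fn = fn

    def run(self, elements):
        return [self.fn(e) for e in elements]


caught_at_build_time = False
try:
    step = DeferredStep(lambda x: 1 / 0)  # nothing executes here
except ZeroDivisionError:
    caught_at_build_time = True  # never reached: building raises nothing

# The error only surfaces later, at execution time (on a worker in Beam),
# far from any try/except written around pipeline construction:
try:
    step.run([1, 2, 3])
except ZeroDivisionError:
    print("raised at run time, not at build time")
```

In a real pipeline the `run()` call happens on distributed workers, so there is no single call site where user code could wrap the whole execution in a try/except, which is why a framework-level handler is needed.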
The only solution we've found is to add this sort of error handling separately to every pipeline step, which isn't maintainable (e.g. if we have hundreds of DoFns, adding try-except blocks to all of them individually is labor-intensive).
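The per-step workaround described above can be made less labor-intensive with a shared decorator, but it still has to be applied to every DoFn individually. A hedged sketch of that workaround (the `catch_and_tag` decorator is hypothetical, not a Beam API), routing failures into tagged error records instead of crashing:

```python
import functools


def catch_and_tag(process_fn):
    """Wrap a DoFn-style generator so failures are emitted as
    ('error', (element, exception)) records instead of raising."""
    @functools.wraps(process_fn)
    def wrapper(element, *args, **kwargs):
        try:
            yield from process_fn(element, *args, **kwargs)
        except Exception as exc:
            yield ('error', (element, exc))
    return wrapper


@catch_and_tag
def parse(element):
    yield int(element)


results = [r for e in ['1', 'bad'] for r in parse(e)]
# results: [1, ('error', ('bad', ValueError(...)))]
```

Even with this helper, every DoFn must opt in, and transforms composed of other transforms are not covered at all, which is the maintainability gap a top-level handler would close.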
Related Issues
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components