apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Enhance TaskContext and add task failure root cause #3410

Open mingmwang opened 1 year ago

mingmwang commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] (This section helps Arrow developers understand the context and why for this feature, in addition to the what)

In current Datafusion/Ballista plan's execution path, the error result mapping is a chaos, the root cause of the task failure is not clear, especially when the plan tree contains pipeline breakers.
To make the error handling clear and make it easy to reasoning about the real task failure reason, the suggestion is to add a new method to TaskContext to set the task failure root cause into TaskContext.

Below is an example of Ballista shuffle read execution path, the arrow '->' represents the error/result return flow.

BallistaClient.fetch_partition(): BallistaError::General -> Ballista ShuffleReaderExec.fetch_partition(): DataFusionError::Execution -> Ballista ShuffleReaderExec.execute() : SendableRecordBatchStream -> ->RecordBatchStreamAdapter.poll_next(): ArrowResult::ExternalError ->

We can see the root error is mapped to Ballista Error -> DataFusion Error -> Arrow Error. And depends on how the stream is polled and consumed, the Arrow Error can be mapped to DataFusion Error or Ballista Error again. There are back and forth error mapping in the code base, this makes the root case reasoning become a challenge.

Want to hear your thoughts.

@andygrove @alamb @thinkharderdev

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

mingmwang commented 1 year ago

@yahoNanJing

alamb commented 1 year ago

I agree the wrapping is confusing and I think in general the wrapping of errors is not valuable. We see similar wrapping obfuscation in IOx as well.

The datafusion code already tries to unwrap an ArrowError here (when converting from DataFusionError --> ArrowError)

https://github.com/apache/arrow-datafusion/blob/master/datafusion/common/src/error.rs#L209-L217

I wonder if we can do something similar in Ballista? Or maybe write some "unwrapper" that would strip out the useless wrapping errors before display.

yahoNanJing commented 1 year ago

Thanks Alamb, the context here is not just for displaying. Recently we are working on error handling and error recovering which is useful for distributed execution. And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

alamb commented 1 year ago

And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

That is a neat usecase and it makes sense.

Arrow and DataFusion don't have very specific errors (unlike some other rust crate where the specific error call site often has its own unique variant). This makes coding easier as there is less error machinery to maintain, but it makes distinguishing between errors potentially more challenging -- All you have is the handful of variants and a message.

As the rest of the rust ecosystem works on stabalizing "error context" (aka being able to walk up the chain of error messages) maybe finding the root cause will become simpler.

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

yahoNanJing commented 1 year ago

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

Agree that it's a more standard way and it would be easier for error recovering purpose. @mingmwang, what do you think?

mingmwang commented 1 year ago

@alamb @yahoNanJing This would help and this approach is more rusty. But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily. If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

        tokio::spawn(async move {

        .............................
        });
alamb commented 1 year ago

But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily. If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Yes I agree care would be needed in various cases. I am not opposed to adding information to the TaskContext -- perhaps we can brainstorm additional improvements to errors / error handling as a follow on project. I know @tustvold has some ideas in https://github.com/apache/arrow-rs/pull/2711 etc so maybe it would be part of a broader ecosystem improvment

tustvold commented 1 year ago

If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics. This has been a frequent pain point, see the work by @crepererum to fix this https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+is%3Aclosed+author%3Acrepererum. Longer term I'm hoping to move away from tokio see https://github.com/apache/arrow-datafusion/issues/2504, in part to do away with these types of problems, but I'm not sure when I'll get back to working on that. In the meantime, a tokio::spawn that doesn't somehow await the JoinHandle is an anti-pattern IMO.

additional improvements to errors / error handling

If we are to go down this route, I think the question to ask is "what users are there of the current structured error handling". In particular, could we just use something like anyhow which we already have a transitive dependency on as a result of prost... This has a fairly straightforward API for downcasting errors - https://docs.rs/anyhow/latest/anyhow/trait.Context.html#effect-on-downcasting

And we hope to do error recovering for some special error case and we need to know the root cause of the error.

@mingmwang Perhaps you could expand on the particular error cases you are looking to handle. The major one that comes to my mind are network failures, which I would have thought would make more sense to handle at the layer responsible for network calls. Perhaps see https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L121 for inspiration??

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

This seems to be where the ecosystem as a whole is headed, and makes sense to me. See https://github.com/rust-lang/rust/issues/58520 for work to stabilize a mechanism to "walk" the tree

crepererum commented 1 year ago

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics.

It's not only about error handling though but also about task cancellation. Esp. in a server application, it's good to stop a query from running when the original request is cancelled. Uncaptured task handles will prevent this from happening, since tokio will still drive the execution.

alamb commented 1 year ago

See https://github.com/rust-lang/rust/issues/58520 for work to stabilize a mechanism to "walk" the tree

Yes, this is what I really want out of errors in DataFusion and Arrow

alamb commented 1 year ago

Filed https://github.com/apache/arrow-rs/issues/2725 to discuss better error handling in arrow (and thus also datafusion and ballista)