apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Add support for SessionState in supports_filters_pushdown for a Custom Data Source #11193

Open cisaacson opened 2 days ago

cisaacson commented 2 days ago

Is your feature request related to a problem or challenge?

We need the ability to get the TaskContext.task_id any place where a Custom Data Source is invoked. As it stands currently, the state: &SessionState is available in TableProvider.scan and task_ctx: Arc<TaskContext> is available in ExecutionPlan.execute, but not in the supports_filters_pushdown. This prohibits per-query customization or tracking of external state in this method. For example if there are 3 filters for a custom table, and 10 are possible, we need to be able to choose the best one at runtime.

Further, the task_id should always be available by passing the TaskContext or from SessionState to keep things consistent.

In trying to implement this it proved infeasible because supports_filters_pushdown is in 2 interfaces in 2 separate crates: TableProvider (in core) and TableSource (in expr). It is not possible to add state: &SessionState to the TableSource implementation as it cannot access the core crate, a cyclic dependency occurs the way it is now. This was intentional to make LogicalPlan separable, which makes sense, but preventing this type of enhancement.

Describe the solution you'd like

Add &SessionState or minimally TaskContext in every pertinent method for per-query specific processing in a custom data source.

A possible way to solve this is to make a new datafusion-traits crate, and to move SessionState and other common items to datafusion-common, such that these components are used by core and expr. It will make some components available in expr that are not strictly necessary, but I think that is a good trade-off. This work could be combined with other efforts to break core into more sub-crates, that would make DataFusion much more flexible overall.

Describe alternatives you've considered

No response

Additional context

Restructuring crates in a project of this size will be a lot of work, but I believe the benefit will be there. There are other issues that also would benefit. I would recommended a separate restructure ticket that can be reviewed before any implementation is attempted. In addition then this would need to be implemented by multiple contributors, it will inevitably cause a lot of temporary breakage and retesting will also be required.

alamb commented 2 days ago

I think the usecase of passing some sort of state to TableProvider methods is a good and useful thing

https://docs.rs/datafusion/latest/datafusion/datasource/provider/trait.TableProvider.html

One way is to add a SessionState parameter as you suggest to all the methods. Given that it is already passed to scan, this seems a pretty reasonable change to me. While it would be an API change it would be very mechanical to add.

However, it seems like if we ever are going to break apart the core crate more, we'll have to figure out how to split out SessionContext. We have a related discussion here: https://github.com/apache/datafusion/issues/11182

Also related slack coversation in ASF slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1719691754005389

cisaacson commented 2 days ago

Thanks @alamb , makes very good sense. We can separate the 2 concerns. The only issue I see is that TableSource also references supports_filters_pushdown, that is where I got stuck trying to implement that. If we can figure out how to expose SessionState to TableSource this won't be hard to do at all. Let me know if I can help. If we implement this we should also make all args as SessionState (right now execute uses TaskContext, they should be the same).

alamb commented 2 days ago

🤔

I believe the core reason for having TableSource is precisely to avoid the dependencies on SessionState (so it is possible to plan SQL without having the entire datafusion stack)

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.TableSource.html#method.supports_filter_pushdown

cisaacson commented 2 days ago

@alamb That's right, and therein lies the conundrum... We would need to solve that somehow, that's why I suggested the restructuring. If that were possible I would be submitting a PR to do this 😄

alamb commented 2 days ago

When does the TableSource::supports_filter_pushdown get called I wonder 🤔 Is it possible just to use TableProvider

cisaacson commented 1 day ago

@alamb Good point, I could thought of that but was not sure. I found this: datafusion/expr/src/logical_plan/plan.rs which calls TableSource.supports_filters_pushdown but that is only for Display of a LogicalPlan (certainly not critical). Further, the TableSource.supports_filters_pushdown is @deprecated, so maybe the best thing is to just remove all TableSource.supports_pushdown_filters altogether? That would be nice and much easier.

If you like that idea I can experiment in my local clone of the project.

alamb commented 2 hours ago

Further, the TableSource.supports_filters_pushdown is @deprecated, so maybe the best thing is to just remove all TableSource.supports_pushdown_filters altogether? That would be nice and much easier.

If it is deprecated, I think making a PR to remove the deprecated functions would certainly be a good step forward as it would keep things simpler

cisaacson commented 1 hour ago

@alamb That would be great. I do wonder though, the TableSource is used when the LogicalPlan invokes supports_filters_pushdown, correct? I don't know enough to say if that is required, or if it is only in a PhysicalPlan which is all a custom data source cares about.