aucahuasi closed this issue 3 years ago
These points can help us define a better strategy for the plan (for the case where predicate pushdown was already applied to the data):
cc @rommelDB @felipeblazing @williamBlazing
A couple of clarifications: there is a LogicalProject and a LogicalFilter, which are relational algebra steps. There are also projections and filter clauses in the BindableTableScan relational algebra step. LogicalProject and LogicalFilter both use interops. The filter step in the BindableTableScan also uses interops and usually behaves just like the LogicalFilter relational algebra step. The projections clause in BindableTableScan just tells the provider/parser which columns to fetch.
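For reference, a plan containing both kinds of steps might look roughly like this (an illustrative sketch in Calcite-style notation, not output copied from BlazingSQL; the table and column indices are made up):

```
LogicalProject(o_totalprice=[$1])
  BindableTableScan(table=[[main, orders]], filters=[[>($0, 10)]], projects=[[0, 3]])
```

Here the `filters` clause is what a provider could evaluate at fetch time, while `projects` only names the columns to read.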
For the purposes of a SQL provider, we want the SQL provider to perform the filter from the BindableTableScan, because then there is less data to parse, which is an expensive step. And if we make the SQL provider perform the filter, then we don't want interops to do the filter as well. If we did, we would be doing a copy and an operation that are totally unnecessary.
We do not want to change anything about the physical plan. The SQL providers have nothing to do with a LogicalFilter relational algebra step, nor a LogicalProject.
What we want to do is make the filter step of the BindableTableScan skippable in a way that is generalizable and does not necessarily have to check whether it's a SQL provider.
I would suggest that we give the provider a more general function like filtered() that returns a bool (true if it already applied a filter). This way it's a property of the provider and does not necessarily have anything to do with SQL providers. Ultimately we would really need to look at the code to see what makes the most sense.
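The suggestion above could be sketched as a virtual property on the provider base class. All names here are hypothetical for illustration, not BlazingSQL's actual classes:

```cpp
// Minimal sketch of the suggested filtered() property (hypothetical names).
struct data_provider {
    virtual ~data_provider() = default;
    // true if this provider already applied the scan's filter while fetching
    virtual bool filtered() const { return false; }
};

// A SQL-backed provider pushes the predicate into the generated query,
// so it can report the data as already filtered.
struct mysql_provider : data_provider {
    bool filtered() const override { return true; }
};

// A file-based provider (e.g. CSV) keeps the default: no pushdown.
struct csv_provider : data_provider {};
```

The kernel then only consults the property, without ever asking "is this a SQL provider?".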
For sure, a provider can implement itself any way the implementer deems necessary, and it's perfectly normal, if the data is coming in already filtered, to NOT include the filter step inside of the BindableTableScan. As long as all the changes are performed within the table scan, this is fine.
> And if we made the SQL provider perform the filter, then we don't want interops to do the filter as well. If we did, then we would be doing a copy and an operation that would be totally unnecessary.
Exactly, that is my main concern. I already have a first implementation of predicate pushdown for the MySQL provider, but I'm thinking we don't need to waste resources here if the data arrives already filtered and evaluated.
To point 2: table scans already take filters and apply them. Whether this is done by the data provider or using interops after the data is loaded should be decided on a case-by-case basis inside of the kernel.
Great, so I would need to tell the BindableTableScan kernel that the data was already filtered, and thus avoid performing the filter step there.
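That skip could look like the following inside the kernel. This is a toy sketch under assumed names (`already_filtered`, `apply_filter`, a vector standing in for a table batch), not the real kernel code:

```cpp
#include <vector>

// Hypothetical provider flag and a toy table batch.
struct provider { bool already_filtered = false; };
using table = std::vector<int>;

// Stand-in for the interops filter evaluation (which also implies a copy).
table apply_filter(const table& batch) {
    table out;
    for (int v : batch)
        if (v > 10) out.push_back(v);
    return out;
}

table bindable_table_scan(const provider& p, table batch) {
    // Skip the filter step (and its copy) when the provider already
    // filtered the rows at the source.
    if (!p.already_filtered)
        batch = apply_filter(batch);
    return batch;
}
```

The decision lives entirely inside the table scan kernel, which keeps the change local as discussed above.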
Related to https://github.com/BlazingDB/blazingsql/issues/1395
Currently the provider API only allows us to select which columns we want to fetch; a much more flexible design would allow providers to apply a filter at the moment they fetch data. This is especially important for SQL providers!
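One way such an API could look: the fetch call takes an optional predicate, and a SQL provider translates it into a WHERE clause instead of filtering (and copying) after the fetch. The function name and signature below are hypothetical, not the existing provider API:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: build the query a SQL provider would run, with the
// pushed-down predicate rendered as a WHERE clause.
std::string build_query(const std::string& table_name,
                        const std::vector<std::string>& columns,
                        const std::optional<std::string>& predicate) {
    std::string q = "SELECT ";
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (i) q += ", ";
        q += columns[i];
    }
    q += " FROM " + table_name;
    if (predicate)
        q += " WHERE " + *predicate;  // the pushed-down filter
    return q;
}
```

With no predicate this degenerates to today's column selection; with one, the database does the filtering before any parsing happens on our side.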