BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0

Support predicate push down for data providers #1417

Closed · aucahuasi closed this issue 3 years ago

aucahuasi commented 3 years ago

Related to https://github.com/BlazingDB/blazingsql/issues/1395

Currently the provider API only allows us to select which columns we want to fetch; a more flexible design would let the providers apply a filter at the moment they are fetching data. This is especially important for SQL providers!
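
For illustration, here is a minimal C++ sketch of the contrast being described. The class and method names (`column_selecting_provider`, `predicate_pushdown_provider`, `fetch`, the string predicate) are invented for this example and are not the actual BlazingSQL provider interface:

```cpp
// Hypothetical sketch only: the class and method names below are invented for
// illustration and are not the actual BlazingSQL provider API.
#include <memory>
#include <string>
#include <vector>

// Stand-in for whatever table/batch type a provider hands back to the engine.
struct table_batch {};

// Today's shape: the caller can only narrow the set of columns to fetch.
class column_selecting_provider {
public:
    virtual ~column_selecting_provider() = default;
    virtual std::unique_ptr<table_batch>
    fetch(const std::vector<std::string>& columns) = 0;
};

// Proposed shape: the caller may also pass the scan's filter expression so the
// provider can apply it while fetching, instead of the engine filtering later.
class predicate_pushdown_provider {
public:
    virtual ~predicate_pushdown_provider() = default;
    virtual std::unique_ptr<table_batch>
    fetch(const std::vector<std::string>& columns,
          const std::string& predicate) = 0;  // empty string = nothing pushed down
};
```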

aucahuasi commented 3 years ago

These points can help us define a better strategy for the plan (for the case where the predicate push down has already been applied to the data):

  1. The logical filter step also uses interops, so I think I need to tell the filter step not to apply the filters if they were already resolved by the provider (through the predicate push down)
  2. Maybe we don't need to change the logic, but instead just make that part of the engine more flexible so we can support this feature... in the future we will have predicate push down for most of the providers (not only the SQL ones)
  3. Additionally, a good reason to leave the plan unchanged is that most plans don't have filter steps, only projections and scans... so if we do change the logic, it would only be for edge cases

cc @rommelDB @felipeblazing @williamBlazing

wmalpica commented 3 years ago

A couple of clarifications: there are LogicalProject and LogicalFilter relational algebra steps, and there are also projections and filter clauses in the BindableTableScan relational algebra step. LogicalProject and LogicalFilter both use interops. The filter clause in the BindableTableScan also uses interops and usually behaves just like the LogicalFilter relational algebra step. The projections clause in the BindableTableScan just tells the provider/parser what columns to fetch.
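
For illustration only (the table, columns, and exact syntax are made up): a query like `SELECT o_orderkey, o_totalprice FROM orders WHERE o_totalprice > 100` could be planned either with the filter as its own relational algebra step, or with the filter and projections folded into the BindableTableScan, roughly along these lines:

```
LogicalProject(o_orderkey=[$0], o_totalprice=[$2])
  LogicalFilter(condition=[>($2, 100)])
    LogicalTableScan(table=[[main, orders]])

BindableTableScan(table=[[main, orders]], filters=[[>($2, 100)]], projects=[[0, 2]])
```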

For the purposes of a SQL provider, we want the SQL provider to perform the filter from the BindableTableScan, because then there is less data to parse, which is an expensive step. And if we make the SQL provider perform the filter, then we don't want interops to do the filter as well; if we did, we would be doing a copy and an operation that are totally unnecessary.
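
As a sketch of what that could look like on the SQL provider side (the function name and signature below are invented for illustration, not BlazingSQL code), the pushed-down predicate simply becomes a WHERE clause in the query sent to the database, so rows that would be filtered out are never transferred or parsed:

```cpp
// Hypothetical sketch: fold the pushed-down predicate into the query text a
// SQL provider sends, so filtering happens server-side before any parsing.
// build_select and its parameters are invented for this example.
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

std::string build_select(const std::string& table,
                         const std::vector<std::string>& columns,
                         const std::string& pushed_down_predicate) {
    std::ostringstream sql;
    sql << "SELECT ";
    for (std::size_t i = 0; i < columns.size(); ++i) {
        sql << (i ? ", " : "") << columns[i];
    }
    sql << " FROM " << table;
    if (!pushed_down_predicate.empty()) {
        // Only rows that satisfy the predicate are transferred and parsed.
        sql << " WHERE " << pushed_down_predicate;
    }
    return sql.str();
}

// build_select("orders", {"o_orderkey", "o_totalprice"}, "o_totalprice > 100")
//   -> "SELECT o_orderkey, o_totalprice FROM orders WHERE o_totalprice > 100"
```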

We do not want to change anything about the physical plan. The SQL providers have nothing to do with a LogicalFilter relational algebra step, nor a LogicalProject.

wmalpica commented 3 years ago

What we want to do is make the filter step of the BindableTableScan skippable in a way that is generalizable and does not necessarily have to check whether it is a SQL provider.

I would suggest that we have the provider expose a more general function like filtered() that returns a bool (true if it already applied the filter). This way it's a property of the provider and doesn't necessarily have anything to do with SQL providers. Ultimately we would really need to look at the code to see what makes the most sense.
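
A minimal sketch of that suggestion, assuming invented class names (only the filtered() idea comes from the comment above; the real provider and kernel classes in BlazingSQL are different):

```cpp
// Sketch with invented class names; only the filtered() idea comes from the
// discussion. Not the real BlazingSQL provider or kernel classes.
#include <memory>

struct table_batch {};  // stand-in for the batch type flowing through the engine

class data_provider_base {
public:
    virtual ~data_provider_base() = default;
    virtual std::unique_ptr<table_batch> next_batch() = 0;
    // True if this provider already applied the scan's filter while fetching.
    virtual bool filtered() const { return false; }
};

// A provider that pushes the predicate down (e.g. a SQL provider) reports it.
class pushdown_provider : public data_provider_base {
public:
    std::unique_ptr<table_batch> next_batch() override {
        return std::make_unique<table_batch>();  // rows already match the predicate
    }
    bool filtered() const override { return true; }
};

// Inside a BindableTableScan-like kernel, the check is generic:
std::unique_ptr<table_batch> scan_batch(data_provider_base& provider) {
    auto batch = provider.next_batch();
    if (!provider.filtered()) {
        // ... evaluate the scan's filter expression with interops here ...
    }
    return batch;  // already-filtered batches skip the redundant filter and copy
}
```

The benefit of making this a property of the provider is that the kernel only asks a generic question, so any future provider that gains predicate push down is picked up without changing the scan logic.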

felipeblazing commented 3 years ago

For sure, a provider can be implemented any way the implementer deems necessary, and if the data comes in already filtered it's perfectly normal to NOT include the filter step inside of the BindableTableScan. So long as all the changes are performed within the table scan, this is fine.

aucahuasi commented 3 years ago

> And if we make the SQL provider perform the filter, then we don't want interops to do the filter as well; if we did, we would be doing a copy and an operation that are totally unnecessary.

Exactly, that is my main concern. I already have a first implementation of the predicate push down for the mysql provider, but I'm thinking we shouldn't waste resources here if the data has already been filtered and evaluated.

felipeblazing commented 3 years ago

To point 2: table scans already take filters and apply them. Whether this is done by the data provider or by using interops after the data is loaded should be decided on a case-by-case basis inside of the kernel.

aucahuasi commented 3 years ago

Great, so I would need to tell the BindableTableScan kernel that the data was already filtered, so it can avoid performing the filter step there.