alxmrs / xarray-sql

An experiment to query Xarray datasets with SQL
Apache License 2.0

Two level query plan execution #10

Open alxmrs opened 7 months ago

alxmrs commented 7 months ago

One level, the fallback, would be the prototype in #8. This should always work, but is expensive since it requires compact Xarray datasets to be unraveled.
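To make the cost of unraveling concrete, here is a minimal, stdlib-only sketch. It does not use the prototype from #8 or any Xarray API. The toy grid, the `temperature` variable, and the `unravel` helper are all hypothetical, chosen only to show how coordinate values get repeated once the data is flattened into rows.

```python
from itertools import product

# Hypothetical toy dataset: one variable on a (time, lat, lon) grid.
coords = {
    "time": [0, 1, 2],
    "lat": [10.0, 20.0],
    "lon": [100.0, 110.0, 120.0, 130.0],
}
shape = [len(v) for v in coords.values()]  # [3, 2, 4]

# Compact storage: the values in row-major order, one per grid cell.
n_cells = 1
for n in shape:
    n_cells *= n
temperature = [float(i) for i in range(n_cells)]


def unravel(coords, values):
    """Yield one flat row per grid cell, repeating coordinates in every row."""
    names = list(coords)
    # product() iterates with the last dimension varying fastest (row-major),
    # which matches the order of `values`.
    for point, value in zip(product(*coords.values()), values):
        row = dict(zip(names, point))
        row["temperature"] = value
        yield row


rows = list(unravel(coords, temperature))

# Compact form stores each coordinate once per dimension, plus the data.
compact_size = sum(shape) + n_cells
# Unraveled form stores every coordinate column in every row.
unraveled_size = len(rows) * (len(coords) + 1)
print(compact_size, unraveled_size)
```

In this toy case the flat table holds roughly three times as many values as the compact form, and the gap grows with the number of dimensions and their lengths.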

The other level would be more like xql today: it does as much preprocessing on the Dataset with xr operations as possible, then trivially unravels at the end. This implies that the SQL-on-Xarray layer should have clean interface boundaries.
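A rough, stdlib-only sketch of how the two levels could sit behind one interface, assuming a query shaped like `SELECT time, lat, value WHERE time >= 2`. Nothing here is xql or xarray-sql code. The function names (`array_level`, `fallback_level`, `execute`) and the `supported` flag are hypothetical stand-ins for whatever the planner would actually decide.

```python
from itertools import product

# Hypothetical toy grid: values[i][j] is the value at time i, lat j.
coords = {"time": [0, 1, 2], "lat": [10.0, 20.0]}
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def unravel(coords, values):
    """Flatten the grid into (time, lat, value) rows."""
    return [
        (t, lat, values[i][j])
        for (i, t), (j, lat) in product(
            enumerate(coords["time"]), enumerate(coords["lat"])
        )
    ]


def array_level(coords, values, min_time):
    """Filter on the 'time' dimension first, then unravel only what is left."""
    keep = [i for i, t in enumerate(coords["time"]) if t >= min_time]
    sub_coords = {
        "time": [coords["time"][i] for i in keep],
        "lat": coords["lat"],
    }
    sub_values = [values[i] for i in keep]
    return unravel(sub_coords, sub_values)


def fallback_level(coords, values, min_time):
    """Unravel everything, then filter row by row."""
    return [row for row in unravel(coords, values) if row[0] >= min_time]


def execute(coords, values, min_time, supported=True):
    """Use the array-level path when the query is supported, else fall back."""
    if supported:
        return array_level(coords, values, min_time)
    return fallback_level(coords, values, min_time)


fast = execute(coords, values, min_time=2)
slow = execute(coords, values, min_time=2, supported=False)
print(fast == slow, fast)
```

Both paths return the same rows. The difference is that the array-level path unravels 2 cells instead of all 6, which is the saving the second level is meant to capture.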

alxmrs commented 7 months ago

Some notes on how we could do this:

alxmrs commented 7 months ago

2 would become one level of execution.

alxmrs commented 7 months ago

How will we integrate distributed execution between the two levels? For example, the Xarray executor level would use xbeam on Dataflow, whereas the Dataframe executor would use Dask on Dataproc. Is there some way we can get both sides to execute in the same context? Or, in the distributed case, would we hand off the tasks via IO, like how Cubed breaks up each step by writing to Zarr?

alxmrs commented 7 months ago

Hmmm... it looks like Beam supports Pandas-like Dataframes.

https://beam.apache.org/documentation/dsls/dataframes/overview/