alxmrs opened 7 months ago
Some notes on how we could do this:
How will we integrate the distributed execution between the two levels? For example, the Xarray executor level would use xbeam on Dataflow, whereas the Dataframe executor would use Dask on Dataproc. Is there some way we can get both sides execution on the same context? Or, in the distributed case, would we hand off the tasks via IO, like how Cubed breaks up each step by writing to Zarr?
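To make the IO-handoff idea concrete, here's a minimal sketch (hypothetical names, `np.save` standing in for a Zarr store): stage one materializes its result to shared storage, and stage two picks it up by path alone, so the two engines never have to share an execution context.

```python
import os
import tempfile

import numpy as np

def xarray_stage(out_dir: str) -> str:
    """Stand-in for the Xarray-level executor: compute, then write to storage."""
    result = np.arange(12).reshape(3, 4) * 2.0  # pretend this is an xr operation
    path = os.path.join(out_dir, "intermediate.npy")
    np.save(path, result)  # in the real design this would be a Zarr store
    return path

def dataframe_stage(path: str) -> float:
    """Stand-in for the Dataframe-level executor: read the handoff, keep going."""
    data = np.load(path)
    return float(data.sum())

with tempfile.TemporaryDirectory() as d:
    handoff = xarray_stage(d)  # engine A writes
    total = dataframe_stage(handoff)  # engine B reads; only the path is shared
```

The price of this decoupling is the same one Cubed pays: every step boundary is a full round trip through storage.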
Hmmm... it looks like Beam supports Pandas-like Dataframes.
https://beam.apache.org/documentation/dsls/dataframes/overview/
One level, the fallback, would be the prototype in #8. This should always work, but it is expensive, since it requires compact Xarray datasets to be fully unraveled.
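A toy version of why the fallback is expensive, using plain numpy/pandas (the variable names and the 2-D grid are illustrative, not the actual #8 prototype): unraveling turns a compact gridded variable into one row per cell, so row count is the product of all dimension sizes.

```python
import numpy as np
import pandas as pd

# A compact 2-D gridded variable, shape (lat, lon).
lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0])
temp = np.arange(6, dtype=float).reshape(3, 2)

# Unravel into long format: one row per (lat, lon) cell, which is
# what a SQL/Dataframe engine expects.
lat_grid, lon_grid = np.meshgrid(lat, lon, indexing="ij")
df = pd.DataFrame({
    "lat": lat_grid.ravel(),
    "lon": lon_grid.ravel(),
    "temp": temp.ravel(),
})
# df has len(lat) * len(lon) == 6 rows; a real dataset with millions of
# cells per dimension blows up multiplicatively.
```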
The other level would be more like xql today: it does as much preprocessing on the Dataset with xr operations as possible, then trivially unravels at the end. This implies that the SQL-on-Xarray layer should have clean interface boundaries.
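Continuing the toy example above, here's roughly what the xql-style level buys (again a sketch, not xql's actual implementation): a predicate like `WHERE lat > 15` is evaluated on the compact coordinate first, and only the surviving cells are ever unraveled.

```python
import numpy as np
import pandas as pd

lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0])
temp = np.arange(6, dtype=float).reshape(3, 2)  # shape (lat, lon)

# Push the predicate down to index space while the data is still compact.
keep = lat > 15.0
lat_sub, temp_sub = lat[keep], temp[keep]

# Trivially unravel only what survived the selection.
lat_grid, lon_grid = np.meshgrid(lat_sub, lon, indexing="ij")
df = pd.DataFrame({
    "lat": lat_grid.ravel(),
    "lon": lon_grid.ravel(),
    "temp": temp_sub.ravel(),
})
# Only 2 * 2 == 4 rows are materialized, vs. 6 for the naive fallback.
```

The clean interface boundary the comment asks for is exactly the seam between the "compact selection" half and the "unravel" half of this sketch.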