alxmrs opened 7 months ago
Some notes on how we could do this:
How will we integrate the distributed execution between the two levels? For example, the Xarray executor level would use xbeam on Dataflow, whereas the Dataframe executor would use Dask on Dataproc. Is there some way we can get both sides execution on the same context? Or, in the distributed case, would we hand off the tasks via IO, like how Cubed breaks up each step by writing to Zarr?
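To make the IO-handoff idea concrete, here's a minimal sketch (hypothetical names, `np.save` standing in for a Zarr store): stage one materializes its result to shared storage, and stage two picks it up by path alone, so the two engines never have to share an execution context.

```python
import os
import tempfile

import numpy as np

def xarray_stage(out_dir: str) -> str:
    """Stand-in for the Xarray-level executor: compute, then write to storage."""
    result = np.arange(12).reshape(3, 4) * 2.0  # pretend this is an xr operation
    path = os.path.join(out_dir, "intermediate.npy")
    np.save(path, result)  # in the real design this would be a Zarr store
    return path

def dataframe_stage(path: str) -> float:
    """Stand-in for the Dataframe-level executor: read the handoff, keep going."""
    data = np.load(path)
    return float(data.sum())

with tempfile.TemporaryDirectory() as d:
    handoff = xarray_stage(d)  # engine A writes
    total = dataframe_stage(handoff)  # engine B reads; only the path is shared
```

The price of this decoupling is the same one Cubed pays: every step boundary is a full round trip through storage.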
Hmmm... it looks like Beam supports Pandas-like Dataframes.
https://beam.apache.org/documentation/dsls/dataframes/overview/
One level, the fallback, would be the prototype in #8. This should always work, but it is expensive, since it requires compact Xarray datasets to be fully unraveled.
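A toy version of why the fallback is expensive, using plain numpy/pandas (the variable names and the 2-D grid are illustrative, not the actual #8 prototype): unraveling turns a compact gridded variable into one row per cell, so row count is the product of all dimension sizes.

```python
import numpy as np
import pandas as pd

# A compact 2-D gridded variable, shape (lat, lon).
lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0])
temp = np.arange(6, dtype=float).reshape(3, 2)

# Unravel into long format: one row per (lat, lon) cell, which is
# what a SQL/Dataframe engine expects.
lat_grid, lon_grid = np.meshgrid(lat, lon, indexing="ij")
df = pd.DataFrame({
    "lat": lat_grid.ravel(),
    "lon": lon_grid.ravel(),
    "temp": temp.ravel(),
})
# df has len(lat) * len(lon) == 6 rows; a real dataset with millions of
# cells per dimension blows up multiplicatively.
```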
The other level would be more like xql today: it does as much preprocessing on the Dataset with xr operations as possible, then trivially unravels at the end. This implies that the SQL-on-Xarray layer should have clean interface boundaries.
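Continuing the toy example above, here's roughly what the xql-style level buys (again a sketch, not xql's actual implementation): a predicate like `WHERE lat > 15` is evaluated on the compact coordinate first, and only the surviving cells are ever unraveled.

```python
import numpy as np
import pandas as pd

lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0])
temp = np.arange(6, dtype=float).reshape(3, 2)  # shape (lat, lon)

# Push the predicate down to index space while the data is still compact.
keep = lat > 15.0
lat_sub, temp_sub = lat[keep], temp[keep]

# Trivially unravel only what survived the selection.
lat_grid, lon_grid = np.meshgrid(lat_sub, lon, indexing="ij")
df = pd.DataFrame({
    "lat": lat_grid.ravel(),
    "lon": lon_grid.ravel(),
    "temp": temp_sub.ravel(),
})
# Only 2 * 2 == 4 rows are materialized, vs. 6 for the naive fallback.
```

The clean interface boundary the comment asks for is exactly the seam between the "compact selection" half and the "unravel" half of this sketch.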