alxmrs / xarray-sql

An experiment to query Xarray datasets with SQL
Apache License 2.0

Distributed Execution on Beam #16

Open alxmrs opened 9 months ago

alxmrs commented 9 months ago

Figure out a way to distribute all layers of SQL execution (#10) on Apache Beam.

alxmrs commented 9 months ago

Dataframes: https://beam.apache.org/documentation/dsls/dataframes/overview/
Xarray: Xarray-Beam

alxmrs commented 8 months ago

Beam's DataFrame library supports MultiIndexes.

https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html

This alone makes Beam worth exploring sooner rather than later.

cisaacstern commented 8 months ago

Interesting!

alxmrs commented 8 months ago

Some general thoughts on this issue in no particular order:

alxmrs commented 8 months ago

This may not be feasible after all. It looks like HDF5 is intentionally not supported because it is a random-access format. I think Xarray would share this characteristic, too.

https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/dataframe/io.html

Maybe this warrants creating an Xarray-Beam-like library for pandas or Dask? Can a pandas (Multi)Index mimic an Xarray-Beam (xbeam) key?
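
To make the MultiIndex question concrete, here is a minimal sketch, assuming a chunk is a DataFrame slice whose MultiIndex levels are the dataset's dimensions. An Xarray-Beam key records per-dimension offsets of a chunk within the full dataset; the helper `chunk_offsets` below is hypothetical (not part of pandas, Beam, or Xarray-Beam) and only shows that similar offsets could be recovered from the index, given the full coordinate arrays.

```python
import numpy as np
import pandas as pd


def chunk_offsets(index: pd.MultiIndex, full_coords: dict) -> dict:
    """Hypothetical helper: start offset of a chunk along each dimension.

    Assumes each coordinate array in `full_coords` is sorted, so the chunk's
    smallest label along a level locates where the chunk begins.
    """
    offsets = {}
    for name in index.names:
        first_label = index.get_level_values(name).min()
        offsets[name] = int(np.searchsorted(full_coords[name], first_label))
    return offsets


# Toy dataset with two dimensions, flattened into a MultiIndexed DataFrame.
full_coords = {"time": np.arange(4), "lat": np.array([10.0, 20.0, 30.0])}
full_index = pd.MultiIndex.from_product(
    list(full_coords.values()), names=list(full_coords)
)
df = pd.DataFrame({"temperature": np.arange(12.0)}, index=full_index)

# A chunk covering time labels 2..3 (inclusive) and every latitude.
chunk = df.loc[2:3]
offsets = chunk_offsets(chunk.index, full_coords)
print(offsets)
```

The sketch only recovers offsets; it does not carry the set of variables or the chunk shape, so whether an index alone can stand in for a key is still an open question here.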

alxmrs commented 8 months ago

A core question to answer: do we really need random access?
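
One way to probe this: some SQL aggregations, such as an average grouped by a column, can be computed by reading each chunk once, in any order, and merging partial results. The sketch below uses plain Python with made-up data and hypothetical helper names (`partial_sums`, `merge`); it is not xarray-sql or Beam code, just an illustration of an access pattern that needs no random access.

```python
from collections import defaultdict


def partial_sums(rows):
    """Reduce one chunk of (group, value) rows to {group: (sum, count)}."""
    acc = defaultdict(lambda: (0.0, 0))
    for group, value in rows:
        total, count = acc[group]
        acc[group] = (total + value, count + 1)
    return dict(acc)


def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    out = dict(a)
    for group, (total, count) in b.items():
        prev_total, prev_count = out.get(group, (0.0, 0))
        out[group] = (prev_total + total, prev_count + count)
    return out


def grouped_mean(chunks):
    """Stream over the chunks once and return the mean per group."""
    merged = {}
    for chunk in chunks:
        merged = merge(merged, partial_sums(chunk))
    return {group: total / count for group, (total, count) in merged.items()}


# Made-up chunks, as if each came from a separate sequential read.
chunks = [
    [("north", 1.0), ("south", 3.0)],
    [("north", 2.0)],
    [("south", 5.0), ("north", 3.0)],
]
means = grouped_mean(chunks)
print(means)
```

Queries that fit this shape would not need random access; operations such as arbitrary point lookups or some joins may still need it, which is the part left to answer.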