alxmrs / xarray-sql

An experiment to query Xarray datasets with SQL
Apache License 2.0

Support SQL-Style Joins between Xarray datasets and Dask/Pandas dataframes #5

Open alxmrs opened 9 months ago

alxmrs commented 9 months ago

Here's an example workflow that I'd like to support once this feature exists. This is from Jake Wall of the Mara Elephant Project. Here, he would make use of raster and table data from Earth Engine.

Yeah, so one example is to extract an NDVI value from an IC for every GPS point recorded by an elephant. We have millions of points that get translated into features. Then a reduce operation is run on the point to get the closest n values in time to when the GPS point occurred. We then spit this back out as an array and join it with the original geopandas dataframe.

I imagine this would look like a left join from a Dask DataFrame holding the elephant coordinates to an EE ImageCollection opened with Xee via Qarray. Some details are fuzzy, such as how we'd inject a nearest-neighbor (NN) lookup (maybe this could be done via a SQL aggregation?).
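To make the shape of that join concrete, here is a minimal plain-Python sketch. It is not the xarray-sql, Xee, or Dask API: the `raster_rows` and `gps_points` records, the cell names, and both helper functions are hypothetical stand-ins, and the points are assumed to be already snapped to a raster cell. It only shows the "closest n values in time" reduction followed by a left join.

```python
from datetime import datetime

# Hypothetical raster samples: (cell_id, timestamp, ndvi). In practice these
# would come from an image collection, not a hand-written list.
raster_rows = [
    ("cell_a", datetime(2023, 1, 1), 0.30),
    ("cell_a", datetime(2023, 1, 17), 0.40),
    ("cell_a", datetime(2023, 2, 2), 0.50),
    ("cell_b", datetime(2023, 1, 1), 0.10),
]

# Hypothetical GPS fixes, assumed to be already snapped to a raster cell.
gps_points = [
    {"id": 1, "cell": "cell_a", "time": datetime(2023, 1, 15)},
    {"id": 2, "cell": "cell_b", "time": datetime(2023, 1, 3)},
    {"id": 3, "cell": "cell_c", "time": datetime(2023, 1, 3)},  # no raster match
]


def nearest_in_time(rows, when, n):
    """Return the n rows whose timestamps are closest to `when`."""
    return sorted(rows, key=lambda r: abs((r[1] - when).total_seconds()))[:n]


def left_join_ndvi(points, raster, n=2):
    """Left join: every point is kept; unmatched points get ndvi=None."""
    by_cell = {}
    for row in raster:
        by_cell.setdefault(row[0], []).append(row)

    joined = []
    for point in points:
        matches = nearest_in_time(by_cell.get(point["cell"], []), point["time"], n)
        # The "aggregation" step: average the n nearest-in-time values.
        mean = sum(m[2] for m in matches) / len(matches) if matches else None
        joined.append({**point, "ndvi": mean})
    return joined


result = left_join_ndvi(gps_points, raster_rows, n=2)
for row in result:
    print(row["id"], row["ndvi"])
```

In SQL terms, the per-point averaging plays the role of the aggregation mentioned above, and keeping point 3 with a null value is what makes it a left join rather than an inner join.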

In general, I think there is broad demand for joining raster and tabular data. Down the line, I bet we could implement geo-aware joins that make use of geometry.

alxmrs commented 9 months ago

This should be possible to demo once #8 is complete. If we figure this out, we should document it in the README.

alxmrs commented 2 months ago

I’ve been reading more about how this is done today. The best example I can find for joining rasters with point data (and vectors) uses a hierarchical spatial index like H3 or S2.

https://github.com/uber/h3-py-notebooks/blob/master/notebooks/unified_data_layers.ipynb

I wonder if this is the technique that underpins Fused.io.
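As a rough illustration of the index-based approach, the sketch below keys both raster pixels and points by a shared cell ID and then joins on that key. It deliberately does not use the real H3 or S2 libraries: `cell_id` is a hypothetical fixed-size lat/lon grid standing in for a true hierarchical cell ID, and the pixel and point values are made up.

```python
import math
from collections import defaultdict

CELL_SIZE_DEG = 0.5  # hypothetical grid resolution, in degrees


def cell_id(lat, lon, size=CELL_SIZE_DEG):
    """Map a coordinate to a (row, col) key; a toy stand-in for an H3/S2 cell ID."""
    return (math.floor(lat / size), math.floor(lon / size))


# Hypothetical raster pixels as (lat, lon, value) at pixel centers.
raster_pixels = [
    (-1.25, 35.25, 0.42),
    (-1.25, 35.75, 0.55),
    (-1.75, 35.25, 0.31),
]

# Hypothetical point records.
points = [
    {"id": "e1", "lat": -1.30, "lon": 35.10},
    {"id": "e2", "lat": -1.10, "lon": 35.90},
    {"id": "e3", "lat": 2.00, "lon": 10.00},  # falls outside the raster
]

# Index the raster side by cell so the join becomes a key lookup.
raster_by_cell = defaultdict(list)
for lat, lon, value in raster_pixels:
    raster_by_cell[cell_id(lat, lon)].append(value)

# Left join points to raster values on the shared cell key.
joined = []
for point in points:
    values = raster_by_cell.get(cell_id(point["lat"], point["lon"]))
    mean = sum(values) / len(values) if values else None
    joined.append({**point, "value": mean})

for row in joined:
    print(row["id"], row["value"])
```

Once both sides carry the same cell key, the spatial join reduces to an ordinary equi-join, which is what makes this approach attractive for a SQL layer.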

alxmrs commented 2 months ago

For non-geospatial data, could we use a k-d tree to create a hierarchical index? 🤔
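A minimal sketch of what that could look like, assuming a nearest-neighbor join in an arbitrary two-dimensional feature space. The `Node` class, the `build` and `nearest` functions, and the sample data are all hypothetical and written from scratch in plain Python; a real implementation would more likely reuse an existing k-d tree library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    point: Tuple[float, ...]
    payload: str
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items, depth=0):
    """Build a k-d tree from (point, payload) pairs by splitting on the median."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    return Node(
        point=items[mid][0],
        payload=items[mid][1],
        axis=axis,
        left=build(items[:mid], depth + 1),
        right=build(items[mid + 1:], depth + 1),
    )


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the node closest to `target`, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best.point, target):
        best = node
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(best.point, target):
        best = nearest(far, target, best)
    return best


# Hypothetical reference rows (feature vector, label) and query rows.
reference = [((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((9.0, 1.0), "c"), ((2.0, 8.0), "d")]
queries = [(1.5, 0.5), (8.0, 2.0), (3.0, 7.0)]

tree = build(reference)
joined = [(query, nearest(tree, query).payload) for query in queries]
print(joined)
```

The tree is built once over the reference side, so each query row finds its match without scanning every reference row.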

alxmrs commented 2 months ago

This podcast episode is incredibly validating of the use case that this library (and issue) solves.

https://overcast.fm/+AAU1XJb7r0Y/6:55

alxmrs commented 1 month ago

https://github.com/DahnJ/H3-Pandas

This gives me more confidence that an index system (geospatial via S2 and H3, or pre-computed via k-d trees) would be a good integration. To me, this is proof of demand for such features.