ContinuumIO / earthio

Data reader utilities for machine learning on satellite imagery and Earth science data
32 stars 17 forks source link

Review / modify earthio/tif.py as it relates to xarray rasterio #17

Closed PeterDSteinberg closed 7 years ago

PeterDSteinberg commented 7 years ago

This PR 1260 in xarray added a rasterio backend. We need to see how this is similar or not to earthio/tif.py which uses rasterio for reading GeoTiff files.

gbrener commented 7 years ago

Worth noting that once PR https://github.com/bokeh/datashader/pull/375 is merged, datashader will rely on xarray's new rasterio backend. Since I just did the integration work there, I'll start adding comments to this issue to summarize my observations.

gbrener commented 7 years ago

My high-level observations are summarized below:

[earthio/tif.py]() xarray/backends/rasterio_.py
Returns Returns ElmStore, which references xr.DataArray objects Returns xr.DataArray
Inputs Works for single files or directory of TIFs Works for single files
Operation ElmStore wraps xr.DataArray objects which are created from rasterio.Dataset.read(.., window=..) xr.DataArray created from lazy-loaded RasterioArrayWrapper which uses rasterio.Dataset.read(..., window=..)
Metadata Preserves most of the TIF's metadata Only stores crs attribute
Memory usage Reads data into memory depending on window kwarg Reads data into memory depending on chunks kwarg (see LazyIndexedArray on top of RasterioArrayWrapper)
Additional features Supports geotransform via earthio, and anything exposed by rasterio.open and rasterio.read methods Supports chunking, caching, and concurrent access

Based on this, I think there's a lot of overlap - both are essentially achieving the same goal, although xarray seems to already be using dask to do a subset of what earthio does now without dask. Rather than add another xarray backend I think it would save us time to update tif.py to use the xarray rasterio backend instead of the rasterio API (similar to what was done with datashader), but make improvements to the xarray backend so that more metadata attributes are exposed (notice that currently we can only see the crs). This has the added benefit that we could keep the earthio geotransform logic decoupled from xarray, although there's also the drawback that we won't be able to use the rasterio API functions (namely the index, and read(..., window=..) functionality). The directory-reading functionality could be included in the logic layer above the xarray backend. What do you think @PeterDSteinberg ?

PeterDSteinberg commented 7 years ago

@gbrener Thank you for the summary - that helps. I agree this issue should become a modification of the xarray backend for tif. Can you raise an issue there, pointing back to your table of advantages/disadvantages above and the team there will probably go for the xarray changes you describe. I agree on your idea for separation of concerns (spatial logic in this repo but still making some changes in xarray). The xarray backend can stay geared towards 1 Tif file at a time, returning a DataArray and the earthio wrapper around it can call that on a directory of Tif files as is done now.

gbrener commented 7 years ago

@PeterDSteinberg I just created an issue on xarray which requests the additional attributes alluded to in the table above (https://github.com/pydata/xarray/issues/1456) and opened a PR to address it.

gbrener commented 7 years ago

The PR to add additional metadata attributes was merged into xarray. I created an interim dev release of xarray and hosted the packages on my channel, to tie over datashader development until xarray v0.9.7 comes out. The dev release can be installed with:

conda install -c gbrener xarray

The interim dev package version should be 0.9.6.dev2.

PeterDSteinberg commented 7 years ago

Closing this as we are moving it to a documentation need (#36)