corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray
Other
507 stars 80 forks source link

Dask Delayed not returned from to_raster with compute=False #664

Open jjwindow opened 1 year ago

jjwindow commented 1 year ago
import rioxarray as rxr
import dask

# Open raster - problem is independent of raster source
raster = rxr.open_rasterio("raster_example.tif", chunks=True)
# Raster is split into chunks with dask as expected

# Write to disk using dask
job = raster.rio.to_raster("write_example.tif", compute=False)
# job is None, write_example.tif written anyway.
# Setting compute=True does the same thing.
# Explicit delay works fine
explicit_job = dask.delayed(raster.rio.to_raster)("write_example.tif")
dask.compute(explicit_job)  # Works normally

# Providing a lock as well as compute=False returns delayed
from threading import Lock
job = raster.rio.to_raster("write_example.tif", lock=Lock(), compute=False)  # Returns delayed

Problem description

The compute keyword in to_raster seems to be ignored in my installation, unless a lock is provided. The documentation does not mention that returning a delayed is conditional on lock being provided. From the docs, the expected behaviour is that compute=False always returns a delayed object, which is preferable when creating multiple delayed tasks to avoid explicitly making many instances of threading.Lock.

Environment Information

This example was recreated on Windows 11 using conda 23.1.0, in a conda-managed virtual environment. The environment was built with conda create -n name -c conda-forge rioxarray dask and has rioxarray==0.14.0, dask=2023.3.2. This problem has occurred for me on a different machine running Windows 10 and older versions of conda, rioxarray and dask.

snowman2 commented 1 year ago

@djhoese, do you have any thoughts?

snowman2 commented 1 year ago

Related #219

djhoese commented 1 year ago

Yeah this makes a lot of sense and at first I agreed with @jjwindow until I started re-reading the code. The unfortunate if statement is here:

https://github.com/djhoese/rioxarray/blob/413aef31e27f32a90b9f82e0731f847595869e54/rioxarray/raster_writer.py#L179

This could maybe be adjusted to say if is_dask_collection(xarray_dataarray.data) and (lock or not compute):. However, the Delayed object returned by dask needs to have that lock object already. I don't think there is a way to give the Delayed object a lock after the fact. If someone can correct me on that then I'm fine with the above if statement change.

If this is changed the docstrings updated in #219 would all need to be updated to somehow succinctly say that lock and compute control whether or not dask is used. They would also need to mention that if compute=False you still need to use a lock and how to provide it.