Discussion: how to handle nodata

alexgleith commented 10 months ago

We currently have a mix of ways of handling nodata annotation in xarrays.

Some code is using rio-xarray and some is using the convention of storing the nodata value on xr.atts["nodata".

We should ensure we consume and produce the nodata metadata the same, so we don't need multiple ways of checking, asserting, setting, and consuming nodata values.

jessjaco commented 10 months ago

Yeah it looks like odc looks for a nodata attribute and rioxarray looks for a .rio.nodata value so I'm inclined to support each. I will check if rioxarray respects the .nodata attribute.

I do want to clean up the loaders, they're kind of a mess right now. But part of that is pending this convo

jessjaco commented 9 months ago

I'm trying to work through the current handling of nodata values in the OdcLoaderMixin, and understand how odc.stac.load handles things at the same time.

It looks like odc.stac.load sets the nodata attribute at the variable level, not the dataset level. This makes sense, as different variables can have different nodata values (see qa_pixel vs the other landsat bands). Which makes this assignment in the mixin redundant, I think:

https://github.com/digitalearthpacific/dep-tools/blob/e2ecdce168e18556ddd80ca7933345ea6538df40/dep_tools/loaders.py#L182-L183

But I think this may be used in places, so we should check before removing.

So if the metadata exists or the nodata argument is passed, then each variable will have a .nodata attribute. I don't know what happens if neither occurs.

I would like to maintain compatibility with rioxarray, which describes its approach to nodata here. They too keep track of nodata at the DataArray / variable level. Additionally, the .rio.nodata accessor will read .attrs['nodata'] so I think we will be ok maintaining that as the goto. My only concern (and something worth testing) is if .attrs['nodata'] is set and .rio.write_nodata is called with another value, does .attrs['nodata'] get overwritten, or are they at odds?

jessjaco commented 9 months ago

Another thing to consider is xarray. Officially, xarray only supports nan as the nodata value.

An additional concern is propagation of .attrs['nodata'] (or other attributes). Even innocuous operations like da * 3 will erase the attribute (see https://docs.xarray.dev/en/stable/user-guide/computation.html#missing-values). At they link, they do suggest changing one of the global xarray options (keep_attrs) to override this behavior, though we have to decide if we want to override a default option as a matter of course! Also see rioxarray's discussion of this.

alexgleith commented 9 months ago

There's a mix of compatibility in our current implementation!

I know that odc-stac handles nodata per-band, and you can pass in information to help (this kind of config stuff). We should just hardcode that for S-2 and Landsat probably, and then once it's done, we can just assume it's there.

Xarray is hard to influence... they do a few things that break the work we're doing. Some operations lose the CRS, for example. Being consistent with keep_attrs is a good idea...

Maybe it's worth coming up with a consolidated picture between us, and then checking in with others like Robbi and Kirill and influencing in the right place to encourage broader consistency! That's a longer term goal, but will benefit many.

the .rio.nodata accessor will read .attrs['nodata'] so I think we will be ok maintaining that as the goto

This sounds good to me.

jessjaco commented 9 months ago

Right, I intend to take out a lot of our nodata handling, because odc.stac.load (and stackstac, if I ever update that interface) already handle it in a fairly robust manner. In fact, I am planning to take out of a lot of our wrappers around odc.stac.load, because they don't really help.

The only operation I'd like to keep is if data are loaded as a float type (potentially non-nan) float values are converted to nan. I may make that optional. Otherwise it's up to us to support propagation of non-nan nodata attribtues. I'll do my best, but I don't really work with non-float xarray objects, until I'm at or very close to writing.

alexgleith commented 9 months ago

In fact, I am planning to take out of a lot of our wrappers around odc.stac.load

Good idea :-)

I don't really work with non-float xarray objects

I think we have to in some situations, unfortunately. I really don't want to write intermediate files, and as such, want to keep everything in memory and saving half of a big number of memory makes a significant difference! The GeoMAD for Sentinel-2 for example keeps everything as uint16 and still uses something like 150 GB of memory. The Coastlines process I'm running in Vietnam is similar, at the most extreme case (hilariously with 1 year of data for S-2 and ~35 years with Landsat!)

So, I'm pretty firm on us needing both an int and a float path.

jessjaco commented 9 months ago

Well it's a matter of approach too. I almost never bring big float arrays into memory, just chunks. (That was the biggest reason I ended up modifying a lot of the existing coastlines code)

digitalearthpacific / dep-tools

Discussion: how to handle nodata #31