Open alexgleith opened 10 months ago
Yeah it looks like odc looks for a nodata attribute and rioxarray looks for a .rio.nodata
value so I'm inclined to support each. I will check if rioxarray respects the .nodata attribute.
I do want to clean up the loaders, they're kind of a mess right now. But part of that is pending this convo
I'm trying to work through the current handling of nodata values in the OdcLoaderMixin
, and understand how odc.stac.load
handles things at the same time.
It looks like odc.stac.load
sets the nodata attribute at the variable level, not the dataset level. This makes sense, as different variables can have different nodata values (see qa_pixel vs the other landsat bands). Which makes this assignment in the mixin redundant, I think:
But I think this may be used in places, so we should check before removing.
So if the metadata exists or the nodata
argument is passed, then each variable will have a .nodata
attribute. I don't know what happens if neither occurs.
I would like to maintain compatibility with rioxarray, which describes its approach to nodata here. They too keep track of nodata at the DataArray / variable level. Additionally, the .rio.nodata
accessor will read .attrs['nodata']
so I think we will be ok maintaining that as the goto. My only concern (and something worth testing) is if .attrs['nodata']
is set and .rio.write_nodata
is called with another value, does .attrs['nodata']
get overwritten, or are they at odds?
Another thing to consider is xarray. Officially, xarray only supports nan
as the nodata value.
An additional concern is propagation of .attrs['nodata']
(or other attributes). Even innocuous operations like da * 3
will erase the attribute (see https://docs.xarray.dev/en/stable/user-guide/computation.html#missing-values). At they link, they do suggest changing one of the global xarray options (keep_attrs
) to override this behavior, though we have to decide if we want to override a default option as a matter of course! Also see rioxarray's discussion of this.
There's a mix of compatibility in our current implementation!
I know that odc-stac
handles nodata per-band, and you can pass in information to help (this kind of config stuff). We should just hardcode that for S-2
and Landsat
probably, and then once it's done, we can just assume it's there.
Xarray is hard to influence... they do a few things that break the work we're doing. Some operations lose the CRS, for example. Being consistent with keep_attrs
is a good idea...
Maybe it's worth coming up with a consolidated picture between us, and then checking in with others like Robbi and Kirill and influencing in the right place to encourage broader consistency! That's a longer term goal, but will benefit many.
the .rio.nodata accessor will read .attrs['nodata'] so I think we will be ok maintaining that as the goto
This sounds good to me.
Right, I intend to take out a lot of our nodata handling, because odc.stac.load (and stackstac, if I ever update that interface) already handle it in a fairly robust manner. In fact, I am planning to take out of a lot of our wrappers around odc.stac.load, because they don't really help.
The only operation I'd like to keep is if data are loaded as a float type (potentially non-nan) float values are converted to nan. I may make that optional. Otherwise it's up to us to support propagation of non-nan nodata attribtues. I'll do my best, but I don't really work with non-float xarray objects, until I'm at or very close to writing.
In fact, I am planning to take out of a lot of our wrappers around odc.stac.load
Good idea :-)
I don't really work with non-float xarray objects
I think we have to in some situations, unfortunately. I really don't want to write intermediate files, and as such, want to keep everything in memory and saving half of a big number of memory makes a significant difference! The GeoMAD for Sentinel-2 for example keeps everything as uint16
and still uses something like 150
GB of memory. The Coastlines process I'm running in Vietnam is similar, at the most extreme case (hilariously with 1 year of data for S-2 and ~35 years with Landsat!)
So, I'm pretty firm on us needing both an int
and a float
path.
Well it's a matter of approach too. I almost never bring big float arrays into memory, just chunks. (That was the biggest reason I ended up modifying a lot of the existing coastlines code)
We currently have a mix of ways of handling
nodata
annotation in xarrays.Some code is using
rio-xarray
and some is using the convention of storing thenodata
value onxr.atts["nodata"
.We should ensure we consume and produce the nodata metadata the same, so we don't need multiple ways of checking, asserting, setting, and consuming nodata values.