ZGIS / semantique

Semantic Querying in Earth Observation Data Cubes
https://zgis.github.io/semantique/
Apache License 2.0

Add option to drop all-nan dimension slices during data retrieval #34

Closed: augustinh22 closed this issue 1 month ago

augustinh22 commented 11 months ago

Description

Recipe execution retrieves, and then continues to drag along through the entire analysis, all timesteps for which every value is no data (i.e. nan).

In certain situations this produces results with many nan values. A typical case is a Sentinel-2 imagery granule observed by multiple acquisition orbits and loaded via an opendatacube connection: every observation covering any part of the granule that contains the AOI is retrieved, but returns only "nan" values if no valid values are present there (i.e. the AOI lies outside the acquisition swath, yet still within the granule/image bounds).

Reproducible example

import semantique as sq
import datacube as dc
import json
import geopandas as gpd
# Connect to the cube
s2c = dc.Datacube()
dcconfig = {
    "value_type_mapping": {"categorical": "ordinal", "continuous": "numerical"},
    "resamplers": {
        "nominal": "nearest",
        "ordinal": "nearest",
        "binary": "nearest",
        "numerical": "nearest",
        "categorical":"nearest",
        "continuous": "bilinear"
    }
}
with open("../layouts/factbase_austria.json", "r") as file:
    cube = sq.datacube.Opendatacube(json.load(file), connection = s2c, **dcconfig)
# Spatial and temporal extents
time = sq.TemporalExtent("2021-09-01", "2021-09-30")
space = sq.SpatialExtent(gpd.read_file("area-of-interest_id68051.geojson"))
# Construct the mapping
mapping = sq.mapping.Semantique()
mapping["entity"] = {}
mapping["entity"]["vegetation"] = {}
mapping["entity"]["vegetation"]["color"] = sq.appearance("Color type").evaluate("in", [sq.label("SVHNIR"), sq.label("SVLNIR"), \
                                                                                      sq.label("AVHNIR"), sq.label("AVLNIR"),\
                                                                                      sq.label("WV"), sq.label("SHV"),\
                                                                                      sq.label("SHRBRHNIR"), sq.label("SHRBRLNIR"),\
                                                                                      sq.label("HRBCR"), sq.label("WR")])
# Wrap up the context
context = {
    "datacube": cube,
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035,
    "tz": "UTC",
    "spatial_resolution": [-10, 10] 
}
# Recipe
recipe = sq.QueryRecipe()
recipe["vegetation_presence"] = sq.entity("vegetation").groupby(sq.self().extract("space", "feature"))
response = recipe.execute(**context)

The above query uses the following geojson, referred to as "area-of-interest_id68051.geojson". It happens to be an AOI in a Sentinel-2 granule (33UUP) that is observed by 3 different acquisition swaths, but the AOI is located in the middle, where only one swath captures it:

{"type": "FeatureCollection", "features": [{"type": "Feature", "properties": {"name": "Area-of-interest 00"}, "geometry": {"type": "Polygon", "coordinates": [[[13.011199, 47.896442], [13.011199, 47.90036], [13.018576, 47.90036], [13.018576, 47.896442], [13.011199, 47.896442]]]}}]}

This means that querying the vegetation entity from 2021-09-01 until 2021-09-30 yields 17 timesteps, even though 12 of them are completely empty and full of "nan" values BEFORE the analysis ever begins:

[screenshot: recipe result with 17 timesteps, 12 of them entirely nan]

While it is possible that "nan" values could somehow be introduced by a recipe itself, that is not what happened in this case. If we apply a dropna over time wherever all observations are "nan", we get a result with only 5 timesteps:

[screenshot: the same result after dropping all-nan timesteps, leaving only 5]
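
For reference, the manual workaround described above can be sketched directly on the recipe output. This is a minimal sketch, assuming the executed response behaves like a mapping of xarray objects and that the result exposes a "time" dimension (with the groupby in the example it may instead be a collection of such arrays, in which case the same call applies per group):

# Manual workaround (sketch): drop timesteps in which every pixel is nan.
vegetation = response["vegetation_presence"]
vegetation = vegetation.dropna("time", how = "all")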

Expected behavior

Timesteps that have no valid observations (i.e. where all values are "nan") are dropped right after being loaded from an OpenDataCube connection, and are thus explicitly excluded from being dragged along the analysis chain.

(Note: this may, however, be undesired in the name of keeping functionality generic and comparable across connection types going forward. Open for discussion.)

Environment information

Example in sen2cube.at inferences

Additional context

This used to be included in a much earlier, toddler version of semantique known as the "inferenceengine", using xarray.Dataset.dropna: https://github.com/whisperingpixel/iq-inferenceengine/blob/2846641407cf9a997b0e306e2f5e86db22bfcfb7/inferenceengine/tools/odctools.py#L56

It could be implemented using xarray.DataArray.dropna(), perhaps somewhere around here: https://github.com/ZGIS/semantique/blob/80542d2ca95b046ced7f765446c21ce783670d23/semantique/datacube.py#L363C43-L363C43
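
A rough, hypothetical sketch of what that could look like at the end of the retrieval step; the variable name `data` and its exact placement are assumptions about the internals, not the actual semantique code:

# Hypothetical placement inside the ODC retrieval routine: right after
# loading and masking, drop all time coordinates that hold only nan values.
if "time" in data.dims:
    data = data.dropna("time", how = "all")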

augustinh22 commented 11 months ago

@luukvdmeer the real question is:

is there a specific, intentional reason why nan values are no longer dropped, or is it a bug?

luukvdmeer commented 11 months ago

I don't completely understand how the all-nan timesteps occurred, but in any case, the trim verb can be used to get rid of all-nan slices along a given dimension. In this way you have full control over it, rather than having the data retriever drop dimension values without you being aware of it.
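
For illustration, a minimal sketch of how the trim verb could be plugged into the recipe from above; the exact call signature (an optional dimension argument) is an assumption based on the other semantique verbs:

# Sketch: trim all-nan slices along the time dimension before grouping.
recipe = sq.QueryRecipe()
recipe["vegetation_presence"] = sq.entity("vegetation") \
    .trim("time") \
    .groupby(sq.self().extract("space", "feature"))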

whisperingpixel commented 11 months ago

A bit more context:

The problem is that the granule footprints are indexed, not the valid-data footprints. Thus, over the time series, an AOI may sometimes be completely within the image footprint, sometimes partially, and sometimes not at all:

[screenshot: AOI versus image footprints across the time series]

Here, in the second example, there would be an all-empty array, which is not useful/correct. In the "old" inference engine, the "dropna" function was applied after loading. I would prefer this for 2 reasons:

1. In the best case, the storage backend would simply not provide these images, but we cannot rely on this.
2. It should therefore be, in my opinion, at least a (configurable) default to drop empty slices after loading the data.
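
As a sketch of what such a configurable default could look like on the user side; the parameter name below is purely hypothetical at this point in the thread (see the resolution further down for what was actually implemented):

import semantique as sq
import datacube as dc
import json
# Hypothetical flag (illustrative name only): drop all-nan slices
# immediately after loading the data.
s2c = dc.Datacube()
dcconfig = {"drop_empty_slices": True}
with open("../layouts/factbase_austria.json", "r") as file:
    cube = sq.datacube.Opendatacube(json.load(file), connection = s2c, **dcconfig)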

dtiede commented 11 months ago

Agree with Martin's argumentation; I only wanted to mention that trim works - the curves then even indicate where the observation was:

[screenshot: plotted time series after trim, with the curves indicating where the observations were]

It does not solve the data-volume issue though, since it is applied at the end...

luukvdmeer commented 10 months ago

Yes, that makes sense; this should be easy to implement.

luukvdmeer commented 7 months ago

@whisperingpixel @augustinh22 @dtiede The datacube classes in semantique now have the configuration parameter "trim". If True, each retrieved array will be trimmed, meaning that all dimension coordinates with only missing/invalid data are removed. For the spatial dimensions this happens only at the edges, to maintain regularity.

After your explanations I set the default for this parameter to True; I think that makes the most sense. Therefore, nothing needs to change in Sen2Cube (except updating semantique :)

To get the "old" behaviour:

import semantique as sq
import datacube as dc
import json
# Connect to the cube
s2c = dc.Datacube()
dcconfig = {
    "trim": False
}
with open("../layouts/factbase_austria.json", "r") as file:
    cube = sq.datacube.Opendatacube(json.load(file), connection = s2c, **dcconfig)
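
And since trimming is now enabled by default, a fresh connection needs no trim-related configuration at all; a minimal sketch:

import semantique as sq
import datacube as dc
import json
# With the new default (trim = True), retrieved arrays are trimmed
# automatically, so no extra configuration is required.
s2c = dc.Datacube()
with open("../layouts/factbase_austria.json", "r") as file:
    cube = sq.datacube.Opendatacube(json.load(file), connection = s2c)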