Open-EO / openeo-python-client

Python client API for OpenEO
https://open-eo.github.io/openeo-python-client/
Apache License 2.0
146 stars 37 forks source link

datacube.dimension_labels() returns datacube instead of array of strings #447

Open theroggy opened 1 year ago

theroggy commented 1 year ago

DataCube.dimension_labels() now seems to return a DataCube in the python client: DataCube.dimension_labels

I think it should return an array of strings (e.g. the band names for dimension "bands"). According to the general process documentation this should also be the case: processes.dimension_labels

soxofaan commented 1 year ago

Hi, interesting point you make here.

It's indeed confusing that DataCube.dimension_labels in Python client returns another DataCube object, which makes little sense. On the other hand, a DataCube in the python client is not a concrete data cube but a wrapper around "virtual" process graph representation. The data cube only becomes concrete at the back-end side, when "executing" the virtual data cube (e.g. by starting a batch job from it). Likewise the DataCube methods (in general) do not or can not return concrete data, like a concrete list of band names as you suggest.

That being said, a minimal improvement we can do here is not returning a DataCube, but another class/object that makes clear it's a non-concrete/virtual wrapper for something that can be executed to, e.g. a list.

soxofaan commented 1 year ago

Another thing: if you really want to inspect e.g. the bands from a given DataCube without first executing anything on the back-end, you can use the metadata attribute of a data cube. E.g.

>>> cube.metadata.band_dimension
BandDimension(type='bands', name='bands', bands=[Band(name='VH', common_name=None, wavelength_um=None, aliases=None, gsd=None), Band(name='VV', common_name=None, wavelength_um=None, aliases=None, gsd=None)])
theroggy commented 1 year ago

Strictly speaking I don't need the band names client-side. I want to loop over all bands to apply a specific linear_scale_range depending on the band name, but earlier on in the script I already add and/or rename some bands depending on some other logic, so it takes some extra code to keep track of the bands available in the cube.

Using the metadata object indeed seems like a nice workaround, or even a good solution. I applied it already in the sample code below:

    for band in [band.name for band in cube.metadata.band_dimension.bands]:
        if band.startswith("B02") or band.startswith("B03") or band.startswith("B04"):
            band_rescaled = cube.filter_bands(band).linear_scale_range(0, 2000, 0, 255)
        elif band.startswith("B08"):
            band_rescaled = cube.filter_bands(band).linear_scale_range(0, 5000, 0, 255)
        else:
            raise ValueError(f"rescale not implemented for band {band}")

Normally I think it would also be possible to keep everything server side by using DataCube.dimension_labels("bands"), and then possibly use text_begins instead of .startswith to do the band checks because this would return a Processbuilder (or something similar) rather than a real string array... but I don't think this is strictly better than your suggestion using metadata for my specific case?

Possibly there is a cleaner way of doing this, without using an explicit loop... but I don't know how.

soxofaan commented 1 year ago

That snippet looks fine at first sight. However, it's important to know what you do with the produced band_rescaled values . I guess you left them out of the snippet to keep it compact. If you do something like merge_cubes after this, that should be fine.

Normally I think it would also be possible to keep everything server side by using DataCube.dimension_labels("bands"), and then possibly use text_begins instead

Again it depends on what you want to do in detail, but I'm afraid this will be a dead end as openEO process graphs can not express everything we're used from languages like Python (there are no for-loops for example).

Possibly there is a cleaner way of doing this, without using an explicit loop... but I don't know how.

If it's just about applying a different scaling per group of bands as your snippet suggests, it can be done a bit more efficient as follows:

bands = [band.name for band in cube.metadata.band_dimension.bands]

bands2000 = [b for b in bands if any(b.startswith(p) for p in ["B02", "B03", "B04"])]
cube2000 = cube.filter_bands(bands2000).linear_scale_range(0, 2000, 0, 255)

bands5000 = [b for b in bands if any(b.startswith(p) for p in ["B08"])]
cube5000 = cube.filter_bands(bands5000).linear_scale_range(0, 5000, 0, 255)

scaled_cube = cube2000.merge_cubes(cube5000)
jdries commented 1 year ago

I would avoid an extra merge cubes, and go for an apply_dimension over bands. In apply_dimension, you will need indices, so you'll want to convert list of bands to list of indices and then perform scaling accordingly.

theroggy commented 1 year ago

I indeed merge the rescaled bands back together with merge_cubes().

Thanks for the advice!!!

I wonder... I don't seem to find any documentation pointing to cube.metadata. Is this work in progress, is it not really adviced to use it or didn't I search well enough?

soxofaan commented 1 year ago

I wonder... I don't seem to find any documentation pointing to cube.metadata. Is this work in progress, is it not really adviced to use it or didn't I search well enough?

it's a public attribute, and we won't remove/change it overnight, so you're free to use it. It's indeed not documented at https://open-eo.github.io/openeo-python-client/api.html because, roughly speaking, it wasn't intended to be an abstraction level a normal user should interact with. However, the use case discussed above, "get band names of a cube through cube.metadata.band_dimension()", is a valid question, so we should indeed expose this in the docs