SRD-07 data retrieval from openEO: raw bands, based on samples

jdries commented 2 years ago

Inputs:

geodataframe or geojson?
time range
projection + resolution

mlubej commented 2 years ago

I've been playing around with this and trying to make it work in openEO via the VITO backend.

My starting point was a geodataframe of 10 parcels (to start simple) which are relatively close together:

This is the code used to get to the final datacube:

# set extent from the geodataframe
minx, miny, maxx, maxy = gdf.unary_union.envelope.bounds
spat_ext = dict(west=minx, east=maxx, north=maxy, south=miny, crs=gdf.crs.to_epsg())
temp_ext = ["2019-01-01", "2019-12-31"]

# link to collection, specify bands
s2 = connection.load_collection(
    'SENTINEL2_L2A_SENTINELHUB',
    spatial_extent=spat_ext,
    temporal_extent=temp_ext,
    bands=["B04","B08","SCL"]
)

# filter to specific geometries
s2 = s2.filter_spatial(geometries = json.loads(gdf.geometry.to_json()))

# SCL cleanup/filtering
s2 = s2.process("mask_scl_dilation", data=s2, scl_band_name="SCL")

# calculate NDVI
s2 = s2.ndvi(red="B04", nir="B08", target_band='NDVI')

# download the data and load it
job = s2.execute_batch("cube.nc", out_format="netCDF")
ds = xarray.load_dataset('./cube.nc')

Observations:

If I understand correctly, the above process filters the data after it downloads it, which is not optimal, since it would be better to filter already only the area of interest while downloading.
for 2/10/100 parcels it takes 5/6/19 min to execute the datacube pipeline for a single year (2019)
- there seems to be some initialization period regardless of the amount of data you're downloading
- it seems to me this is not so optimal and makes me question a bit if I'm doing things right, because this might prove to be a bottleneck when downloading for a large number of parcels (order of 1M) and spanning a larger spatial extent
- should any parallelization be implemented on the user side?
didn't work for any CRS other than 4326 (WGS 84)

The steps from this point (having a .nc dataset on disk) start to differ now w.r.t. to the approach we would be taking:

pixel based
parcel based

Pixel-based approach

The goal here would be to obtain a pandas dataframe, where each row represents a timeseries for a single pixel, and where only the pixels from the area of interest are taken, but one should still be able to subset the dataset to a specific geometry - spatial context should be available.

I'm not sure what the proper way of doing this is, but my initial approach took me down the rioxarray path, where the idea was to first spatially clip the xarray to the extent of the geometry of interest and then convert the xarray to a dataframe.

Another way would be to rasterize the geometry mask into the xarray and then convert it, but I guess you would need to add a timeless feature to the xarray, haven't tried that yet.

Parcel-based approach

This one is a bit more manageable, because you can just add the spatial aggregation function to the datacube pipeline.

However, this requires re-downloading the data, where the spatial aggregation is done on the fly. It would make much more sense to just run the aggregation on the pixel-level dataframe if it is already available.

If the pixel-level dataset is not available, then this is fine and can be achieved via:

from openeo.rest.conversions import timeseries_json_to_pandas

aggregated_data_json = s2.aggregate_spatial(geometries=json.loads(gdf.geometry.to_json()), reducer='mean').execute()
timeseries_json_to_pandas(aggregated_data_json)

This returns an output in the following format:

Which means that we would need to reshape the dataframe a bit to get the row-based format which is easier to work with.

thoughts/comments/suggestions?

jdries commented 1 year ago

The key thing here is the 'sample_by_feature' setting: https://open-eo.github.io/openeo-python-client/cookbook/sampling.html#dataset-sampling

This will give you one netcdf per parcel, containing all pixels, masked to the geometry. So from there you can indeed compute aggregates easily. For working with aggregate_spatial, I recommend either CSV or netcdf output formats, which are more easy to handle than the default json.

Features are sent in geojson, the official spec only supports WGS84. There's indeed some performance considerations to check as well, we have been running it ourselves up to 100000 parcels for instance.

mlubej commented 1 year ago

The key thing here is the 'sample_by_feature' setting

For working with aggregate_spatial, I recommend either CSV or netcdf output formats, which are more easy to handle than the default json.

Excellent! thanks for the tips. I tried the sample_by_feature via execute_batch, but I get an error:

MultipleAssetException                    Traceback (most recent call last)
File ~/.pyenv/versions/3.8.7/envs/ai4food/lib/python3.8/site-packages/openeo/rest/job.py:396, in JobResults.download_file(self, target, name)
    395 try:
--> 396     return self.get_asset(name=name).download(target=target)
    397 except MultipleAssetException:

File ~/.pyenv/versions/3.8.7/envs/ai4food/lib/python3.8/site-packages/openeo/rest/job.py:373, in JobResults.get_asset(self, name)
    372     else:
--> 373         raise MultipleAssetException("Multiple result assets for job {j}: {a}".format(
    374             j=self._job.job_id, a=[a.name for a in assets]
    375         ))
    376 else:

MultipleAssetException: Multiple result assets for job j-c153a2fc6138432c8c033a03a023d1be: ['openEO_0.nc', 'openEO_1.nc', 'openEO_10.nc', 'openEO_11.nc', 'openEO_2.nc', 'openEO_3.nc', 'openEO_4.nc', 'openEO_5.nc', 'openEO_6.nc', 'openEO_7.nc', 'openEO_8.nc', 'openEO_9.nc']

During handling of the above exception, another exception occurred:

OpenEoClientException                     Traceback (most recent call last)
/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb Cell 4 in <cell line: 16>()
     [14](vscode-notebook-cell:/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb#W3sZmlsZQ%3D%3D?line=13) s2 = s2.ndvi(red="B04", nir="B08", target_band='NDVI')
     [16](vscode-notebook-cell:/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb#W3sZmlsZQ%3D%3D?line=15) if not os.path.exists('./s2_meadow.nc'):
---> [17](vscode-notebook-cell:/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb#W3sZmlsZQ%3D%3D?line=16)     job = s2.execute_batch("s2_meadow.nc", out_format="netCDF", sample_by_feature=True)
     [18](vscode-notebook-cell:/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb#W3sZmlsZQ%3D%3D?line=17) else:
     [19](vscode-notebook-cell:/Users/mlubej/work/projects/AI4FOOD/FuseTS/notebooks/FuseTS_timeseries_production/create_object_based_signals.ipynb#W3sZmlsZQ%3D%3D?line=18)     ds = xarray.load_dataset('./s2_meadow.nc')
...
    397 except MultipleAssetException:
--> 398     raise OpenEoClientException(
    399         "Can not use `download_file` with multiple assets. Use `download_files` instead.")

OpenEoClientException: Can not use `download_file` with multiple assets. Use `download_files` instead.

mlubej commented 1 year ago

One more question - what would be the most sensible way to burn a raster of parcel IDs from the geometries into the xarray dataset?

My initial thoughts were:

rasterize the vector layer with the parcel IDs
import the raster to xarray
concatenate the xarrays (not sure if this can be done as a part of the openEO pipeline, or does it have to be done offline)

Or do you suggest some other way?

jdries commented 1 year ago

The error occurs because this call indeed results in multiple files. Try removing the file name, and rather assign the result of execute_batch to a 'job' variable. You can then retrieve results from this variable and invoke download_files. Or you can use the openeo web editor to even inspect the results of that failed call.

We're actually working on storing the id of input features also in output: https://github.com/Open-EO/openeo-geotrellis-extensions/issues/74 This is for the aggregate_spatial case, but if we can do that, we probably also can do something for sample_by_feature. Do you really want to rasterize it into pixels, or can it also be as an attribute inside the netCDF? (And as filename.)

mlubej commented 1 year ago

The error occurs because this call indeed results in multiple files. Try removing the file name, and rather assign the result of execute_batch to a 'job' variable. You can then retrieve results from this variable and invoke download_files.

Thanks, this worked.

We're actually working on storing the id of input features also in output. This is for the aggregate_spatial case, but if we can do that, we probably also can do something for sample_by_feature.

That sounds perfect. If I have the id of the input feature, then I can also merge this info into the dataframe in the end

Do you really want to rasterize it into pixels, or can it also be as an attribute inside the netCDF? (And as filename.)

I guess it can be as an attribute, this just means that I'll have to add it to the dataframe manually, but since this seems like a specific thing for my usecase anyway, it's fine by me if it's the way you describe it.

mlubej commented 1 year ago

After downloading a .nc file for each feature, I was able to load the dataset, convert it to a dataframe, reshape it, add metainfo, and merge everything:

Which then allowed me to group by the sampling feature and plot all pixels which correspond to it, and also to calculate the spatial average for each timestamp and plot also that.

Seems like a lot of clouds / invalid data gets through still.

@jdries I noticed the SENTINEL2_L2A_SENTINELHUB collection is used, where also the CLP and CLM calculated bands should be available. I tried downloading them but didn't get anything. Do you perhaps know why not?

More info at https://docs.sentinel-hub.com/api/latest/user-guides/cloud-masks/#cloud-masks-and-cloud-probabilities

Open-EO / FuseTS

SRD-07 data retrieval from openEO: raw bands, based on samples #63

Pixel-based approach

Parcel-based approach