Open-EO / openeo-gfmap

Generic framework for EO mapping applications building on openEO
Apache License 2.0

Minimal example for downstream inference UDF #27

Closed: kvantricht closed this issue 2 months ago

kvantricht commented 5 months ago

We need a minimal example showing how external projects can make use of OpenEO-GFMAP functionality for inference purposes:

kvantricht commented 2 months ago

@VictorVerhaert, according to Hans you would already have an inference UDF notebook for grassland watch. Would you be able to share it in a PR so @GriffinBabe can have a look at it?

VictorVerhaert commented 2 months ago

Yes, I'll add it to the examples on GitHub. If you want (and it fits in our next sprint), I could also take a look at creating an example notebook that is as minimal as possible.

VictorVerhaert commented 2 months ago

My inference notebook does not use GFMAP, however. I use a shared .py file containing the preprocessing steps: my extraction pipeline (GFMAP) applies this .py file after the fetchers, while my inference pipeline just uses load_collection.

For now I would just suggest putting this example in https://github.com/Open-EO/openeo-community-examples and referencing it here.

VictorVerhaert commented 2 months ago

FYI you can inspect my pipelines here: https://github.com/gisat/grasslandwatch/tree/main/lc_offline

kvantricht commented 2 months ago

> My inference notebook does not use GFMAP however.

Ah, OK, interesting. Definitely useful, but we should also work on a GFMAP-based inference workflow here.

VictorVerhaert commented 2 months ago

I assume the functionality of GFMAP for inference would mainly be to split up the spatial extent we want to run inference on, plus job management, right?

kvantricht commented 2 months ago

GFMAP standardizes band names across backends, lays out typical data-flow paths, takes care of loading collections and rescaling them into the most efficient datatype, applies collection-specific standardized processes, etc. That goes well beyond just job splitting.
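
To make that concrete, a GFMAP-based inference entry point could start roughly like the sketch below. The names used here (`BackendContext`, `build_sentinel2_l2a_extractor`, `get_cube`, the standardized band names) are my reading of the current fetching module and may not match the final API exactly:

```python
import openeo

# Names below are assumptions based on the current fetching module, not a fixed API.
from openeo_gfmap import (
    Backend,
    BackendContext,
    BoundingBoxExtent,
    FetchType,
    TemporalContext,
)
from openeo_gfmap.fetching import build_sentinel2_l2a_extractor

connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()
backend_context = BackendContext(Backend.CDSE)

# Standardized band names, identical to the ones used in the extraction pipeline.
bands = ["S2-L2A-B02", "S2-L2A-B03", "S2-L2A-B04", "S2-L2A-B08"]

fetcher = build_sentinel2_l2a_extractor(backend_context, bands=bands, fetch_type=FetchType.TILE)

spatial_extent = BoundingBoxExtent(west=5.00, south=51.20, east=5.10, north=51.30, epsg=4326)
temporal_extent = TemporalContext("2022-01-01", "2022-12-31")

# The fetcher takes care of loading the collection, renaming bands and rescaling.
cube = fetcher.get_cube(connection, spatial_extent, temporal_extent)
# ... shared preprocessing + the inference UDF would be applied on `cube` from here on ...
```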

VictorVerhaert commented 2 months ago

Yes, of course; I meant what would be visible in the example notebook and what to focus on in the explanation. It might indeed be good to emphasize that using the same pipeline for extraction and inference is crucial for accurate results, because of the optimizations you mention happening in the background.

GriffinBabe commented 2 months ago

@VictorVerhaert one thing about the extraction pipeline:

The S1 bands are rescaled to uint16 in the fetching preprocessing: https://github.com/Open-EO/openeo-gfmap/blob/main/src/openeo_gfmap/fetching/s1.py#L132. This is a memory optimization for openEO, as the collections are in float32 power values. Those values are automatically converted back to decibels in the feature extractor, unless the user disables that with a flag: https://github.com/Open-EO/openeo-gfmap/blob/main/src/openeo_gfmap/features/feature_extractor.py#L110. Now I see that you perform some compositing operations, so we should probably do that rescaling after preprocessing and before entering the FeatureExtractor.
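
For illustration, the round-trip behind that optimization looks conceptually like this; the scale factor below is a placeholder, not the value used in s1.py:

```python
import numpy as np

# Sigma0 backscatter in power units, as delivered by the collection (float32).
power = np.array([0.004, 0.02, 0.15], dtype=np.float32)

# Hypothetical compact uint16 encoding used for transfer (placeholder scale factor).
scale = 1e-4
encoded = np.clip(power / scale, 0, 65535).astype(np.uint16)

# What the feature extractor does by default: back to power, then to decibels.
decoded_power = encoded.astype(np.float32) * scale
backscatter_db = 10.0 * np.log10(decoded_power)
print(backscatter_db)  # roughly [-24.0, -17.0, -8.2] dB
```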

GriffinBabe commented 2 months ago

@kvantricht @VictorVerhaert

I like the idea of using the common ONNX format. I see online that it is possible to convert scikit-learn, PyTorch and TensorFlow models to that format; even CatBoost is directly compatible.
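
For example, exporting a scikit-learn model with skl2onnx looks like this; the tensor name "input" is a choice we would have to agree on, since the inference UDF needs to know it:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Dummy training data just for the sake of the example.
X = np.random.rand(100, 8).astype(np.float32)
y = np.random.randint(0, 3, size=100)

model = RandomForestClassifier(n_estimators=50).fit(X, y)

# The input tensor name chosen here ("input") is what the inference UDF would need to know.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```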

Based on the inference UDF of @VictorVerhaert and the feature extractor functionality already implemented in GFMAP, I came up with a first idea for a model inference base class that a user can override to implement their own model inference pipeline. Please take a look and tell me what you think: https://github.com/Open-EO/openeo-gfmap/blob/a7b0cd7ff05e0de73460776fb148a31d8a0167f4/src/openeo_gfmap/inference/model_inference.py

We could very well also provide a default model inference implementation that takes only a path to download the ONNX model and the name of the input tensor as parameters, and that returns the probability values or directly the class with maximum probability.
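
As a sketch (not the class in model_inference.py), such a default implementation could boil down to something like this, with the model URL and input tensor name as the only parameters:

```python
import urllib.request

import numpy as np
import onnxruntime as ort


def run_onnx_inference(features: np.ndarray, model_url: str, input_name: str) -> np.ndarray:
    """features: (n_pixels, n_features) float32 array coming out of the feature extractor."""
    model_path, _ = urllib.request.urlretrieve(model_url)  # download the ONNX model
    session = ort.InferenceSession(model_path)

    # Run the model; exported classifiers typically expose labels and probabilities.
    outputs = session.run(None, {input_name: features.astype(np.float32)})
    probabilities = outputs[-1]  # assumption: the last output holds the probabilities

    # Return the probabilities; the caller can still reduce to the most likely class.
    return probabilities
```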

One thing that needs to be taken care of by the user is the ONNX dependency within the openEO job. In the long term this could be included directly in the default openEO UDF environment, but for now we need to specify the .zip file in the udf-dependency-archives parameter at job creation, which is done manually at the moment. Maybe that's something to discuss in the redesign discussion, @VincentVerelst.
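
Concretely, attaching that dependency at job creation currently looks something like the snippet below, where the archive URL is a placeholder and `cube` is the inference datacube:

```python
# Pass the onnxruntime archive via the udf-dependency-archives job option (placeholder URL).
job = cube.create_job(
    out_format="GTiff",
    title="gfmap-inference",
    job_options={
        "udf-dependency-archives": [
            "https://example.com/onnx_dependencies.zip#onnx_deps",  # placeholder archive
        ],
    },
)
job.start_and_wait()
```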

VictorVerhaert commented 2 months ago

On this last point: @HansVRP and I had a similar discussion this morning. I think that in the long run onnxruntime should be included in the standard UDF environment, as we are advising different projects to use ONNX models.

GriffinBabe commented 2 months ago

Closed by #88