Open-EO / openeo-processes-python

A Python representation of (most) openEO processes
Apache License 2.0

Calling processes without argument type sniffing #9

Open soxofaan opened 4 years ago

soxofaan commented 4 years ago

With the current openeo-processes-python API, every call (e.g. add(5, 2.5)) will involve type detection of the arguments ("num" twice in the example) in order to pick the correct implementation (Add.exec_num). Additionally, there might also be some list to numpy.array conversion.

Very often, an application using openeo-processes-python will know which calculation backend to use, so all this detection/preprocessing is unnecessary overhead.

What is or would be the best way of using openeo-processes-python when you know in advance what flavor (exec_num/exec_np/exec_xar/exec_dar/...) you want to use? Should one use Add.exec_num(5, 2.5) or are there more elegant solutions?
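To make the question concrete, here is a minimal sketch of the dispatch pattern under discussion (hypothetical code, not the package's actual implementation): a wrapper function sniffs argument types on every call to pick a flavor, versus calling that flavor directly.

```python
# Hypothetical sketch of type-sniffing dispatch vs. direct flavor calls.
import numpy as np

class Add:
    @staticmethod
    def exec_num(x, y):
        # plain-number implementation
        return x + y

    @staticmethod
    def exec_np(x, y):
        # numpy implementation; lists are converted to arrays first
        return np.asarray(x) + np.asarray(y)

def add(x, y):
    # argument type sniffing: inspect the inputs on every call
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return Add.exec_num(x, y)
    return Add.exec_np(x, y)

add(5, 2.5)           # sniffed -> Add.exec_num -> 7.5
Add.exec_num(5, 2.5)  # direct call, no detection overhead
```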

lforesta commented 4 years ago

Our idea was that you don't have to care about it, since the type is in fact detected when calling the function add() rather than the class method, as in your example. Do you think the overhead is so high?

@claxn did you run any tests on this?

soxofaan commented 4 years ago

Do you think the overhead is so high?

since c0f0a5ac67b924884e067aa8f04477d59a5e4849 it's better, I guess; it just feels like a bit of a waste to do all that detection logic if you already know in advance that you want, for example, the numpy version

claxn commented 4 years ago

Yes, indeed. If you find it more useful to directly call the functions without argument type sniffing, which would also be less error-prone, please go for that. However, I think it is a nice feature to have and provides a better interface. What if we add a specific keyword (e.g., dtype="numpy") to the functions? This keyword is then handled by the decorator, i.e., if such an argument is given, we skip the argument type sniffing.
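The dtype keyword idea could be sketched roughly like this (hypothetical code; the decorator name, class names, and dtype values are illustrative, not the package's real API):

```python
# Sketch of a dispatch decorator that skips type sniffing
# when an explicit dtype keyword is passed.
import numpy as np

def dispatch(cls):
    def wrapper(*args, dtype=None):
        if dtype is not None:
            # explicit flavor requested: skip sniffing entirely
            return getattr(cls, f"exec_{dtype}")(*args)
        # otherwise fall back to argument type sniffing
        if all(isinstance(a, (int, float)) for a in args):
            return cls.exec_num(*args)
        return cls.exec_np(*args)
    return wrapper

class _Add:
    @staticmethod
    def exec_num(x, y):
        return x + y

    @staticmethod
    def exec_np(x, y):
        return np.asarray(x) + np.asarray(y)

add = dispatch(_Add)

add(5, 2.5)               # sniffed
add(5, 2.5, dtype="num")  # explicit flavor, no detection
```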

lforesta commented 3 years ago

It's been a long time and none of us had much time for this package I guess, but I'm getting more interested in it since we will be using it more (particularly the xarray methods).

Recently I had some issues in correctly casting data to xarray and therefore I updated utils.py a few times (see #21). I'm thinking more and more that this automatic data casting is not so essential; it can lead to strange errors/results, and it's a bit difficult to debug imho. Although I was a supporter of the current implementation, I'm now leaning towards removing this wrapper altogether and letting the user call the needed method, e.g. sum.exec_xar(data), which is way clearer and cannot lead to unexpected behavior.

#16 is partially connected to this, with the automatic conversion to lists performed in the wrapper.

What do you think? @soxofaan @claxn @clausmichele @sophieherrmann

clausmichele commented 3 years ago

What if we have to add a single integer to an xarray? Should we still call add.exec_xar(data, 1) ?

lforesta commented 3 years ago

yes or equivalently sum.exec_xar([xr_1, xr_2, 1]), depending on the input
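The two direct-call styles could look like this (a hypothetical sketch; numpy arrays stand in for xarray.DataArray here, whose arithmetic broadcasts scalars the same way):

```python
# Sketch of direct exec_xar calls, with numpy arrays standing in
# for xarray.DataArray (both broadcast scalars element-wise).
import numpy as np

class Add:
    @staticmethod
    def exec_xar(x, y):
        # binary add: array + scalar broadcasts element-wise
        return x + y

class Sum:
    @staticmethod
    def exec_xar(data):
        # reduce a mixed list of arrays and scalars
        result = data[0]
        for item in data[1:]:
            result = result + item
        return result

xr_1 = np.array([1.0, 2.0])
xr_2 = np.array([10.0, 20.0])

Add.exec_xar(xr_1, 1)          # array([2., 3.])
Sum.exec_xar([xr_1, xr_2, 1])  # array([12., 23.])
```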

clausmichele commented 3 years ago

I think that we should consider this by looking into the processes implemented for the different data types. For example, some processes can't be available using numpy, since it lacks metadata support. On the other hand, only a few are implemented using xarray, but it is not difficult to add new ones. If the future plan is to have all the processes working with xarray, I would rather wait and make it the default when that goal is reached.

sophieherrmann commented 3 years ago

Although I originally really liked the idea of the automatic detection of the correct function, I agree with most of you that the overhead is getting too high. As far as I know, EODC and partly VITO (? @soxofaan ) are the only users of this package at the moment, and both are focusing on the xarray implementation.

Thanks to @ValentinaHutter, a very large number of xarray processes are already implemented by now. So we could think about stripping away the number and numpy implementations.

The only issue I currently see with only providing xarray functions is some corner cases which still use numpy implementations: e.g. for PGs using fit_curve we currently use oeop.array_element(**{'data': parameters, 'index': 1}) with parameters=[1, 1, 1], which is not an xarray and still uses the numpy implementation. It shouldn't be a problem to also use the xarray implementation in this case, but it should definitely be tested explicitly.
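This corner case could be covered by letting the xarray-flavored function accept plain lists as well, e.g. by converting first (a hypothetical sketch; the real oeop.array_element may differ):

```python
# Hypothetical sketch: an array_element that accepts a plain
# Python list (the fit_curve parameters case) as well as arrays,
# by converting the input up front.
import numpy as np

def array_element(data, index):
    # np.asarray is a no-op for arrays and converts plain lists,
    # so both input kinds go through the same indexing path
    return np.asarray(data)[index]

parameters = [1, 1, 1]
array_element(**{"data": parameters, "index": 1})  # -> 1
```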

soxofaan commented 3 years ago

As far as I know, EODC and partly VITO (? @soxofaan ) are the only users of this package

We use it indeed in the openeo-python-driver, basically to execute simple math/string/array operations that are not related to datacubes (that math happens in Spark/Scala space). For example, if you send this non-datacube process graph

{"add": {"process_id": "add", "arguments": {"x": 3, "y": 5}, "result": true}}

to the VITO backend, it is executed using openeo-processes-python functionality directly, very close to the HTTP serving layer, without going into the depths of heavy-duty DataCube handling (openeo-geopyspark-driver, openeo-geotrellis-extensions and the like).
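A minimal sketch of that kind of evaluation (hypothetical code, not openeo-python-driver's actual implementation): find the result node of the process graph and dispatch its process_id to a plain-Python implementation.

```python
# Hypothetical sketch of evaluating a non-datacube process graph
# by mapping process_id to plain-Python implementations.
PROCESSES = {
    "add": lambda x, y: x + y,
}

def evaluate(process_graph):
    # find the node flagged as the result and apply its process
    node = next(n for n in process_graph.values() if n.get("result"))
    impl = PROCESSES[node["process_id"]]
    return impl(**node["arguments"])

pg = {"add": {"process_id": "add", "arguments": {"x": 3, "y": 5}, "result": True}}
evaluate(pg)  # -> 8
```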

So in that sense, xarray is even a bit of overkill for our use of the library