Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

Recommended way to attach more technical/tracking metadata to batch jobs, sync requests, ... #472

Open · soxofaan opened this issue 1 year ago

soxofaan commented 1 year ago

While working on improvements to our batch job tracking system in the VITO backend, I was wondering about a good way to collect some additional metadata about jobs (or submitted process graphs in general): e.g. is the user using the Python client or the R client, which version of the client, is the user working in a Jupyter environment, is the user working in the hosted openeo.cloud JupyterLab, etc.? Having this info immediately available when doing user support can save a couple of round trips in the forum or other communication channels :smile:

The Python client currently sets a User-Agent header (https://github.com/Open-EO/openeo-python-client/blob/master/openeo/rest/connection.py#L63-L64), but that is not flexible enough for the more general metadata suggested above.
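For illustration, here is a minimal sketch of how a client could assemble a richer User-Agent string, including a heuristic Jupyter check. The helper name and the exact format are hypothetical, not the Python client's actual API:

```python
import platform
import sys

def build_user_agent(client_version: str) -> str:
    # Hypothetical helper: encode client name/version, Python version,
    # and a best-effort environment hint into one User-Agent string.
    parts = [
        f"openeo-python-client/{client_version}",
        f"python/{platform.python_version()}",
    ]
    # Heuristic: the "ipykernel" module is loaded when code runs
    # inside a Jupyter kernel.
    if "ipykernel" in sys.modules:
        parts.append("jupyter")
    return " ".join(parts)
```

A User-Agent string is a reasonable transport for this, but as discussed below it only describes the HTTP request, not the process graph itself.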

Does it make sense to standardize, or at least recommend, a way to embed this kind of metadata in e.g. POST /jobs, POST /result, ...?

m-mohr commented 1 year ago

Depending on what you send, this might also be relevant with regard to GDPR, I assume, so we need to be careful.

What data would that effectively be, and how would you get it? The openEO client would be easy to identify, I guess, but the clients themselves would then need to detect the environment somehow. That might be possible for Jupyter, but whether it's hosted or not can probably only be set via environment variables or a hard-coded host check.

Privacy aside, the User-Agent header is indeed a place to identify the client, although in the browser there are various other "user agents" involved (browser vs. HTTP client [axios] vs. openEO client).

Another thought: someone may submit via the Python client and then continue working in the Web Editor. So it doesn't necessarily save a round trip and could even be misleading.

My tendency right now is that this should not be part of the API. Clients we could try to simply handle via the User-Agent header; we just need to make sure it is set correctly in the clients themselves.

soxofaan commented 1 year ago

My original question was whether it could be useful to annotate a process graph with some information about which software/tool/library generated it. So something like:

```
POST /jobs
{
    "title": "...",
    "process": {
        "process_graph": {...},
        "generator": "openeo-python-client v1.2.3"
    }
}
```
But then I started wondering whether that is the most sensible location (maybe better at the top level of the request body?), and whether it could be made more general.

It could also be useful for end users who want to keep track of the version of the process graph that was used in a batch job (e.g. with a date, version string, or git commit hash).
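To make the idea concrete, here is a sketch of how a client could attach such an annotation when building the request body for POST /jobs. The "generator" field name follows the proposal above; it is a convention under discussion, not part of the openEO spec, and the helper name is hypothetical:

```python
def make_job_payload(process_graph: dict, title: str, generator: str) -> dict:
    # Place "generator" next to "process_graph", as proposed above.
    # The API allows additional properties, so backends that don't
    # know the field would simply ignore it.
    return {
        "title": title,
        "process": {
            "process_graph": process_graph,
            "generator": generator,
        },
    }
```

The same string could also carry a user-chosen version marker, e.g. `"my-workflow v0.4 (git 1a2b3c)"`.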

> ... the User-Agent header is indeed a place to identify the client although in the browser there are various other "user agents" involved (browser vs. HTTP client [axios] vs. openEO client). Another thought: someone may submit via the Python client and then continue working in the Web Editor.

As mentioned earlier, I was mostly thinking about tracking the conditions under which the process graph was built, not necessarily how/when the actual REST requests are made.

> My tendency right now is that this should not be part of the API.

Yes, that's fine for me; the current API already allows ad-hoc fields by design. I'm just fishing for some kind of weak recommendation or convention to avoid putting too many backend-specific things in the Python client.

m-mohr commented 1 year ago

It's not always obvious where a process graph was made; especially with the Web Editor, you run into conflicts.

I'm just wondering here whether the idea is really worth the effort...

soxofaan commented 1 year ago

Of course, after a lot of processing steps it can become unclear what the metadata refers to or should contain (e.g. we have the same problem with heavily processed EO data). This ticket is not about standardizing the full format or all the rules; it's just about a convention for the location, which is useful for use cases that have enough control over their workflow.

For example: you take a picture with your smartphone, download it to your computer to do some editing in, say, Photoshop, and then upload it to a sharing website (which also does some more processing). In the metadata (e.g. EXIF) it's indeed unclear what should be listed as the processing software, but it is still highly valuable to have a place to put the capture date or location.

jdries commented 1 year ago

From #480, I like the idea of a 'stac' property that allows us to customize output metadata, because I also have other use cases for it. We will want to properly consider how to deal with metadata at the collection/item/asset level.

Note that the original question in this ticket, about the automatic tracking of certain properties, is slightly different from users wanting to track certain properties themselves. (If a user sets properties, for instance, do they override the automatic ones, or do we perform some merging?)
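One possible answer to that override-vs-merge question can be sketched as a shallow merge where user-set properties take precedence over automatically tracked ones. This is a hypothetical policy for illustration, not decided behaviour:

```python
def merge_metadata(automatic: dict, user: dict) -> dict:
    # Start from the automatically tracked properties, then let any
    # user-provided values override them; keys present in only one
    # dict are kept as-is.
    merged = dict(automatic)
    merged.update(user)
    return merged
```

A deep merge (or rejecting conflicting keys outright) would be the obvious alternatives; the trade-off is between user control and keeping the automatic tracking trustworthy.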

m-mohr commented 1 year ago

There are indeed two different use cases in here:

  1. Customize result metadata (this is just for batch jobs)
    • This gets difficult quickly thinking about the different STAC entities (Collections [top-level], Items [in properties], Assets [top-level, but which?])
    • Would you really attach metadata to assets?
    • Otherwise, just merging into the corresponding location for metadata should work (conflict handling tbc)
  2. Add "metadata" from client libraries (i.e. the "generators" of the process)
    • This can be per node (e.g. the Web Editor adds the box location for each node) or per process (e.g. the client version)
    • In general the specifications allow additional properties, so adding them in the node or the process does not violate the spec.
    • We may want to group them, e.g. in a field such as "generator", or we could specify a prefix that everyone can use at the top level.
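To illustrate the two annotation levels from point 2, here is a sketch of a process object carrying both a per-node and a per-process extra property. The `generator:position` prefix and the field values are hypothetical; only the `process_graph` / `process_id` / `arguments` structure comes from the openEO spec:

```python
process = {
    "process_graph": {
        "load1": {
            "process_id": "load_collection",
            "arguments": {"id": "SENTINEL2"},
            # Per-node annotation, e.g. the Web Editor's box position
            # for this node on the canvas (hypothetical prefixed field).
            "generator:position": [120, 80],
        },
    },
    # Per-process annotation, e.g. the generating client and version.
    "generator": "openeo-python-client v1.2.3",
}
```

A shared prefix like `generator:` would make it easy for backends and other clients to recognize (and safely ignore) such fields.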