Process graph simulation

m-mohr commented 4 years ago

During the processes telco, an issue came up that we don't have intermediate information about dimension labels available, especially individual timestamps after filtering. We need to find a way to allow this as it's important to know from a user perspective.

We could add a (free-to-use) /explore endpoint you can send a part of the process graph and it resturns cube metadata, e.g. the timestamps.

jdries commented 4 years ago

It's something I have been thinking about as well, but not always very easy. Could try it as an experimental feature if other backends are interested. The alternative approach for a user to figure out intermediate state is to use alternate versions of a process graph. For instance, I use polygonal aggregations a lot to figure out the list of available dates, or date with meaningfull data. This last point is important: your metadata may think that there is data at date x, but after applying cloud masking for instance, the number of actually relevant dates will be lower...

m-mohr commented 4 years ago

Well, users are interested as it seems to be something which is so important that it may actually break the whole "cloud" workflow for a user. So that seems to be something we need to address somehow. We still need to figure out what that is. Could also be the debug process, which would be similar to GEE's print, but would need very good integration in clients so users don't need to scroll through logs. I'm not sure whether your alternative approach would be something users would really do?!

MilutinMM commented 4 years ago

Just to add from user perspective that the timestamps are important metadata. As image observation dates are often irregular, user would like to know timestamps before specifying time interval, e.g., for the temporal aggregation process, or analysis of time series, etc. Using polygonal aggregation as Jeroen suggested, is a way around, but this also involves processing/costs, or?

m-mohr commented 4 years ago

Just to add from user perspective that the timestamps are important metadata.

Just getting timestamps for the collections in general is already possible, but it's not widely implemented as it (1) will be many timestamps and (2) it depends on the location etc.

That's why we probably need a bit more advanced way to "simulate" process graphs so that intermediate states of the metadata can be queried. Or the debug process and log files should be used instead.

Using polygonal aggregation as Jeroen suggested, is a way around, but this also involves processing/costs, or?

Yes, it involves costs. A "simulation" mode or any alternative should be free or very, very cheap, I think.

lforesta commented 4 years ago

What we discussed wasn't a process graph simulation, just the option of getting metadata about the selected cube view that the user wants to work with before actually running any processing. And I am definitely interested in supporting this functionality on our back-end, because I think it is essential for users.

I am not sure how we can simulate the process graph and return metadata at a point x in the middle of the process graph? For example, it could be possible to know the cube dimensions and size, but how to know if there is a valid pixel value at timestamp T5 in that cube after having applied n processes?

My suggestion is to keep exploration of 'initial' (meta)data separate from exploration of 'intermediate' (meta)data. The latter always involves processing data; for now we may include this in the /debug endpoint.

m-mohr commented 4 years ago

We discussed both and both is useful. Users may need the information somewhere in-between. Nevertheless, we can start with the simplest approach (just after load_collection + debug) first and then expand later.

Question is how to best place it in the API? Options could be:

GET /jobs/{job_id}/datacube (or whatever name): Process graph has been sent already via POST /jobs so now the back-end just needs to return the "metadata" about the data cubes that are actually returned by load_collection calls.. Drawback: Only works for batch jobs.
POST /simulation (or whatever name): Send a process graph once to the back-end and synchronously return the "metadata" about the data cubes that are actually returned by load_collection calls. Drawback: A synchronous call could be problematic (timeout?) for a huge amount of data requested in load_collection.
GET /collections/{collection_id}?spatial_extent=...&temporal_extent=...&bands=...&properties=...: Allow query parameters compatible to load_collection be passed to the collection endpoint so that the metadata is restricted to what has been specified in the parameters. Drawbacks: Seems a stretch to create the request, just sending the process graph is probably easier to implement for clients (and back-ends?). Also, passing a process graph in properties could lead to requiring a POST instead of a GET due to a lengthy query.

Please note that the endpoints may need to return multiple sets of metadata as multiple load_collection calls can be contained in the process graph.

soxofaan commented 4 years ago

I have been pondering about this kind or problem too, but then in the context of band (metadata) handling in the (python) client: should the client "simulate" what a backend is supposed to do in order to resolve band names/indexes properly, or should there be a way for the client to query metadata about an intermediate cube?

Anyway, about the original user question "for what dates do I have observations?" One could argue that these dates are not really metadata, but normal data as it depends for example on chosen spatio-temporal extent and cloud masking (or other) thresholds.

Given this perspective, let me spitball another possible solution: define a new process, e.g. "observation_dates" (feel free to replace with better name) that returns list of dates with "enough" (TBD what this should mean) non-empty data coverage in the spatial bbox.

Advantages:

no need to define dedicated API endpoint for this (although some kind of "Process graph is too difficult to simulate" error might be useful) avoiding feature creep there
being a process, it integrates effortlessly with the way process graphs work. Even though it looks like other processes, a backend still has liberty to take shortcuts for figuring out the non-empty dates if process graphs is simple enough and no actual image processing is necessary. Or fall back on actual image processing based approach, e.g. using very low-res data for efficiency. Or throw an error when it would be to hard to do efficiently. Or fall back on full processing of the process graph if the user is really ok with that.

MilutinMM commented 4 years ago

--- define a new process, e.g. "observation_dates" --- I like the idea, and I think it could be a useful process. If such a process can be applied per pixel then the output would be a unique list of timestamps for that pixel and specified sensor (e.g. S1b, S1a, etc.). If the process would be applied over an area then we would have a list of timestamps for each pixel. Those lists could be merged and only a list of unique elements could be send as output. Then, in a data cube view, this list could be used for labeling the time dimension. I would leave then to user to find out if there is a valid pixel value for all the pixels for a certain date.

--- there are some general comments when time stamps could be useful for a user: --

I assume that a remote sensing user would first familiarize with the data quality for certain area. To understand data quality, a user would need to know how many observations are per pixel available, and then further how many observations are there per year/season/quartal/month/week. Those info will help in understanding the data, e.g. before binning/aggregating, etc.
There is also another aspect that is related to time series. Timestamps (i.e. labels) are simply necessary there to perform time series analysis, to specify history/learning period, and monitoring period, etc.
Finally, timestamps would be also necessary when downloading rasters of individual dates, i.e. to know when we have a raster even though it covers just a portion of the queried spatial extend.

--- a general note --- We should not forget that the sampling along the dimension time is always irregular in time. The spatial sampling is always regular for particular band. It is only that the sampling frequency (i.e. pixel size) may change from band to band (e.g. S2).

jdries commented 4 years ago

Most of the use cases mentioned are supported already:

counting valid pixels over time is a reduce over time using 'count' method
retrieving valid dates over region is aggregate_polygon, optionally using count to get number of valid pixels inside polygon

Maybe it is a matter of properly documenting this, or even providing some convenience functions to the user, like an 'observation dates' function.

m-mohr commented 4 years ago

Indeed, counting is supported already. What is missing is getting the dimension labels (e.g. observation dates). What would solve the issue is implementing #91. Then it would be as easy as using

apply_dimension (or reduce) to get labels/timestamps for pixels
aggregate_polygon to get labels/timestamps for regions

In this case we would not need another process and it would mostly be inline with how counting works. If we add another process, I would probably define it more general as get_labels(raster-cube, dimension) or so (i.e. pass data cube and dimension and return an array of labels).

The difference between simulation in the API and a normal process is the cost! Simulation would probably be somewhat cheaper or free instead of doing normal processing as it belongs to the normal data discovery process, I guess?

lforesta commented 4 years ago

"valid" -> not sure what everyone means with this expression here. We can use the process count to count pixels/dates, but judging their validity is a different topic. Unless the user with a-priori knowledge expects to have N dates and get n<N and hence knows some data are not there (maybe the collection has already a cloud mask applied or so).

initial metadata For getting the metadata of the initial cube(s) returned by load_collection calls I would use a separate endpoint "explore_metadata" or so. In any case, I would avoid the word "simulation" since it hints that the full pg can be simulated, which it can't.

in between (meta)data If a user needs to have (meta)data in between the pg, this is an entirely different problem, and it requires processing. For this, I would use the process "debug"; we could try to properly define how this process should work in practice.

The main thing is that I would not mix the two topics into one.

m-mohr commented 4 years ago

"valid" -> not sure what everyone means with this expression here.

Valid according to the spec: see https://processes.openeo.org/#count and https://processes.openeo.org/#is_valid

For getting the metadata of the initial cube(s) returned by load_collection calls I would use a separate endpoint "explore_metadata" or so.

Not sure whether for such a limited scope we really need a new endpoint. If it's really just passing some parameters, I'd be in favor of adding parameters to the /collection/id endpoint as proposed above.

In any case, I would avoid the word "simulation" since it hints that the full pg can be simulated, which it can't.

If one would want to implement it, a process graph could technically be somewhat simulated. I guess a user would find it useful, but I see it's very hard to implement.

If a user needs to have (meta)data in between the pg, this is an entirely different problem, and it requires processing. For this, I would use the process "debug"; we could try to properly define how this process should work in practice.

I'd expect that passing a data cube to debug would result in a similar output as /collections/id. Telling a user that he needs to run and pay for a process graph twice is probably not selling very well.

lforesta commented 4 years ago

If one would want to implement it, a process graph could technically be somewhat simulated. I guess a user would find it useful, but I see it's very hard to implement.

Can we maybe define specifically what this simulation should return? For example the metadata of the datacube (dimensions and their cardinality) after each process? Or something more? If it's just the dimensions and their cardinality, this should be known by the user (e.g. after a reduce over time, this dimension is dropped, after a monthly temporal resampling, the cardinality of the temporal dimension has changed, etc ...)

I'd expect that passing a data cube to debug would result in a similar output as /collections/id. Telling a user that he needs to run and pay for a process graph twice is probably not selling very well.

Probably we are thinking of different things? I'm thinking of debug as a sort of (asynchronous) breakpoint that a user can place in a process graph, get the output of and then resume processing after inspecting the intermediate result (difficult and expensive to implement/run, but not completely impossible)

m-mohr commented 4 years ago

@lforesta

Can we maybe define specifically what this simulation should return? For example the metadata of the datacube (dimensions and their cardinality) after each process? Or something more?

Whatever the user needs (not sure yet - we need to ask the users, e.g. @przell, @MilutinMM, ...), but at least the dimension labels were requested above and that's nothing a user can know from the processes itself. Indeed, the user should know the dimensions by inspecting the processes.

Probably we are thinking of different things?

Seems so. Have you read https://processes.openeo.org/draft/#debug ? The API doesn't support halt/pause processing at any point, debug just sends information to a log file.

lforesta commented 4 years ago

Have you read https://processes.openeo.org/draft/#debug ? The API doesn't support halt/pause processing at any point, debug just sends information to a log file.

Yes I read it and I know there's not pause in the API, I thought we were also discussing if/how to change the debug process, never mind.

Back to the "simulation", I think it's hard (impossible?) to get dimensions labels and other types of infos about the datacube at a point x in between the pg without doing any processing. But I'll think more about our own implementation and see if there's a solution.

m-mohr commented 4 years ago

Back to the "simulation", I think it's hard (impossible?) to get dimensions labels and other types of infos about the datacube at a point x in between the pg without doing any processing.

That could be a conclusion and then the user has to bite the bullet. It's the same in GEE, with the "minor" difference that user's just do it and don't complain because it's free and usually fast.

przell commented 4 years ago

This has been mentioned before by others: I think it makes sense to distinguish the two use cases we are talking about (and placing them in different issues).

Identify time steps of the raw data cube before starting processing
Identify time steps at different stages of a process graph

To 1: As a user I would expect to get the time steps of a complete data cube in a similar way as describe_collection() in the r client. This currently gives only the start and end of the temporal dimension.

m-mohr commented 4 years ago

Okay, let's conclude for now:

We need to decide where to identify time steps of the raw data cube before starting processing. Potential options are listed above: https://github.com/Open-EO/openeo-api/issues/243#issuecomment-561721244 (Not sure myself, but tend slightly to option 3).
To identify time steps at different stages of a process graph: Document how debug works and what is expected to be logged when a raster cube is passed as input.

jdries commented 4 years ago

There's a lot of focus on cost on this issue, but please know that even if you call it 'simulation' or 'debug' in the api, the cost will be entirely the same. There's no 'cheap' option for me to lookup nodata pixels inside an area other than reading and counting them.

przell commented 4 years ago

@jdries: I think what you are describing is even another use case:

getting "valid" pixels at some step of a process graph I also can't imagine how to get this information without actually running the process. In my opinion, the other two cases are more closely related to the metadata and dimensions of a cube rather than actual values.

m-mohr commented 4 years ago

Moving this from 1.0-rc1 to final as I'd first like to see some implementations of debug and how well it works and what is missing before adding more things to the API. Afterwards we can evaluate what is missing and choose the best option to proceed with.

m-mohr commented 4 years ago

Any implementation available? Otherwise we probably have to push this from 1.0-final to future...

soxofaan commented 2 years ago

FYI: I just got this request from an aspiring openeo user (new VITO colleague):

I'm an (ex) GEE user and was wondering if it is possible in openeo to print information. For example if you loaded a collection and filtered on date and location: is it possible to inspect how many images are in the cube, which date, etc?

(translated from Dutch)

m-mohr commented 2 years ago

@soxofaan That's what the debug process is meant to be used for, that's basically the 1:1 equivalent to the getInfo() call the user knows from GEE. (And also as discussed in the meeting today.)

soxofaan commented 2 years ago

@soxofaan That's what the debug process is meant to be used for, that's basically the 1:1 equivalent to the getInfo() call the user knows from GEE. (And also as discussed in the meeting today.)

I don't completely agree:

debug is currently pretty vague about what information should be made available, but listing/summarizing dimension ranges of a cube is probably a bare minimum
debug, being a process, assumes you do sync/async processing of your datacube, I think it's backward to have to download a whole cube just to inspect the metadata
debug writes to the logs, so you even have to run a batch job (not ideal when you are doing initial exploration), you have to figure out how to get the logs (which heavily depends on the client you are using), and you have to search in them (logs can have very poor signal-noise ratio)

I'm not debating against debug, but in the spirit of improving the (interactive) user friendliness, I don't think debug should be the final answer for the user problem "what's available in this given spatio-temporal extent?".

m-mohr commented 2 years ago

Just to make sure: What the user refers to above is the getInfo() call in GEE. Have you worked with that before and are you aware how it works?

debug is currently pretty vague about what information should be made available, but listing/summarizing dimension ranges of a cube is probably a bare minimum

Indeed, because it's up to the implementation to decide what it can reasonably support. For Platform, we may need to agree on a common behavior, but that's nothing that we need to define in openEO in general. It also depends on what you insert there. The back-ends need to define behavior for different types, e.g. a summary for raster-cubes (e.g. STAC's cube:dimensions structure), vector-cubes, arrays (e.g. print some high-level statistics, see R), etc.

debug, being a process, assumes you do sync/async processing of your datacube, I think it's backward to have to download a whole cube just to inspect the metadata

The user requested basically getInfo (because it's the only way you can actually print in GEE) and debug is the equivalent with very similar behavior. getInfo also executes the whole flow and reports back in-between, but still runs the whole processing chain. Also, you don't need to download "a whole cube". How did you get to that assumption?

debug writes to the logs, so you even have to run a batch job (not ideal when you are doing initial exploration)

No, you can do small-scale synchronous processing and still use logs.

you have to figure out how to get the logs (which heavily depends on the client you are using)

Very easy in the Web Editor. For sync jobs, they open up automatically after completion, for batch jobs and services you can simply click the "bug" button, and a continuously updating UI for logs is shown. Other clients may need to catch up in supporting logs better, indeed.

and you have to search in them (logs can have very poor signal-noise ratio)

This is IMHO not an issue. You can set a custom "debug" identifier in the code parameter in debug. Afterward you can simply filter in the Web Editor or in the clients. That allows to retrieve a specific log entry easily.

I don't think debug should be the final answer for the user problem "what's available in this given spatio-temporal extent?".

Well, that was not the question you asked above. It was "I [...] was wondering if it is possible in openeo to print information". There might be better alternatives for the examples given, but for the general question the answer is: "debug". It was specified to capture exactly this use case.

The question about timestamps already came up in H2020: https://github.com/Open-EO/openeo-api/issues/346

Open-EO / openeo-api

Process graph simulation #243