Write data cube guide - Githubissues

m-mohr commented 3 years ago

It became obvious several times in openEO's history that people are often not aware of how data cubes and their methods (reduce, apply, ...) work. So I was thinking that a guide on how to work with data cubes, step by step and with examples, would help understanding.

The discussion in https://github.com/Open-EO/openeo-processes/pull/215#discussion_r551218809 has shown that the document should say that it's usually not a good idea to change data types in apply/reduce/... and should probably also list other pitfalls and potential limitations.

jonathom commented 3 years ago

In the search for a good visual representation, here are some first ideas:

I like the way things are displayed in the R stars package: [image: Screenshot from 2021-01-09 10-12-04] It also holds a good representation of what vector cubes are. (As mentioned before, images taken from https://r-spatial.github.io/stars/)

I have another idea that, in my view, is able to explain the sort of data that is held in DataCubes and therefore can show that DataCubes are n-dimensional (here: time, 3 bands and x, y). A first sketch (ignoring the structure on the bottom right): [image: index] Obviously this needs some improvement, e.g. the raster could be displayed as shown above, and the earth's surface could be depicted in more detail.

jonathom commented 3 years ago

Another possibility: Display as an actual cube, have z = time dimension and indicate different bands. However people might take the "Cube" too literally (as DataCubes can also contain 3, 5, etc. bands). With this it might be easier to graphically represent the cube operations.

edzer commented 3 years ago

Great sketches! That last one suggests that B2, B3, B4 and B8 are distributed over two dimensions, which is not very intuitive IMO, but showing that dimensions can be exchanged makes some sense. I put the R scripts that generated the above figures at https://gist.github.com/edzer/5f1b0faa3e93073784e01d5a4bb60eca

m-mohr commented 3 years ago

Yeah, I think your first sketch works very well with some more details. Spatial are x and y, z is the bands and could be visualized with different colors (e.g. different shades of the color per pixel, one band red, one band green, one band blue) and then have each timestamp be part of your timeline.
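The layout described above (x and y as spatial dimensions, bands as a third dimension, and one slice per timestamp) can be sketched in plain Python. This is only an illustration under stated assumptions: the labels, the sizes and the `make_cube` helper are made up for the example and are not part of openEO.

```python
from itertools import product

# Hypothetical dimension labels; a real cube would take these from its metadata.
DIMENSIONS = {
    "t": ["2020-10-01", "2020-10-13", "2020-10-25"],
    "bands": ["red", "green", "blue"],
    "y": [0, 1],
    "x": [0, 1, 2],
}


def make_cube(dimensions):
    """Build a toy cube: one value per combination of dimension labels."""
    keys = list(product(*dimensions.values()))
    # The value is a running counter so that every cell is distinguishable.
    return {key: float(i) for i, key in enumerate(keys)}


cube = make_cube(DIMENSIONS)

# The number of cells is the product of the dimension sizes: 3 * 3 * 2 * 3 = 54.
print(len(cube))

# A single pixel value is addressed by one label per dimension (t, bands, y, x).
print(cube[("2020-10-13", "green", 1, 2)])
```

Nothing in this structure limits it to three dimensions, which is the point the "cube" metaphor tends to hide.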

Vector cubes in openEO are not really a thing at the moment so we could skip that part for now, but if you have good ideas, feel free to write them down anyway and we can have them in a separate markdown file for now.

jonathom commented 3 years ago

Thank you for the feedback @m-mohr and also for the code @edzer, here's a first implementation of the idea: [image: cube1_ts_longer_points]

Still missing a representation of the surface (also not 100% sure if needed).

Please let me know any feedback. Sketches and/or graphs on the processes will follow.

m-mohr commented 3 years ago

I like that a lot, well done! Could you change the pink color to yellow or so? I find it hard to distinguish from the red above... or change the order of the colors to not have red and pink directly after each other.

jonathom commented 3 years ago

[image: exp_resample_time] This is a figure representing temporal resampling. I decided not to represent the resampling process itself (calculation of new time steps). Let me know if you disagree. I have a question regarding the date "2020-09-28" in the upsampling process: I am guessing that the resulting datacube just doesn't contain an image for dates that lie before the first date of the original cube. Is that correct? Would it be appropriate to delete the entry for "2020-09-28" on the timeline for the "output" (but keeping it at the "resample" timeline to show the difference)?

m-mohr commented 3 years ago

Whether 2020-09-28 has data or not depends on the upsampling method you use. It would make sense to just remove the empty timestamp, as indeed it would likely not be in the resulting data cube (or at least would be there with no-data).
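To make the dependence on the upsampling method concrete, here is a minimal sketch for a single pixel. The observation dates and values are invented, and the two helper functions are hypothetical stand-ins for "carry the previous value forward" and "take the nearest value"; they are not openEO processes.

```python
from datetime import date

# Hypothetical input: one value per observation date (a single pixel, for brevity).
observations = {
    date(2020, 10, 1): 0.42,
    date(2020, 10, 13): 0.47,
    date(2020, 10, 25): 0.51,
}

# Timestamps to resample to; 2020-09-28 lies before the first observation.
targets = [date(2020, 9, 28), date(2020, 10, 4), date(2020, 10, 10)]


def resample_previous(obs, target):
    """Take the most recent observation at or before the target, else no-data."""
    earlier = [d for d in obs if d <= target]
    return obs[max(earlier)] if earlier else None


def resample_nearest(obs, target):
    """Take the observation closest in time to the target."""
    nearest = min(obs, key=lambda d: abs((d - target).days))
    return obs[nearest]


previous = {t: resample_previous(observations, t) for t in targets}
nearest = {t: resample_nearest(observations, t) for t in targets}

print(previous[date(2020, 9, 28)])  # None: there is nothing to carry forward yet
print(nearest[date(2020, 9, 28)])   # 0.42: the first observation is the closest one
```

With the first method the 2020-09-28 slice is empty (or absent); with the second it is filled, so both versions of the diagram can be "right".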

I think I'm fine with not giving more details on the resampling, but maybe it's easier to understand if you change the label "resample" to "resample to"?

All the images look the same, which may confuse some, but overall I like the image. 👍

jonathom commented 3 years ago

> it would likely not be in the resulting data cube (or at least would be there with no-data).

Yeah, this is the tricky part, because I think if it's there with no data, then the current image is exactly right. But if this depends on the resampling function, I will delete the point for the first date; that's more intuitive.

> "resample" to "resample to"?

sure! good idea.

> All the images look the same

Yes, I will change this. Downsampling method will then be "mean" if that's alright. EDIT: things won't look so different then I'm afraid. Ideas to change that?

2nd EDIT: input is actually already displaying different time steps. Is the difference too subtle at this scale?

jonathom commented 3 years ago

Like so: [image: exp_resample_time]

m-mohr commented 3 years ago

Yeah, I now see that there's a subtle difference, but you need to look very closely to figure it out. Not sure whether that is actually an issue though. I guess we can leave it as it is for now. Changes in time series are often pretty subtle...

Other than that, the image looks good to me, thanks! 👍

jonathom commented 3 years ago

I have some questions about the spatial aggregation processes:

* The specification currently states that only a 3D cube (x, y + one other) can be processed. The topic is also discussed here: openeo/processes#126. Is this expected to change at some point? I would favor leaving this restriction out of the graphic if that's alright.
* Just out of curiosity, I don't really get what exactly aggregate_spatial_binary is doing. Instead of a list it only gets passed two values. Which two values and what's the advantage of that?
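As a rough picture of what spatial aggregation does to a 3D cube (x, y plus one other dimension, here time), consider the toy sketch below. It is not the openEO `aggregate_spatial` process: the cube values, the pre-rasterised geometry and the `aggregate_over_geometry` helper are all assumptions made for illustration.

```python
from statistics import mean

# Toy 3D cube with dimensions (t, y, x); the values are made up.
cube = {
    "2020-10-01": [[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]],
    "2020-10-13": [[2.0, 4.0, 6.0],
                   [8.0, 10.0, 12.0]],
}

# Hypothetical geometry, already rasterised to the (y, x) pixels it covers.
geometry_pixels = [(0, 0), (0, 1), (1, 0), (1, 1)]


def aggregate_over_geometry(cube, pixels, reducer):
    """Reduce all pixels inside the geometry to one value per time step."""
    return {
        t: reducer([grid[y][x] for (y, x) in pixels])
        for t, grid in cube.items()
    }


result = aggregate_over_geometry(cube, geometry_pixels, mean)
print(result)
```

The spatial dimensions collapse into the geometry, while the remaining dimension (time) survives in the result.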

Regarding the previous discussion

> Changes in time series are often pretty subtle...

Indeed. I think that in most graphics these very subtle changes are ok (as you say, we can always change that later on). They also result from the fact that breaks are set automatically for each raster. In the case where this is important (apply graphics, looking at single pixel values), I manually set breaks (so far only for third graphic).

edit @m-mohr

m-mohr commented 3 years ago

> Is this expected to change at some point?

Not sure. I think not in the next 6 months at least.

> I would favor leaving this restriction out of the graphic if that's alright.

Yes, I think that is fine for me.

> Just out of curiosity, I don't really get what exactly aggregate_spatial_binary is doing.

It is basically the same, just the way it reduces the values is different.

> Instead of a list it only gets passed two values. Which two values and what's the advantage of that?

The binary variant uses a reducer (see e.g. the JS reduce operation) that works on two values at a time, which allows reducing very large lists that would otherwise exceed the available memory. The list variant (i.e. non-binary) works on a list directly. So it's mostly a way to optimize the operation for very large data.
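The difference between the two reducer styles can be sketched with Python's `functools.reduce`, which plays the same role as the JS reduce operation mentioned above. The function names and the `pixel_stream` generator are hypothetical and only stand in for a very large set of values.

```python
from functools import reduce


def sum_list(values):
    """List-style reducer: receives all values at once."""
    return sum(values)


def sum_binary(x, y):
    """Binary-style reducer: only ever receives two values."""
    return x + y


def pixel_stream():
    """Stand-in for a very large sequence that is produced lazily."""
    for value in range(1, 1_000_001):
        yield value


# The list variant needs the whole list materialised in memory first.
total_from_list = sum_list(list(pixel_stream()))

# The binary variant folds the stream pairwise and keeps one running value.
total_from_binary = reduce(sum_binary, pixel_stream())

print(total_from_list == total_from_binary)
```

The trade-off is that not every reduction fits the binary form: a sum or a maximum does, whereas something like a median generally needs to see all values together.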

m-mohr commented 3 years ago

@jonathom In this thread https://github.com/Open-EO/openeo-processes/pull/215#discussion_r551218809 we discussed that we should add some guidance that data cube (child) processes should be careful with data type changes. For example, if a reducer gets an array of numbers, it should also return a number and not e.g. a string or an array. Could you add that somewhere in the general data cube descriptions, please? cc for review: @soxofaan
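A small sketch of the pitfall, in plain Python rather than openEO processes: both reducers below are valid functions, but only the first one returns the same kind of value it was given. The `keeps_numeric_type` check is a hypothetical helper for illustration, not something a back-end is known to run.

```python
from numbers import Number


def mean_reducer(values):
    """Numbers in, number out: the result fits where the inputs came from."""
    return sum(values) / len(values)


def label_reducer(values):
    """Numbers in, string out: legal Python, but it changes the data type."""
    return "high" if max(values) > 0.5 else "low"


def keeps_numeric_type(reducer, sample):
    """Hypothetical check: does the reducer return a plain number for numeric input?"""
    result = reducer(sample)
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(result, Number) and not isinstance(result, bool)


sample = [0.2, 0.4, 0.9]
print(keeps_numeric_type(mean_reducer, sample))   # True
print(keeps_numeric_type(label_reducer, sample))  # False
```

A cube whose cells suddenly hold strings or arrays is hard for subsequent processes to handle, which is why the guidance is to keep the type stable.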

jonathom commented 3 years ago

@m-mohr I'm not entirely sure if I understand what's going on, so let's discuss it in the next meeting. First thought: Maybe this is something for the cookbook (#16), since it is much more "how to do" than "how does it work"? Also, the cookbook could then have a whole first section dedicated to "how to work with datacubes" as a further reference after the datacube guide (not only because of this, just generally).

soxofaan commented 3 years ago

Nice diagrams!

Some feedback/ideas:

* On the downsample part: the "resample to" of "2020-10-29" results in an output for "2010-10-30". I guess this is a typo?
* About the very subtle differences between cubes at different time stamps: maybe you could add a clouded area in the middle input?
* I would add a bit more space between the band layers; it will make the structure a bit more legible, I think (especially for small sizes).
* It looks a bit weird that there is no slice for 2020-09-28 in the upsample example. I understand that this depends on the upsampling technique, but I would ignore that implementation detail for the diagram. The diagram itself, without background info, looks broken now.
* In the current diagram there is room to use the full titles "Temporal Downsampling" and "Temporal Upsampling". It's probably nitpicking, but otherwise the pyramid shape might suggest to some people that there is also spatial down/upsampling going on.

jonathom commented 3 years ago

Thank you for the feedback @soxofaan! The datacube guide with much more graphics is already online and a version with some of your corrections (type, title change) can be seen here. I'd be happy if you want to have a look and leave some more feedback!

Regarding two points from above:

* a clouded area is a good idea, however I think a lot of operations that are explained here wouldn't be executed on non-ARD. Might be confusing then.
* space between the layers: Because the graphics are about different sampling processes, visibility of the single cubes isn't in focus in these graphics. However if you think other graphics in the datacube guide could use space / enlargement etc., let me know!

soxofaan commented 3 years ago

these online docs look very pretty, nice improvement!

m-mohr commented 3 years ago

@jonathom We also forgot to remove the Data Cube description from the glossary: https://openeo.org/documentation/1.0/glossary.html

Another thing we should talk about in the "Dimensions" section is that dimensions can have special characteristics, e.g. spatial and temporal dimensions are expected to have a natural order, temporal dimensions use the Gregorian calendar by default, ...
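What "natural order" and "Gregorian by default" mean in practice can be sketched as follows. The labels are hypothetical, and treating band labels as purely nominal is an assumption made for the example.

```python
from datetime import date

# Hypothetical labels for two dimensions of the same cube.
temporal_labels = ["2020-10-25", "2020-10-01", "2020-10-13"]
band_labels = ["B8", "B2", "B4", "B3"]

# Temporal labels parse as calendar dates; Python's date type uses the
# (proleptic) Gregorian calendar, matching the default mentioned above.
parsed = sorted(date.fromisoformat(label) for label in temporal_labels)

# With a natural order, the step between neighbouring labels is well defined.
gaps_in_days = [(later - earlier).days for earlier, later in zip(parsed, parsed[1:])]
print(gaps_in_days)

# Assumption: band labels are nominal categories, so they are looked up
# by label rather than by their position along an axis.
band_index = {label: position for position, label in enumerate(band_labels)}
print(band_index["B4"])
```

Processes such as temporal resampling rely on the first property; selecting a band only needs the second kind of lookup.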

jonathom commented 3 years ago

> We also forgot to remove the Data Cube description from the glossary

done, collecting these fixes in branch "dcguide". I added the old glossary datacube md in datacubes/.scripts for later reference.

Additional note to myself: I also forgot to talk about the CRS as a dimension, as in the old glossary.

m-mohr commented 3 years ago

> I added the old glossary datacube md in datacubes/.scripts for later reference.

I don't think this is required, we have version control for this. Let's discuss later

m-mohr commented 3 years ago

This is all done, right @jonathom ? Feel free to close then.

clausmichele commented 3 years ago

> Thank you for the feedback @soxofaan! The datacube guide with much more graphics is already online and a version with some of your corrections (type, title change) can be seen here. I'd be happy if you want to have a look and leave some more feedback!
>
> Regarding two points from above:
>
> * a clouded area is a good idea, however I think a lot of operations that are explained here wouldn't be executed on non-ARD. Might be confusing then.
> * space between the layers: Because the graphics are about different sampling processes, visibility of the single cubes isn't in focus in these graphics. However if you think other graphics in the datacube guide could use space / enlargement etc., let me know!

Really nice guide! I've just seen it and it will be super useful for many others.

Open-EO / openeo.org

Write data cube guide #26