bossie closed this issue 1 year ago.
I found a piece of code in the mask function that probably breaks the flow: https://github.com/Open-EO/openeo-geopyspark-driver/blob/c749e357e6e04943508c7c088e102bc3eba6e611/openeogeotrellis/geopysparkdatacube.py#L950
In this case, the mask has the same layout as the original RDD, and it should certainly not be retiled. Nowadays we also have an explicit check to resample in the Scala part, so this bit may be entirely unnecessary.
The most useful solution IMO would be a full 'pushdown' of the mask in cases where it is allowed, like this one. For example, if there are no process graph nodes between 'mask' and 'load_collection', a pushdown is clearly allowed.
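A minimal sketch of that adjacency check, assuming a simplified process graph represented as a dict keyed by node id, with node references in the usual openEO `{"from_node": ...}` form (`can_push_down_mask` is a hypothetical helper, not an existing function in the driver):

```python
def can_push_down_mask(process_graph: dict, mask_node_id: str) -> bool:
    """Return True if the mask node's "data" argument comes directly from a
    load_collection node, i.e. there are no intermediate process nodes."""
    mask_node = process_graph[mask_node_id]
    data_arg = mask_node["arguments"]["data"]
    # References to other nodes look like {"from_node": "<node_id>"}.
    source_id = data_arg.get("from_node")
    if source_id is None:
        return False
    return process_graph[source_id]["process_id"] == "load_collection"
```

With any node (e.g. an `ndvi`) between the `mask` and the `load_collection`, this check refuses the pushdown; the real decision logic in the driver is more involved.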
That makes sense to me. Where is the initial filtering done? Also in PyramidFactory.scala?
As I understand it, the 'pushdown' currently helps to avoid downloading areas from SentinelHub, and now we can also explicitly add pixel-precise masking.
Ok to only support this when there are no nodes in-between?
So this is where mask pushdown happens: https://github.com/Open-EO/openeo-python-driver/blob/a753f9bf267a3bd8a424c0f783c2703e00487f34/openeo_driver/ProcessGraphDeserializer.py#L1560 And this fairly obscure code can help to make decisions based on analysis of the process graph: https://github.com/Open-EO/openeo-python-driver/blob/a753f9bf267a3bd8a424c0f783c2703e00487f34/openeo_driver/dry_run.py#L294
There is already a data_mask_optimization flag; maybe we can turn it on or off based on how far the mask node is from the load_collection node, and which processes are in between.
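That "which processes are in between" decision could be sketched as a walk along the "data" chain from the mask node down to the load_collection node, allowing the optimization only if every intermediate process is whitelisted. Everything here is illustrative: the helper name, the whitelist contents, and the simplified graph layout are assumptions, not the driver's actual code:

```python
# Assumption: an illustrative whitelist; the real set of allowed processes
# would be decided in the driver.
WHITELISTED_PROCESSES = {"resample_spatial", "apply_kernel"}

def mask_optimization_allowed(process_graph: dict, mask_node_id: str,
                              whitelist=WHITELISTED_PROCESSES) -> bool:
    """Follow the "data" argument chain from the mask node; return True only
    if it reaches load_collection through whitelisted processes alone."""
    node = process_graph[mask_node_id]
    ref = node["arguments"]["data"].get("from_node")
    while ref is not None:
        node = process_graph[ref]
        if node["process_id"] == "load_collection":
            return True
        if node["process_id"] not in whitelist:
            return False
        ref = node["arguments"]["data"].get("from_node")
    return False
```

With, say, an `apply_kernel` (an SCL-dilation-style convolution) between mask and load_collection the flag would stay on; an arbitrary process in between would turn it off.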
@EmileSonneveld this is an example of a test that does assertions on 'LoadParameters' based on an input process graph: https://github.com/Open-EO/openeo-python-driver/blob/8d4fc5fc11e120f5ce7731777e7b2fec2c544356/tests/test_dry_run.py#L340
This is an example of an existing case of testing the mask process: https://github.com/Open-EO/openeo-python-driver/blob/8d4fc5fc11e120f5ce7731777e7b2fec2c544356/tests/test_views_execute.py#L674
We can go over these tomorrow to see how we can create a variant for this specific issue.
The problem was that an RDD containing mask data was loaded twice without being cached: once to filter out completely masked data tiles, and once to apply per-pixel masking on the remaining tiles. Caching could solve it, but the better solution is to apply the per-pixel masking directly in the first pass. This optimization can only be applied if the nodes between the mask node and the load_collection node are all whitelisted. This whitelisting was not there before, so this can be a slight improvement when convolutions are applied to the data (SCL dilation, for example).
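The double-evaluation problem can be illustrated without Spark: an uncached lazy source is recomputed on every pass, while the fused single-pass version reads it once. This is a plain-Python sketch of the idea, not driver code; the function and variable names are made up for the example:

```python
from typing import Callable, List

Tile = List[int]

def make_mask_source(counter: dict) -> Callable[[], List[Tile]]:
    """A lazy mask source; each call simulates re-fetching the mask tiles."""
    def load() -> List[Tile]:
        counter["loads"] += 1
        return [[0, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
    return load

def two_pass(data: List[Tile], load_mask) -> List[Tile]:
    # Pass 1: drop tiles that are completely masked.
    kept = [(d, m) for d, m in zip(data, load_mask()) if not all(m)]
    # Pass 2: per-pixel masking re-evaluates the uncached mask source.
    masks = [m for m in load_mask() if not all(m)]
    return [[None if mm else dd for dd, mm in zip(d, m)]
            for (d, _), m in zip(kept, masks)]

def one_pass(data: List[Tile], load_mask) -> List[Tile]:
    # Single pass: filter and mask each tile from one mask evaluation.
    out = []
    for d, m in zip(data, load_mask()):
        if all(m):
            continue  # completely masked tile: skip entirely
        out.append([None if mm else dd for dd, mm in zip(d, m)])
    return out
```

Both variants produce the same result, but `two_pass` triggers two loads of the mask data while `one_pass` triggers only one, which is the essence of the fix.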
This feature is covered by automated tests, but can also be tested with the following snippet: mask_twice.json.txt
This process graph will evaluate the raster mask RDD twice. This is not a huge problem for file-based collections, but in the case of Sentinel Hub it will do unnecessary tile requests. In particular, it sends out a single tile request for the data (N02) and 2 identical tile requests for the mask (CLOUD_FRACTION), for a total of 3.
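For readers without the attachment, a hypothetical minimal process graph of the same general shape (not the actual contents of mask_twice.json.txt; the collection id is made up) would load the data band and the mask band separately and feed both into a mask node, whose mask input is the RDD that gets evaluated twice:

```python
# Hypothetical process graph: one data load, one mask load, one mask process.
# Node ids and the collection id are illustrative assumptions.
process_graph = {
    "load_data": {
        "process_id": "load_collection",
        "arguments": {"id": "SOME_SHUB_COLLECTION", "bands": ["N02"]},
    },
    "load_mask": {
        "process_id": "load_collection",
        "arguments": {"id": "SOME_SHUB_COLLECTION", "bands": ["CLOUD_FRACTION"]},
    },
    "apply_mask": {
        "process_id": "mask",
        "arguments": {
            "data": {"from_node": "load_data"},
            "mask": {"from_node": "load_mask"},
        },
        "result": True,
    },
}
```

Without the single-pass fix, the `load_mask` branch is evaluated once for the tile-level filter and once for the per-pixel masking, hence the 2 identical mask requests.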
Enabling or omitting "node_caching" makes no difference. Strangely, I have also witnessed a batch job that in fact did only 2 requests (1 for the data, 1 for the mask), but that one didn't even specify "node_caching" (which defaults to false). :thinking: