Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Enable sync request blocking #692

Closed EmileSonneveld closed 1 month ago

EmileSonneveld commented 4 months ago

Once this has been running for a while, it can be switched to actually cancel requests. Right now, it only logs when it would have cancelled one.
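A minimal sketch of such a log-only switch, for illustration. The names `BLOCK_REQUESTS`, `RequestTooLargeError` and `handle_verdict` are hypothetical and not taken from the driver's code:

```python
import logging

logger = logging.getLogger("sync_request_check")

# Hypothetical switch: False means "log only", True means "really cancel".
BLOCK_REQUESTS = False


class RequestTooLargeError(Exception):
    """Hypothetical error raised when a sync request is rejected."""


def handle_verdict(too_large: bool, reason: str, block: bool = BLOCK_REQUESTS) -> bool:
    """Return True if the request may proceed; raise if it is blocked."""
    if not too_large:
        return True
    if block:
        # Enforcing mode: reject the request.
        raise RequestTooLargeError(reason)
    # Log-only mode: record what would have happened, then let it through.
    logger.warning("Would block sync request: %s", reason)
    return True
```

Keeping the decision in one place means the later switch to real blocking is a one-line change, and the logs collected in the meantime show what would have been rejected.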

Another useful test is to run this on many process graphs that are known to succeed and check that there are no false positives.

EmileSonneveld commented 4 months ago

I took a list of process graphs that ran successfully to test the verification. There were a few cases where the pixel count estimation could not work, and one false positive for COPERNICUS_30>DEM. Here, we probably have to specify an end date for the temporal extent of the collection, to avoid it counting too many layers. Or, maybe better, specify a temporal resolution of 1000 years.
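To illustrate why an open-ended temporal extent inflates the estimate, here is a rough sketch. It assumes the estimator counts one layer per temporal step between the start and end of the extent. The function `estimate_layer_count` and the daily default step are hypothetical, not the driver's actual logic:

```python
import math
from datetime import date
from typing import Optional


def estimate_layer_count(start: date, end: Optional[date], step_days: float = 1.0) -> int:
    """Estimate the number of temporal layers in [start, end).

    Assumption: an open end (None) falls back to today, and every step
    of `step_days` counts as one layer.
    """
    if end is None:
        end = date.today()
    days = max((end - start).days, 0)
    # Always count at least one layer, even for a very coarse resolution.
    return max(1, math.ceil(days / step_days))


# Static dataset treated as if it had daily layers: thousands of layers.
daily = estimate_layer_count(date(2010, 1, 1), date(2020, 1, 1))

# Same extent with a resolution of roughly 1000 years: a single layer.
coarse = estimate_layer_count(date(2010, 1, 1), date(2020, 1, 1), step_days=365_000)
print(daily, coarse)
```

Under these assumptions, either fix mentioned above works: a fixed end date bounds the count, while a very coarse resolution collapses it to one layer.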

EmileSonneveld commented 3 months ago

Deduplication snippet used to reduce the process graph collection from 14 MB to 4.5 MB:

from pathlib import Path
from typing import Optional

import nltk

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep.jsonl")

process_list = process_graph_list_cdse_path.read_text().splitlines()

# Sort by length so that near-duplicates end up close to each other.
process_list = list(sorted(process_list, key=len))
process_list = list(filter(lambda x: len(x) < 600000, process_list))
print("edit_distance loop\n")
# None = undecided, False = near-duplicate of an earlier entry.
to_keep: list[Optional[bool]] = [None] * len(process_list)
breaker = False
for i in range(0, len(process_list)):
    if breaker:
        break
    # Only compare against the next 100 entries to keep the loop tractable.
    for j in range(i + 1, min(i + 100, len(process_list))):
        if to_keep[j] is False:
            continue  # already marked as a duplicate
        str_i = process_list[i]
        str_j = process_list[j]
        if len(str_j) - len(str_i) > 5:  # sorted, so j is always longer
            break  # all later entries are at least as long
        if len(str_j) > 1000:
            # Cheap positional comparison for long strings.
            # estimates a higher edit_distance compared to Levenshtein
            edit_distance = len(str_j) - len(str_i)
            for k in range(len(str_i)):
                if str_i[k] != str_j[k]:
                    edit_distance += 1
        else:
            edit_distance = nltk.edit_distance(str_i, str_j)
        if edit_distance < 50:
            to_keep[j] = False
            # Can't put on true, because it might match later

# Filter by position: list.index() would return the first of several identical
# strings and so keep exact duplicates.
process_list = [pg for pg, keep in zip(process_list, to_keep) if keep is not False]

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep2.jsonl")
with open(process_graph_list_cdse_path, mode="w") as f:
    for process_graph in process_list:
        f.write(process_graph + "\n")

EmileSonneveld commented 2 months ago

Status update: a few rounds of back-and-forth in the merge review.

JeroenVerstraelen commented 1 month ago

TODO: Check if it blocks r-240503459b294bb9807553d14b287c60

EmileSonneveld commented 1 month ago

In the last 2 weeks, 38 requests were blocked, all because they were larger than 20000x20000 pixels. (None because of the pixel volume check yet.) The largest blocked request would have been 1300360x645827 pixels.
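For a sense of scale, a small sketch that applies the spatial limit to the largest blocked request. It assumes "larger than 20000x20000 pixels" means the width times height exceeds 20000 * 20000. The driver may compare dimensions differently, and `exceeds_spatial_limit` is a hypothetical helper:

```python
# Assumed limit, derived from the "20000x20000 pixels" figure above.
MAX_SPATIAL_PIXELS = 20000 * 20000


def exceeds_spatial_limit(width_px: int, height_px: int, limit: int = MAX_SPATIAL_PIXELS) -> bool:
    """Return True if the spatial pixel count is above the limit."""
    return width_px * height_px > limit


largest_blocked = (1300360, 645827)
print(exceeds_spatial_limit(*largest_blocked))

# How many times over the limit the largest blocked request was.
print(largest_blocked[0] * largest_blocked[1] / MAX_SPATIAL_PIXELS)
```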