Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Enable sync request blocking #692

Closed EmileSonneveld closed 1 month ago

EmileSonneveld commented 4 months ago

After this has been running for a while, it can be changed to really cancel requests. Right now, it will only log if it would cancel. https://github.com/Open-EO/openeo-geopyspark-driver/issues/616
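
As a rough illustration of the log-only mode described above, the sketch below routes every "would block" decision through one function that either logs or rejects, depending on a flag. The names `handle_oversized_request`, `RequestTooLargeError`, and `blocking_enabled` are hypothetical and are not taken from the driver.

```python
import logging

logger = logging.getLogger("sync_request_check")


class RequestTooLargeError(Exception):
    """Hypothetical error type for a rejected synchronous request."""


def handle_oversized_request(reason: str, blocking_enabled: bool = False) -> bool:
    """Decide what happens once a request has been flagged as too large.

    In log-only mode (the default here) the request is allowed to continue
    and only a warning is written. With blocking enabled, the request is
    rejected by raising an exception.
    """
    if blocking_enabled:
        raise RequestTooLargeError(reason)
    logger.warning("Would block sync request: %s", reason)
    return True
```

Switching from logging to real cancellation then comes down to flipping one flag, once the logs show no false positives.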

An interesting test is also to run this on many successful process graphs and see if there are no false positives.

EmileSonneveld commented 4 months ago

I took a list of process graphs that ran successfully to test the verification. There were a few cases where the pixel count estimation could not work, and one false positive for COPERNICUS_30>DEM. Here, we probably have to specify an end date for the temporal extent of the collection, to avoid counting too many layers. Or, maybe better, specify a temporal resolution of 1000 years.
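
To make the false positive more concrete: if the number of layers is estimated as the length of the temporal extent divided by the temporal resolution, an open-ended extent on a static collection keeps growing with today's date. The sketch below shows that effect under this assumption. `estimate_layer_count`, the dates, and the resolutions are made up for illustration, and the driver's real estimation may work differently.

```python
from datetime import date
from math import ceil
from typing import Optional


def estimate_layer_count(
    start: date,
    end: Optional[date],
    resolution_days: float,
    today: date,
) -> int:
    """Estimate how many temporal layers fall inside an extent.

    An open-ended extent (end is None) is assumed to run until `today`.
    At least one layer is always counted.
    """
    effective_end = end if end is not None else today
    span_days = max((effective_end - start).days, 0)
    return max(1, ceil(span_days / resolution_days))


today = date(2024, 1, 1)  # hypothetical reference date
start = date(2010, 1, 1)  # hypothetical collection start

# Open-ended extent with a daily resolution: thousands of layers are counted.
open_ended = estimate_layer_count(start, None, 1, today)

# Option 1: close the extent with an explicit end date.
closed = estimate_layer_count(start, date(2010, 1, 2), 1, today)

# Option 2: keep the extent open, but use a resolution of roughly 1000 years.
coarse = estimate_layer_count(start, None, 1000 * 365.25, today)
```

Both options bring the estimate back to a single layer. The coarse resolution has the advantage that it stays correct as time passes.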

EmileSonneveld commented 3 months ago

Deduplication snippet used to reduce the process graph collection from 14 MB to 4.5 MB:

import logging
from pathlib import Path

import nltk

logger = logging.getLogger(__name__)

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep.jsonl")

process_list = process_graph_list_cdse_path.read_text().splitlines()

# Sort by length so near-duplicates end up close together, and drop very large graphs.
process_list = sorted(process_list, key=len)
process_list = [pg for pg in process_list if len(pg) < 600000]
logger.info("edit_distance loop")
# False means "drop this entry". Entries are never reset to True once dropped.
to_keep: list[bool] = [True] * len(process_list)
for i in range(len(process_list)):
    str_i = process_list[i]
    # Only compare against the next 99 entries to keep the loop cheap.
    for j in range(i + 1, min(i + 100, len(process_list))):
        if not to_keep[j]:
            continue
        str_j = process_list[j]
        if len(str_j) - len(str_i) > 5:  # sorted, so j is never shorter
            break
        if len(str_j) > 1000:
            # Cheap upper bound on the Levenshtein distance:
            # length difference plus position-wise mismatches.
            edit_distance = len(str_j) - len(str_i)
            edit_distance += sum(a != b for a, b in zip(str_i, str_j))
        else:
            edit_distance = nltk.edit_distance(str_i, str_j)
        if edit_distance < 50:
            to_keep[j] = False

# Filter by position. Using list.index() here would map exact duplicates
# to their first occurrence and keep all of them.
process_list = [pg for pg, keep in zip(process_list, to_keep) if keep]

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep2.jsonl")
with open(process_graph_list_cdse_path, mode="w") as f:
    for process_graph in process_list:
        f.write(process_graph + "\n")

EmileSonneveld commented 2 months ago

Status update: a few rounds of back-and-forth in merge review.

JeroenVerstraelen commented 1 month ago

TODO: Check if it blocks r-240503459b294bb9807553d14b287c60

EmileSonneveld commented 1 month ago

In the last 2 weeks, 38 requests were blocked, all because they were larger than 20000x20000 pixels. (None because of the pixel volume check yet.) The largest blocked request would have been 1300360x645827 pixels.
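
For reference, a minimal sketch of a size check like the one mentioned above. It assumes the 20000x20000 limit applies per axis (the thread does not say whether it is per axis or by area), and that width and height follow from the spatial extent divided by the resolution. The function names are hypothetical and not taken from the driver.

```python
from math import ceil

MAX_PIXELS_PER_AXIS = 20_000  # limit mentioned in this thread


def extent_to_pixels(west: float, south: float, east: float, north: float,
                     resolution: float) -> tuple[int, int]:
    """Convert a spatial extent to (width, height) in pixels.

    The extent and the resolution must be in the same units.
    """
    width = ceil((east - west) / resolution)
    height = ceil((north - south) / resolution)
    return width, height


def exceeds_dimension_limit(width_px: int, height_px: int,
                            limit: int = MAX_PIXELS_PER_AXIS) -> bool:
    """Assumed rule: flag the request if either axis exceeds the limit."""
    return width_px > limit or height_px > limit


# The largest blocked request reported above is far beyond the limit.
largest_blocked = exceeds_dimension_limit(1300360, 645827)
```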