Closed EmileSonneveld closed 1 month ago
I took a list of process graphs that ran successfully to test the verification. There were a few cases where the pixel count estimation could not work, and one false positive for COPERNICUS_30 > DEM. Here, we probably have to specify an end date for the temporal extent of the collection, to avoid counting too many layers. Or, maybe better, specify a temporal resolution of 1000 years.
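To illustrate why an open-ended temporal extent can inflate the estimate, here is a minimal sketch. The function `estimate_layer_count`, its parameters and the dates are hypothetical and not the driver's actual code. It assumes the number of layers is estimated as the length of the temporal extent divided by the temporal resolution, and that an open end is counted up to the current date.

```python
import math
from datetime import datetime, timedelta
from typing import Optional


def estimate_layer_count(
    start: datetime,
    end: Optional[datetime],
    temporal_resolution: timedelta,
    now: Optional[datetime] = None,
) -> int:
    """Hypothetical estimate of the number of temporal layers in a collection."""
    if end is None:
        # Assumption: an open-ended extent is counted up to "now".
        end = now or datetime.now()
    span = end - start
    # Always count at least one layer, even for a static dataset.
    return max(1, math.ceil(span / temporal_resolution))


# Illustrative dates only, not the real extent of any collection.
start = datetime(2010, 1, 1)
now = datetime(2024, 1, 1)

# Open end with a daily resolution: thousands of layers for a static dataset.
daily_open = estimate_layer_count(start, None, timedelta(days=1), now=now)

# Fix 1: give the extent an explicit end date.
daily_closed = estimate_layer_count(start, start + timedelta(days=1), timedelta(days=1))

# Fix 2: keep the open end, but use a resolution of roughly 1000 years.
coarse_open = estimate_layer_count(start, None, timedelta(days=365 * 1000), now=now)

print(daily_open, daily_closed, coarse_open)
```

Under these assumptions, both proposed fixes bring the estimate down to a single layer, while the open-ended daily case multiplies the pixel count by the number of days since the start date.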
Deduplication snippet used to reduce the process graph collection from 14 MB to 4.5 MB:
```python
import logging
from pathlib import Path
from typing import Optional

import nltk

logger = logging.getLogger(__name__)

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep.jsonl")
process_list = process_graph_list_cdse_path.read_text().splitlines()
# Sort by length so that near-duplicates end up close to each other.
process_list = sorted(process_list, key=len)
process_list = [p for p in process_list if len(p) < 600000]

logger.info("edit_distance loop")
# None means "no decision yet", False means "drop this entry".
to_keep: list[Optional[bool]] = [None] * len(process_list)
for i in range(len(process_list)):
    # Only compare against the next 100 entries to keep the loop fast.
    for j in range(i + 1, min(i + 100, len(process_list))):
        if to_keep[j] is False:
            continue
        str_i = process_list[i]
        str_j = process_list[j]
        if len(str_j) - len(str_i) > 5:  # sorted, so j is always longer
            break
        if len(str_j) > 1000:
            # Cheap estimate: always at least as high as the Levenshtein distance
            edit_distance = len(str_j) - len(str_i)
            for k in range(len(str_i)):
                if str_i[k] != str_j[k]:
                    edit_distance += 1
        else:
            edit_distance = nltk.edit_distance(str_i, str_j)
        if edit_distance < 50:
            to_keep[j] = False
            # Can't put on true, because it might match later

# Filter by position: looking entries up with list.index() would return the
# first occurrence and therefore keep exact duplicates.
process_list = [p for idx, p in enumerate(process_list) if to_keep[idx] is not False]

process_graph_list_cdse_path = Path("/home/emile/openeo/openeo-collection-tests/process_graph_list_mep2.jsonl")
with open(process_graph_list_cdse_path, mode="w") as f:
    for process_graph in process_list:
        f.write(process_graph + "\n")
```
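The shortcut used for long strings in the snippet above can be checked in isolation. The sketch below is only an illustration: the helper names `positional_distance` and `levenshtein` are made up for this example, and the Levenshtein implementation is a plain dynamic-programming version rather than the `nltk` one. It shows that counting position-wise mismatches plus the length difference never underestimates the real edit distance, but can overestimate it a lot.

```python
def positional_distance(shorter: str, longer: str) -> int:
    """Cheap upper bound on the Levenshtein distance.

    Assumes len(shorter) <= len(longer), as in the sorted list above.
    """
    mismatches = sum(1 for a, b in zip(shorter, longer) if a != b)
    return (len(longer) - len(shorter)) + mismatches


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance, for comparison only."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


# A single character inserted at the front shifts every later position.
estimate = positional_distance("abcdef", "xabcdef")
exact = levenshtein("abcdef", "xabcdef")
print(estimate, exact)
```

Because the estimate errs on the high side, the shortcut can only cause near-duplicates to be kept, never cause distinct graphs to be dropped, which seems an acceptable trade-off for a test collection.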
Status update: a few rounds of back-and-forth in the merge review.
TODO: Check if it blocks r-240503459b294bb9807553d14b287c60
In the last 2 weeks, 38 requests were blocked, all because they were larger than 20000x20000 pixels (none because of the pixel volume check yet). The largest blocked request would have been 1300360x645827 pixels.
After this has been running for a while, it can be changed to actually cancel requests. Right now, it only logs when it would have cancelled one. https://github.com/Open-EO/openeo-geopyspark-driver/issues/616
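The log-only rollout described above could look roughly like the following sketch. The function name `should_cancel` and the `dry_run` flag are hypothetical and not taken from the driver. It also assumes that "larger than 20000x20000 pixels" means the total pixel count exceeds 20000 * 20000, which may differ from the real check.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: the limit applies to the total number of pixels.
MAX_PIXELS = 20000 * 20000


def should_cancel(width: int, height: int, dry_run: bool = True) -> bool:
    """Hypothetical size check that only logs while dry_run is enabled."""
    too_large = width * height > MAX_PIXELS
    if too_large and dry_run:
        # Log-only mode: report the request, but let it continue.
        logger.warning(
            "Request of %dx%d pixels would have been cancelled", width, height
        )
        return False
    return too_large


# The largest blocked request mentioned above.
logged_only = should_cancel(1300360, 645827)
cancelled = should_cancel(1300360, 645827, dry_run=False)
small_request = should_cancel(1000, 1000, dry_run=False)
```

Keeping the decision behind a single flag would make it easy to collect statistics first and switch to real cancellation later without touching the check itself.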
Another interesting test is to run this on many successful process graphs and check that there are no false positives.