WorldCereal / worldcereal-classification

This repository contains the classification module of the WorldCereal system.
https://esa-worldcereal.org/
MIT License

Solve memory issues for larger inference jobs #99

Closed by kvantricht 4 weeks ago

kvantricht commented 1 month ago

Currently, any job larger than 5 km x 5 km fails with a confusing memory error. This needs to be investigated and fixed before we can send the system to ESA for review.

kvantricht commented 1 month ago

For now, increasing executor-memory and executor-memoryOverhead has helped considerably:

job_options = {
    "driver-memory": "4g",
    "executor-memory": "3g",
    "executor-memoryOverhead": "5g",
    "udf-dependency-archives": [f"{ONNX_DEPS_URL}#onnx_deps"],
}
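
To make the memory budget of these options easier to compare, here is a minimal sketch that adds up the memory requested per executor. It assumes the total per executor is the sum of "executor-memory" and "executor-memoryOverhead", as is usual for Spark-style settings. The helper names memory_to_gib and total_executor_memory_gib are hypothetical and not part of openEO or WorldCereal.

```python
import re

# Conversion factors from Spark-style size suffixes to GiB (binary units).
_UNITS_IN_GIB = {"k": 1.0 / (1024 * 1024), "m": 1.0 / 1024, "g": 1.0, "t": 1024.0}


def memory_to_gib(value: str) -> float:
    """Convert a memory string such as "3g" or "512m" to GiB (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognised memory string: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS_IN_GIB[unit]


def total_executor_memory_gib(job_options: dict) -> float:
    """Sum heap and overhead memory per executor (assumption: total = heap + overhead)."""
    heap = memory_to_gib(job_options["executor-memory"])
    overhead = memory_to_gib(job_options["executor-memoryOverhead"])
    return heap + overhead


# Same memory values as in the job options above; the dependency archive is left out
# because it does not affect the memory calculation.
job_options = {
    "driver-memory": "4g",
    "executor-memory": "3g",
    "executor-memoryOverhead": "5g",
}

print(total_executor_memory_gib(job_options))  # 8.0
```

Under that assumption, the settings above request 8 GiB per executor, most of it as overhead outside the JVM heap.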

jdries commented 4 weeks ago

Further investigation revealed this issue: https://github.com/Open-EO/openeo-gfmap/issues/142

I also added a new config to better constrain Python memory use: https://github.com/eu-cdse/openeo-cdse-infra/issues/44. I then made decommissioning less aggressive: https://github.com/eu-cdse/openeo-cdse-infra/issues/197

With these changes combined, I could run a job for a 20 km x 20 km tile with a total executor memory of 4 GB.