Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

load_stac of long timeseries on cdse takes very long #799

Open jdries opened 3 weeks ago

jdries commented 3 weeks ago

Job id: j-2406079ce2dc4a4e863f4e4881c3778f. This job spent more than 45 minutes in the driver before the actual processing even started. The batch job logs indicate that the time was spent in the 'load_stac' implementation itself. That is plausible, because retrieving metadata for each individual item may be slow.

The process graph was very simple:

{
  "process_graph": {
    "loadstac1": {
      "process_id": "load_stac",
      "arguments": {
        "url": "https://openeo.dataspace.copernicus.eu/openeo/1.2/jobs/j-2406072110944ac9ae59a2fb48d47a10/results/MWFhNjdmNzUtZTE3OC00OTVlLWExMzEtOWY3ZGZmYWFhZTE4/7b007810d6c1820ffd9c4ee8fb18c339?expires=1718367635"
      }
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "loadstac1"
        },
        "format": "GTiff",
        "options": {}
      },
      "result": true
    }
  }
}

jdries commented 2 weeks ago

We use pystac to resolve all items, so this may be hard to parallelize within the backend. Speeding up the item retrieval itself is probably the better option.

jdries commented 2 weeks ago

I identified a method that is called once per item, but always with the same arguments (job id and user id). The method is relatively expensive, so I added caching, which should hopefully improve performance drastically.
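
To illustrate the kind of fix described here, the sketch below memoizes an expensive lookup keyed on job id and user id with `functools.lru_cache`. The function `get_job_metadata`, its return value and the `resolve_items` helper are hypothetical stand-ins, not the actual driver code.

```python
from functools import lru_cache

# Counts how often the expensive lookup really runs (for illustration only).
call_count = 0


@lru_cache(maxsize=128)
def get_job_metadata(job_id: str, user_id: str) -> dict:
    """Hypothetical stand-in for an expensive per-job lookup."""
    global call_count
    call_count += 1
    return {"job_id": job_id, "user_id": user_id}


def resolve_items(item_ids: list, job_id: str, user_id: str) -> list:
    # Every item asks for the same job metadata; only the first call does real work.
    return [(item_id, get_job_metadata(job_id, user_id)) for item_id in item_ids]


resolved = resolve_items([f"item-{i}" for i in range(1000)], "j-123", "user-1")
```

With 1000 items the lookup body runs only once. Note that `lru_cache` hands the same object to every caller, so the cached value should be treated as read-only.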

bossie commented 2 weeks ago

At which point is there a call to get_result_assets in this case?

jdries commented 2 weeks ago

Here: https://github.com/Open-EO/openeo-python-driver/blob/d5725229080989982436ce3986efb9b732e35792/openeo_driver/views.py#L1287 ? However, I am now wondering whether caching it is safe: the job results call here: https://github.com/Open-EO/openeo-python-driver/blob/d5725229080989982436ce3986efb9b732e35792/openeo_driver/views.py#L963 will also call get_result_assets.

~~So while job is not yet finished, assets will be incomplete, but method call gets cached because user is polling for partial results. Then job finishes, incorrect result without assets is returned because of caching...~~

Correction: there's a return statement for unfinished jobs in list_job_results, so assets will not be requested.

bossie commented 2 weeks ago

Not quite the same, as this is about a STAC API, but while debugging Darius' AGERA5 issue locally, I noticed that the time between these log lines gradually gets longer, even for items within the same page (FeatureCollection):

2024-06-12 14:55:23,948 DEBUG [Thread-4] file.FixedFeaturesOpenSearchClient (FixedFeaturesOpenSearchClient.scala:36) - added Feature(agera520210616,Extent(-180.05, -90.05, 179.95, 90.05),2021-06-16T00:00Z,[Lorg.openeo.opensearch.OpenSearchResponses$Link;@246520d8,None,None,Some(POLYGON ((179.95 -90.05, 179.95 90.05, -180.05 90.05, -180.05 -90.05, 179.95 -90.05))),None,GeneralProperties(None,None,None,None,None),None,0.0)
2024-06-12 14:55:26,279 DEBUG [Thread-4] file.FixedFeaturesOpenSearchClient (FixedFeaturesOpenSearchClient.scala:36) - added Feature(agera520210615,Extent(-180.05, -90.05, 179.95, 90.05),2021-06-15T00:00Z,[Lorg.openeo.opensearch.OpenSearchResponses$Link;@233fe018,None,None,Some(POLYGON ((179.95 -90.05, 179.95 90.05, -180.05 90.05, -180.05 -90.05, 179.95 -90.05))),None,GeneralProperties(None,None,None,None,None),None,0.0)

This might be relevant: it spent almost 20 minutes gathering the STAC items before processing started.
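
To quantify how the gap between consecutive "added Feature" lines grows, the timestamps can be pulled out of the log and diffed. A small sketch, assuming every line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the two lines above (messages shortened here; `parse_timestamp` is just a helper name for this example):

```python
from datetime import datetime

# The two log lines from above, with the Feature details shortened.
log_lines = [
    "2024-06-12 14:55:23,948 DEBUG [Thread-4] added Feature(agera520210616",
    "2024-06-12 14:55:26,279 DEBUG [Thread-4] added Feature(agera520210615",
]


def parse_timestamp(line: str) -> datetime:
    # The timestamp is the first 23 characters: "YYYY-MM-DD HH:MM:SS,mmm"
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


timestamps = [parse_timestamp(line) for line in log_lines]

# Seconds elapsed between each pair of consecutive log lines.
gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]

for line, gap in zip(log_lines[1:], gaps):
    print(f"{gap:8.3f}s  {line[24:]}")
```

For the two lines shown, the gap is about 2.3 seconds for a single item; running this over the full log would show whether the gap keeps growing.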