geojson gets downloaded 2 times from python

Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)

Apache License 2.0

geojson gets downloaded 2 times from python #783

Open jdries opened 1 month ago

jdries commented 1 month ago

Artifactory access logs show that GeoJSON files get downloaded twice from the same IP, often within a very short time frame, by Python requests. This looks like a case where the code could use a bit more caching.
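To illustrate the kind of caching meant here, below is a minimal sketch that fetches each URL at most once and serves repeated reads from memory. The class `CachedUrlReader` and the injected `fetch` callable are hypothetical names for illustration only; they are not part of openeo-geopyspark-driver, and the demo uses a fake fetcher so no network access is needed.

```python
from typing import Callable, Dict, List


class CachedUrlReader:
    """Hypothetical helper: download each URL at most once per instance."""

    def __init__(self, fetch: Callable[[str], bytes]):
        # `fetch` is an assumption: any callable that returns the body for a URL,
        # e.g. a thin wrapper around an HTTP client.
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}

    def read(self, url: str) -> bytes:
        # Only hit the remote server on the first request for this URL.
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


# Demo with a fake fetcher that records calls instead of using the network.
calls: List[str] = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b'{"type": "FeatureCollection", "features": []}'


reader = CachedUrlReader(fake_fetch)
first = reader.read("https://example.com/fields.geojson")
second = reader.read("https://example.com/fields.geojson")
print(f"downloads performed: {len(calls)}")
```

The cache lives on the instance, so its lifetime can be tied to a single job or request; a process-wide cache would additionally need an eviction policy for large files.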

JeroenVerstraelen commented 1 month ago

It looks like the issue happened because the non-standard process read_vector was used. It returns a DelayedVector with the URL as its path, which is then read several times by a few processes. Perhaps it's better to deprecate read_vector completely and promote the use of load_url instead? load_url should read the file into memory once (when it's called). However, judging by the code, it currently loads the file twice (once in the dry_run and once in evaluate).

Job id on vlcc: j-24052742024d46d6b651c82d510441a6

jdries commented 1 month ago

@JeroenVerstraelen deprecation is of course fine, but isn't it now also possible to just replace the implementation behind read_vector with the one from load_url? That would fix existing process graphs and reduce maintenance on our side.
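A rough sketch of what such a swap could look like: the old process name stays registered, but it only emits a deprecation warning and delegates to the newer implementation. The `PROCESSES` registry, the `process` decorator, and both function bodies are hypothetical stand-ins, not the driver's actual code, and the return value is a plain dict rather than a real DriverVectorCube.

```python
import warnings
from typing import Any, Callable, Dict

# Hypothetical registry; the real driver has its own process registration mechanism.
PROCESSES: Dict[str, Callable[..., Any]] = {}


def process(name: str):
    """Register a function under an openEO-style process name (illustrative only)."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        PROCESSES[name] = func
        return func
    return decorator


@process("load_url")
def load_url(url: str, format: str = "GeoJSON") -> dict:
    # Stand-in: a real implementation would download and parse the file once.
    return {"source": url, "format": format}


@process("read_vector")
def read_vector(filename: str) -> dict:
    # Keep the old process name working, but route it through load_url.
    warnings.warn(
        "read_vector is deprecated; use load_url instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return load_url(url=filename, format="GeoJSON")
```

With this shape, existing process graphs that call read_vector keep working, while any behavioural difference between the two result types shows up in one place.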

JeroenVerstraelen commented 1 month ago

I'm not sure whether that could cause small changes in the behaviour of read_vector, because one uses DelayedVector and the other DriverVectorCube. I guess we need to be careful not to break old UDPs/workflows, or should it be fine?

jdries commented 4 weeks ago

We have been doing this careful migration for some time now, so I guess we should be at the point where we can simply clean up these last remaining sources of DelayedVector? I'm also not sure whether there actually are differences. If there are, we may want to document them.