Open jdries opened 1 month ago
It looks like the issue happened because the non-standard process read_vector was used. It returns a DelayedVector with the URL as its path, which is then read several times by a few processes. Perhaps it's better to deprecate read_vector completely and promote the use of load_url instead? load_url should read the file into memory once (when it's called). However, judging by the code, it currently loads the file twice (once in the dry_run and once in evaluate).
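To illustrate the difference described above, here is a minimal sketch, not the actual driver code. `LazyGeometries`, `EagerGeometries` and `counting_fetch` are hypothetical names made up for this example; they only mimic "keep the URL and read on every access" versus "read once when called", and say nothing about how DelayedVector or DriverVectorCube are really implemented.

```python
from typing import Callable, List, Tuple

Fetch = Callable[[str], bytes]


class LazyGeometries:
    """Hypothetical stand-in for a delayed vector: stores only the URL."""

    def __init__(self, url: str, fetch: Fetch):
        self.url = url
        self._fetch = fetch

    @property
    def data(self) -> bytes:
        # Every access triggers a new download.
        return self._fetch(self.url)


class EagerGeometries:
    """Hypothetical stand-in for an eager load: downloads once, on construction."""

    def __init__(self, url: str, fetch: Fetch):
        self.url = url
        self.data = fetch(url)


def counting_fetch() -> Tuple[Fetch, List[str]]:
    """Return a fake fetch function plus the list of URLs it was asked for."""
    calls: List[str] = []

    def fetch(url: str) -> bytes:
        calls.append(url)
        return b'{"type": "FeatureCollection", "features": []}'

    return fetch, calls
```

With the lazy variant, each process that touches `data` causes another download; with the eager variant the number of downloads stays at one regardless of how often the result is used.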
Job id on vlcc: j-24052742024d46d6b651c82d510441a6
@JeroenVerstraelen deprecation is of course fine, but isn't it also possible now to just replace the implementation behind read_vector with the one from load_url? That would fix existing process graphs and reduce maintenance on our side.
Not sure if that could cause small changes in behaviour of read_vector, because one uses DelayedVector and the other DriverVectorCube. I guess we need to be careful to not break old UDPs/workflows, or should it be fine?
We have been doing this careful migration for some time now, so I guess we should be at the point where we can simply clean up these last remaining sources of DelayedVector? I'm also not sure whether there actually are differences. If there are, we may want to document them.
Artifactory access logs show that geojson files get downloaded twice, from the same IP, often within a very short time frame, by Python requests. This seems like a case where the code could use a bit more caching.
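As a rough idea of what such caching could look like, here is a minimal sketch that memoizes downloads per URL within one process. `make_cached_fetcher` is a hypothetical helper, not something that exists in the driver, and the `fetch` argument stands in for whatever currently performs the HTTP request (for example a thin wrapper around `requests.get(url).content`). It assumes the file behind a URL does not change during the lifetime of the cache.

```python
from functools import lru_cache
from typing import Callable


def make_cached_fetcher(
    fetch: Callable[[str], bytes], maxsize: int = 32
) -> Callable[[str], bytes]:
    """Wrap `fetch` so that each distinct URL is downloaded at most once.

    Hypothetical helper: results are kept in an in-process LRU cache,
    so a dry run and the later evaluation can share a single download.
    """

    @lru_cache(maxsize=maxsize)
    def cached(url: str) -> bytes:
        return fetch(url)

    return cached
```

A cache like this only helps when the dry run and the real evaluation happen in the same process; if they run in separate processes, an on-disk cache keyed by URL would be needed instead.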