Open-EO / openeo-python-client

Python client API for OpenEO
https://open-eo.github.io/openeo-python-client/
Apache License 2.0
156 stars 42 forks source link

ENH: expose chunk_size parameter in download functions #524

Closed theroggy closed 10 months ago

theroggy commented 10 months ago

The chunk_size parameter is only exposed in ResultAsset.download. This PR exposes it in all download variants + adds inline doc for it.

Also worth noting: stream=True is being used, but at the time of writing this doesn't really seems to do anything: the file being downloaded is always read completely into memory anymay. More information can be found here: https://github.com/psf/requests/issues/5536

soxofaan commented 10 months ago

Hi, thanks for contributing.

Can you explain why chunk_size is important for an end user to set or finetune? It's not that the current API of the openEO python client allows to handle individual chunks. Is there some kind of performance or memory bottleneck you are trying to workaround here?

theroggy commented 10 months ago

Hi, thanks for contributing.

Can you explain why chunk_size is important for an end user to set or finetune? It's not that the current API of the openEO python client allows to handle individual chunks. Is there some kind of performance or memory bottleneck you are trying to workaround here?

It's in the first place to limit the memory usage during download. The files I'm downloading are up to 3 GB, so it's not really elegant that this is all kept in memory during the download.

A contributing factor is that downloading large files is very slow (ref https://github.com/Open-EO/openeo-geopyspark-driver/issues/631), so when a download takes up to 6 hours for 2 GB, using the chunk_size helps to get an idea if there is any progress during the download. It doesn't seem to help to solve the download performance issue though.

soxofaan commented 10 months ago

I wonder if it wouldn't be a more user friendly solution to just set a better, non-None default for chunk_size at

https://github.com/Open-EO/openeo-python-client/blob/66358cb18c753eb38961462af42a5b56d8a130af/openeo/rest/job.py#L354

instead of bothering the user with figuring out chunk_size details themselves.

If I understand https://github.com/psf/requests/issues/5536 correctly, the biggest issue is that stream=True+chunk_size=None is not a good combo at the moment. Avoiding the default chunk_size=None is probably a good thing anyway. Afterwards we can still make it configurable more widely, but nobody is probably going to need that.

soxofaan commented 10 months ago

what chunk_size values are you experimenting with at the moment? a couple of MB?

theroggy commented 10 months ago

True, if the default would have been different, I wouldn't have bothered changing it.

I am now using 50 MB chunk size, but this is just arbitrarily choosen.

soxofaan commented 10 months ago

for now, I'm going to close this one in favor of #528 then.