Closed theroggy closed 10 months ago
Hi, thanks for contributing.
Can you explain why chunk_size
is important for an end user to set or finetune? It's not that the current API of the openEO python client allows to handle individual chunks. Is there some kind of performance or memory bottleneck you are trying to workaround here?
Hi, thanks for contributing.
Can you explain why
chunk_size
is important for an end user to set or finetune? It's not that the current API of the openEO python client allows to handle individual chunks. Is there some kind of performance or memory bottleneck you are trying to workaround here?
It's in the first place to limit the memory usage during download. The files I'm downloading are up to 3 GB, so it's not really elegant that this is all kept in memory during the download.
A contributing factor is that downloading large files is very slow (ref https://github.com/Open-EO/openeo-geopyspark-driver/issues/631), so when a download takes up to 6 hours for 2 GB, using the chunk_size helps to get an idea if there is any progress during the download. It doesn't seem to help to solve the download performance issue though.
I wonder if it wouldn't be a more user friendly solution to just set a better, non-None
default for chunk_size
at
instead of bothering the user with figuring out chunk_size
details themselves.
If I understand https://github.com/psf/requests/issues/5536 correctly, the biggest issue is that stream=True
+chunk_size=None
is not a good combo at the moment.
Avoiding the default chunk_size=None
is probably a good thing anyway.
Afterwards we can still make it configurable more widely, but nobody is probably going to need that.
what chunk_size values are you experimenting with at the moment? a couple of MB?
True, if the default would have been different, I wouldn't have bothered changing it.
I am now using 50 MB chunk size, but this is just arbitrarily choosen.
for now, I'm going to close this one in favor of #528 then.
The
chunk_size
parameter is only exposed inResultAsset.download
. This PR exposes it in all download variants + adds inline doc for it.Also worth noting:
stream=True
is being used, but at the time of writing this doesn't really seems to do anything: the file being downloaded is always read completely into memory anymay. More information can be found here: https://github.com/psf/requests/issues/5536