Closed JacobJeppesen closed 1 year ago
Just made a small change as downloads occasionally failed due a memory error when using threads for the parallel downloads. It was caused by a clash with the memory caching from joblib to cache the _fetch_tile() function, and was fixed by using processes instead of threads for the parallel download.
The processes take ~0.5s to spawn, so this added a minor delay before downloading starts, but it is still significantly faster compared to the for-loop implementation.
I wonder whether any useful default max_num_parallel_downloads
will break some services' TOS?
For example, OSM asks you to restrict yourself to two download connections at a time.. I've raised TOS-violating defaults before w.r.t. caching in #202, but I'm not sure what the maintainers (@darribas, @martinfleis) think? I'm fine with letting the user be the "violator," but we should probably document that the defaults set in the package may violate some TOS for some providers.
Good point. I've just made a new commit where the default value is changed to 1. The num_parallel_tile_downloads
and max_num_parallel_downloads
variables were also renamed to n_connections
and max_connections
, respectively, and I cleaned up the code a bit.
I also added tests of raise ValueError("n_connections must be between 1 and {max_connections}")
in bounds2img()
to address the CodeCov errors in the checks.
Yeah, agree that hardcoding max_connections
wasn't ideal. I couldn't figure out a good place to put it, but I think there should be a default limit set, so it is clear to the user to see that setting n_connections
too high can put excess stress on the tile server. I've just made a new commit where I've added it as a parameter. That is one solution, but it could also just be a warning instead of an error. Or it could it be omitted entirely, and the documentation for n_connections
could just state that the user should be careful with it.
In the same commit, I also added a parameter to disable the tile caching, as that makes using parallel connections in resource constrained environments much faster. I.e., it took a lot of time to spawn the processes in small serverless functions, where disabling the cache so the parallel download can be done with threads avoids this. I wasn't sure whether this should be added as a parameter though (or if it should be added at all).
So, in general I wasn't sure whether these changes were in line with the thoughts behind Contextily :sweat_smile: Let me know what you think, and then we might need another round of changes before its ready :slightly_smiling_face:
Sounds good :)
I've just made a new commit based on the comments. The max_connections
parameter is gone, and I've updated the docstrings with more details for n_connections
and disable_cache
.
I think so, this is great!
I would make the option use_cache=True
, rather than disable_cache=False
, since a double negative (disable
and default to False
) can be confusing, but that's the only thing I'd change.
@JacobJeppesen Can you change the keyword per @ljwolf's suggestion? It makes sense.
Yeah, I agree that the double negative should be avoided. I've changed it to use_cache=True
in the newest commit.
Let me know if there are any more changes needed :)
Thanks @JacobJeppesen!
Thanks to you too! :)
This pull request is a fix for https://github.com/geopandas/contextily/issues/215
I have added support for parallel tile downloads in the bounds2raster and bounds2img functions. It gives some quite significant speed improvements, with minor changes to the code. I wasn't sure were to put
max_num_parallel_tile_downloads = 32
, so for now it is on line 227 intile.py
(i.e., inside the bounds2img function code). Let me know if there is anything you would like to have changed.Thanks for an awesome package btw.! :smiley: