geopandas / contextily

Context geo-tiles in Python
https://contextily.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
493 stars 81 forks source link

Added support for parallel tile downloads and control of cache #217

Closed JacobJeppesen closed 1 year ago

JacobJeppesen commented 1 year ago

This pull request is a fix for https://github.com/geopandas/contextily/issues/215

I have added support for parallel tile downloads in the bounds2raster and bounds2img functions. It gives some quite significant speed improvements, with minor changes to the code. I wasn't sure were to put max_num_parallel_tile_downloads = 32, so for now it is on line 227 in tile.py (i.e., inside the bounds2img function code). Let me know if there is anything you would like to have changed.

Thanks for an awesome package btw.! :smiley:

JacobJeppesen commented 1 year ago

Just made a small change as downloads occasionally failed due a memory error when using threads for the parallel downloads. It was caused by a clash with the memory caching from joblib to cache the _fetch_tile() function, and was fixed by using processes instead of threads for the parallel download.

The processes take ~0.5s to spawn, so this added a minor delay before downloading starts, but it is still significantly faster compared to the for-loop implementation.

ljwolf commented 1 year ago

I wonder whether any useful default max_num_parallel_downloads will break some services' TOS?

For example, OSM asks you to restrict yourself to two download connections at a time.. I've raised TOS-violating defaults before w.r.t. caching in #202, but I'm not sure what the maintainers (@darribas, @martinfleis) think? I'm fine with letting the user be the "violator," but we should probably document that the defaults set in the package may violate some TOS for some providers.

JacobJeppesen commented 1 year ago

Good point. I've just made a new commit where the default value is changed to 1. The num_parallel_tile_downloads and max_num_parallel_downloads variables were also renamed to n_connections and max_connections, respectively, and I cleaned up the code a bit.

I also added tests of raise ValueError("n_connections must be between 1 and {max_connections}") in bounds2img() to address the CodeCov errors in the checks.

JacobJeppesen commented 1 year ago

Yeah, agree that hardcoding max_connections wasn't ideal. I couldn't figure out a good place to put it, but I think there should be a default limit set, so it is clear to the user to see that setting n_connections too high can put excess stress on the tile server. I've just made a new commit where I've added it as a parameter. That is one solution, but it could also just be a warning instead of an error. Or it could it be omitted entirely, and the documentation for n_connections could just state that the user should be careful with it.

In the same commit, I also added a parameter to disable the tile caching, as that makes using parallel connections in resource constrained environments much faster. I.e., it took a lot of time to spawn the processes in small serverless functions, where disabling the cache so the parallel download can be done with threads avoids this. I wasn't sure whether this should be added as a parameter though (or if it should be added at all).

So, in general I wasn't sure whether these changes were in line with the thoughts behind Contextily :sweat_smile: Let me know what you think, and then we might need another round of changes before its ready :slightly_smiling_face:

JacobJeppesen commented 1 year ago

Sounds good :)

I've just made a new commit based on the comments. The max_connections parameter is gone, and I've updated the docstrings with more details for n_connections and disable_cache.

ljwolf commented 1 year ago

I think so, this is great!

I would make the option use_cache=True, rather than disable_cache=False, since a double negative (disable and default to False) can be confusing, but that's the only thing I'd change.

martinfleis commented 1 year ago

@JacobJeppesen Can you change the keyword per @ljwolf's suggestion? It makes sense.

JacobJeppesen commented 1 year ago

Yeah, I agree that the double negative should be avoided. I've changed it to use_cache=True in the newest commit.

Let me know if there are any more changes needed :)

martinfleis commented 1 year ago

Thanks @JacobJeppesen!

JacobJeppesen commented 1 year ago

Thanks to you too! :)