CLIMADA-project / climada_python

Python (3.8+) version of CLIMADA
GNU General Public License v3.0

Reducing the number of API requests using the cache. (And rate limiting?) #887

Closed. ChrisFairless closed this issue 1 week ago.

ChrisFairless commented 1 month ago

Hi all,

The last few days I've really been hammering the Data API with hundreds of requests per hour.

I noticed that my client._online() now returns False, meaning I'm not successfully connecting to the API. Is this because of some rate limiting? I don't see anything about it in the documentation.

The number of requests I'm sending is actually very avoidable: almost everything I'm doing is already cached on my local system because I've used these hazards and exposures so many times before.

I would propose extending the Client._request_200 method with a cache check at the start before it makes a request to the API:

if self.cache.enabled:
    cached_result = self.cache.fetch(url, **params)
    if cached_result:
        LOGGER.info("this result is cached, loading from local storage")
        return cached_result

This isn't a perfect solution, since once something is cached the user won't automatically get updated versions of a file if it changes on the API. (So maybe Client could accept a parameter that forces it to refresh the cache with every query, or offer a method that refreshes all the datasets in the cache, or simply re-fetch any cached file that is more than a month old...)
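For the last of those ideas, here's a purely hypothetical sketch of a staleness check (the helper name and the assumption that cached responses sit as files on disk are mine, not necessarily how the cache actually works):

from datetime import datetime, timedelta
from pathlib import Path

MAX_CACHE_AGE = timedelta(days=30)

def is_stale(cached_file: Path, max_age: timedelta = MAX_CACHE_AGE) -> bool:
    """Hypothetical helper: True if the cached file is older than max_age."""
    modified = datetime.fromtimestamp(cached_file.stat().st_mtime)
    return datetime.now() - modified > max_age

The cache check above could then ignore (and so re-fetch) anything for which is_stale(...) is True.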

I don't know enough about API implementations to know if I've interpreted everything here correctly – if this isn't worth bothering with, just say!

ChrisFairless commented 1 month ago

After playing around with this as a possible solution, I see the API client chats with the API more than I thought, and it's not an easy change 😄

emanuel-schmid commented 1 month ago

@ChrisFairless Thanks for the report.

To my knowledge there is no rate limiting in place. However, when the server is under high pressure, like during a DoS attack, responses tend to time out. But 100 requests per hour does not sound like hammering to me! 😁 (If you are the only heavy user, you can easily prevent timeouts by not submitting requests in parallel.)

Can you tell me more about the requests you're submitting, e.g. a list of the methods and arguments you run? And roughly when you ran into trouble?

emanuel-schmid commented 4 weeks ago

Never mind. Looking at the log files, let me note this:

Filtering of datasets by property is at the moment implemented in a way that makes API calls slower, not faster. So if you have to get, e.g., tropical_cyclone datasets for several countries, you're better off getting all tropical_cyclone datasets upfront and then looping through the countries, instead of looping through the countries and getting an individual dataset for each.

Depending on what you're up to, that may not be an option, but if it is, it will make a huge difference to performance.
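Roughly like this (an untested sketch; I'm using country_iso3alpha as the property key here, adjust it to whatever you actually filter on):

from climada.util.api_client import Client

client = Client()

# one request: list all tropical_cyclone datasets upfront
all_tc_infos = client.list_dataset_infos(data_type="tropical_cyclone")

# then filter locally inside the loop instead of hitting the API per country
for country in ["VNM", "PHL", "JPN"]:
    country_infos = [
        info for info in all_tc_infos
        if info.properties.get("country_iso3alpha") == country
    ]
    # ... download / process the datasets for this country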

ChrisFairless commented 4 weeks ago

Ah yeah it could be the parallelisation sending off a lot of requests (with property filters) all at once.

I'm running things country by country so that I don't need to store large amounts of TC data in memory at one time. So one solution for me could be refactoring my code to grab all the TC data it needs in one query right at the start, and then load it from the cache when I need it.

To do that I'm hoping that initialising

from climada.util.api_client import Client

offline_client = Client()
offline_client.online = False

will force things into offline mode and let me work only from the cache.
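In other words, roughly: one online pass to fill the cache, then the offline client above for the actual runs (just a sketch of the plan; I'm assuming get_hazard with a properties filter narrows things down to one dataset per country):

from climada.util.api_client import Client

# one-off online pass: fetch everything once so it lands in the local cache
client = Client()
for country in ["VNM", "PHL", "JPN"]:
    client.get_hazard(
        "tropical_cyclone",
        properties={"country_iso3alpha": country},  # assumed property key
    )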

Thanks for the tips!

emanuel-schmid commented 4 weeks ago

offline_client = Client()
offline_client.online = False

Yes, that should work indeed.