OCHA-DAP / hdx-python-api

Python API for interacting with the HDX Data Portal
http://data.humdata.org
MIT License

Are there rate limits in the API? #44

Closed lfagliano closed 1 year ago

lfagliano commented 1 year ago

Hello!

Thank you for developing the package; it makes uploading data to HDX much easier.

Every week I upload around 200 datasets to HDX, and every week I encounter the same error. I don't know exactly where it comes from, unless there are rate limits.

The error is: hdx.data.hdxobject.HDXError: Failed when trying to read: id=e1df3a45-5052-4ef0-bc68-12d887286d35! (POST) (the id varies from run to run)

The error occurs at different points in the script, just before a country is processed by get_resources(). This has been happening for weeks (even months), and I have noticed that it tends to occur after a specific number of countries have been uploaded, which strengthens my impression that rate limits are the culprit. Usually, waiting a minute and then re-running the script solves the problem.

Yet I couldn't find any documentation about rate limits, hence my question: are there rate limits?

mcarans commented 1 year ago

@lfagliano Thank you for raising this issue. I have talked to our devs and there is no rate limit on the HDX platform. However, the error looks like a networking issue with HDX. Next time it happens, could you please record the full error message and any other information useful for debugging? Also, is your script something I could take a look at?
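
In the meantime, one way to make sure the full error is captured is to log exceptions to a file. A minimal sketch (the log filename and the main entry point are placeholders for your setup):

import logging

logging.basicConfig(filename="hdx_upload.log", level=logging.INFO)

try:
    main()  # your existing entry point
except Exception:
    logging.exception("HDX call failed")  # records the full chained traceback
    raise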

lfagliano commented 1 year ago

Hi! Thanks for your reply! I think the issue occurs just after a full country's resources have been updated: the datasets do get updated, but an error is raised at the end. When I start on the next country in the list, there is no error, so I doubt the error comes from reading a new country.

Here is the full traceback:

Traceback (most recent call last):
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\hdxobject.py", line 112, in _read_from_hdx
    result = self.configuration.call_remoteckan(action, data)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\api\configuration.py", line 372, in call_remoteckan
    return self.remoteckan().call_action(*args, **kwargs)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ckanapi\remoteckan.py", line 97, in call_action
    return reverse_apicontroller_action(url, status, response)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ckanapi\common.py", line 134, in reverse_apicontroller_action
    raise CKANAPIError(repr([url, status, response]))
ckanapi.errors.CKANAPIError: ['https://data.humdata.org/api/action/package_hxl_update', 500, '{"help": "https://data.humdata.org/api/3/action/help_show?name=package_hxl_update", "error": {"__type": "Internal Server Error", "message": "Internal Server Error"}, "success": false}']

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "Scripts\run_win.py", line 51, in <module>
    facade(main, user_agent_config_yaml = './config/.user_agents.yml',
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\facades\simple.py", line 37, in facade
    projectmainfn()
  File "Scripts\run_win.py", line 44, in main
    file_prep_test.update_all_datasets(Configuration.read(), update_global = True, hrp= True, hrp_list=HRP_23, country = "Malta")
  File "Scripts\file_prep_test.py", line 249, in update_all_datasets
    update_country_dataset(dataset,country,last_friday, dict_for_hrp, hrp=True)
  File "Scripts\file_prep_test.py", line 347, in update_country_dataset
    dataset.update_in_hdx()
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 916, in update_in_hdx
    self._dataset_merge_hdx_update(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 862, in _dataset_merge_hdx_update
    return self._save_dataset_add_filestore_resources(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 702, in _save_dataset_add_filestore_resources
    self.hxl_update()
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 1020, in hxl_update
    self._read_from_hdx(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\hdxobject.py", line 117, in _read_from_hdx
    raise HDXError(
hdx.data.hdxobject.HDXError: Failed when trying to read: id=a58b1b59-47ab-45ca-b23f-268b75c9b83c! (POST)

As for the script, it is complicated because it is split across multiple files, but here are the functions that handle networking with HDX:

import logging
import os
import pandas as pd
import polars as pl
from datetime import date, timedelta, datetime
from slugify import slugify
import shutil
import openpyxl
import hrp_file_generation
import non_hrp_file_generation
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset
from hdx.data.resource import Resource
from hdx.facades.simple import facade
import copy
import platform

def update_country_dataset(dataset, country, last_friday, df_dict, hrp = False):
    """Update existing country dataset with updated resources

    Parameters 
    ----------

    Args: 
        dataset: HDXObject
            HDX dataset

        country: str
            Country being updated

        last_friday: str
            Last friday date, the result of get_last_friday

        df_dict: dict
            Dict with file, HRP or non-HRP. 

        hrp: bool
            Option whether to update hrp files. 

    Returns:
    Updated country dataset

    """
    # Get existing resources
    resources = dataset.get_resources()

    # The source file path is currently the same on Windows and other
    # platforms, so no branching on platform.system() is needed
    acled_countries = pl.read_csv('acledcountries.csv')

    min_year = acled_countries.filter(pl.col("country") == country)["start_year"].item()

    for resource in resources:
        if "month-year" in resource['name']:
            if hrp is True:
                if "_HRP_" in resource['name']:
                    name = resource['name'].replace(f'{country.lower()}_HRP_', '')
                else:
                    name = resource['name'].replace(f'{country.lower()}_hrp_', '')
            else:
                name = resource['name'].replace(f'{country.lower()}_', '')
        else:
            continue

        as_of = date.today().strftime('%d%b%Y')

        if hrp is True:
            new_file_path = f'./save_temp_resources/{country.lower()}_HRP_{name}_as-of-{as_of}.xlsx'
        else:
            new_file_path = f'./save_temp_resources/{country.lower()}_{name}_as-of-{as_of}.xlsx'

        # Convert dataframes to XLSX with ACLED's Template
        save_stuff(df = df_dict[name].filter(pl.col("Country") == country), new_file_path = new_file_path)

        resource.set_file_to_upload(file_to_upload=new_file_path)

    dataset.set_reference_period(datetime.strptime(f'{min_year}-01-01', '%Y-%m-%d'), datetime.combine(last_friday, datetime.min.time()), False)

    dataset.update_in_hdx()

    print(f'{country} dataset successfully updated.')
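
# get_last_friday and save_stuff are helpers defined elsewhere in the full
# script and are not shown here. A plausible sketch of get_last_friday (an
# assumption, not the original), returning the most recent Friday:
def get_last_friday():
    today = date.today()
    return today - timedelta(days=(today.weekday() - 4) % 7)
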
def update_all_datasets(config, update_global = True, hrp = True, country = None, hrp_list = None):
    """Update country datasets or create if dataset doesn't exist; update global datasets

    country is the next country to upload. For that, go to data.humddata , and check the datasets that needs to be updated. Workaround to POST id problem.

    # dict_for_countries & dict_from_hrp

    Parameters
    ----------
    Args:
        config: path, config
            HDX config path, where we have stored the configuration of our datasets. 
        update_global: bool
            Option on whether to upload the global files. 
        hrp: bool
            Option on whether to also include HRP countries in the upload. 
        country: str
            Country where to continue the upload. 
        hrp_list: list
            List of HRP countries

    Returns: 
    Updated files in HDX

    """

    HRP_23 = hrp_list

    if hrp:
        follow = country is not None

        dict_for_hrp, global_hrp = hrp_file_generation.generate_hrp_files(hrp_list=HRP_23, follow_up=follow)

        dict_for_countries, global_dict = non_hrp_file_generation.generate_non_hrp_files(hrp_list = HRP_23, follow_up=follow) 

    # Get date of previous Friday for updating dataset date ranges
    last_friday = get_last_friday()

    # Get country list from ACLED's master country list. 
    country_list = get_countries('credential1', 'credential2')

    # Use this to continue from where it failed (POST id problem)
    if country is not None:
        country_list = country_list[country_list.index(country):]

    # Create folder for files
    if not os.path.exists('./save_temp_resources'):
        os.mkdir('./save_temp_resources')

    if hrp:
        for country in country_list:
            # Checking if the country is an HRP, if so, we upload the special dataset. 
            if country.lower() in HRP_23:
                dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')
                update_country_dataset(dataset,country,last_friday, dict_for_hrp, hrp=True)
            else:
                dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')

                if dataset:
                    update_country_dataset(dataset, country, last_friday, dict_for_countries)
                else:
                    create_country_dataset(country, last_friday, dict_for_countries)

        if update_global:
            global_datasets = ["demonstration-events", "political-violence-events-and-fatalities", "civilian-targeting-events-and-fatalities"]

            for dataset in global_datasets:
                update_global_dataset(hrp_global = global_hrp, dict_global = global_dict, type= dataset, last_friday = last_friday)
    else:
        for country in country_list:
            dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')

            if dataset:
                update_country_dataset(dataset, country, last_friday, dict_for_countries)
            else:
                create_country_dataset(country, last_friday, dict_for_countries)

        if update_global:
            global_datasets = ["demonstration-events", "political-violence-events-and-fatalities", "civilian-targeting-events-and-fatalities"]

            for dataset in global_datasets:
                update_global_dataset(hrp_global = global_hrp, dict_global = global_dict, type= dataset, last_friday = last_friday)

    if os.path.exists('./save_temp_resources'):
        shutil.rmtree('./save_temp_resources')

HRP_23 = [
    "afghanistan",
    "burkina faso",
    "burundi",
    "cameroon",
    "central african republic",
    "chad",
    "colombia",
    "democratic republic of congo",
    "ethiopia",
    "haiti",
    "mali",
    "mozambique",
    "myanmar",
    "niger",
    "nigeria",
    "palestine",
    "somalia",
    "south sudan",
    "sudan",
    "syria",
    "ukraine",
    "venezuela",
    "yemen",
]

def main():
    # Update all datasets (global last)
    update_all_datasets(Configuration.read(), update_global=True, hrp=True, hrp_list=HRP_23, country="Niue")

if __name__ == '__main__':
    facade(main, user_agent_config_yaml='./config/.user_agents.yml',
           hdx_config_yaml='./config/.hdx_config_yaml.yml',
           project_config_yaml='./config/.project_configuration.yml')

Would this be of any help?

mcarans commented 1 year ago

@lfagliano It may be necessary to add a small delay between dataset creates/updates, but there are some other things to try first.

I don't think you need separate flows for create and update; you can combine them into one. After setting up the dataset, including giving it the appropriate name, e.g. f'{slugify(country.lower())}-acled-conflict-data', you can call dataset.create_in_hdx. That reads any existing dataset and updates it, or creates a new dataset if none exists. You can pass the create call some additional parameters:

                    dataset.create_in_hdx(
                        remove_additional_resources=True,
                        hxl_update=False,
                    )

remove_additional_resources=True ensures that if you change the resources in a dataset, old ones that aren't being updated are not left behind (assuming that's what you want).

hxl_update=False turns off updating QuickCharts. I looked at a couple of ACLED datasets on HDX and there are no QuickCharts, so turning this off removes one extra call to HDX.
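
Putting these together, a minimal sketch of the combined flow (everything beyond the name is a placeholder for whatever your full script already sets):

from hdx.data.dataset import Dataset
from slugify import slugify

# Inside the per-country loop; all metadata except the name is elided here.
dataset = Dataset({"name": f"{slugify(country.lower())}-acled-conflict-data"})
# ... set title, resources, file uploads and reference period as before ...
dataset.create_in_hdx(
    remove_additional_resources=True,
    hxl_update=False,
)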

These steps should hopefully reduce the number of calls per country to HDX such that the script completes successfully, but if not, then adding delays may be necessary.
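
If delays do turn out to be needed, a minimal sketch (the five-second pause is just an assumed starting point to tune):

import time

DELAY_SECONDS = 5  # assumed value; adjust based on observed failures

for country in country_list:
    # ... build the dataset and set file uploads as above ...
    dataset.create_in_hdx(remove_additional_resources=True, hxl_update=False)
    time.sleep(DELAY_SECONDS)  # space out calls to HDX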

mcarans commented 1 year ago

@lfagliano Have you been able to resolve the problem? If so, then please close this issue.

lfagliano commented 1 year ago

Hi @mcarans

Sorry for not answering sooner. The changes you proposed got rid of most of the errors. However, I have sporadically seen the same errors appear (though not in the same countries as before), which I assume is now a matter of connection stability.

I was waiting to see how the issue evolved before coming back with an answer. In short, it looks like the problem is fixed. If the errors become more frequent, I can reopen the issue.
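
For the remaining sporadic failures, a simple retry wrapper may be enough. A minimal sketch, assuming the failures surface as HDXError:

import time

from hdx.data.hdxobject import HDXError

def with_retries(func, attempts=3, wait=60):
    # Retry a flaky HDX call a few times, pausing between attempts.
    for attempt in range(attempts):
        try:
            return func()
        except HDXError:
            if attempt == attempts - 1:
                raise
            time.sleep(wait)

# e.g. with_retries(lambda: dataset.create_in_hdx(remove_additional_resources=True, hxl_update=False))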

mcarans commented 9 months ago

@lfagliano HDX Python API 6.2.0 now automatically rate limits calls to HDX (which now has rate limits).