Closed lfagliano closed 1 year ago
@lfagliano Thank you for raising this issue. I have talked to our devs and there is no rate limit on the HDX platform. However, the error looks like one related to networking with HDX. Next time it happens, could you please record the full error message and any other information useful for debugging? Is your script something I can take a look at?
Hi! Thanks for your reply! I think the issue occurs just after updating a full country's resources: the datasets do get updated, but an error is raised at the end. When the script starts on the next country in the list, there is no error, so I doubt the error comes from reading a new country.
Some info:
Here is the full traceback:
Traceback (most recent call last):
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\hdxobject.py", line 112, in _read_from_hdx
    result = self.configuration.call_remoteckan(action, data)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\api\configuration.py", line 372, in call_remoteckan
    return self.remoteckan().call_action(*args, **kwargs)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ckanapi\remoteckan.py", line 97, in call_action
    return reverse_apicontroller_action(url, status, response)
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\ckanapi\common.py", line 134, in reverse_apicontroller_action
    raise CKANAPIError(repr([url, status, response]))
ckanapi.errors.CKANAPIError: ['https://data.humdata.org/api/action/package_hxl_update', 500, '{"help": "https://data.humdata.org/api/3/action/help_show?name=package_hxl_update", "error": {"__type": "Internal Server Error", "message": "Internal Server Error"}, "success": false}']

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "Scripts\run_win.py", line 51, in <module>
    facade(main, user_agent_config_yaml = './config/.user_agents.yml',
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\facades\simple.py", line 37, in facade
    projectmainfn()
  File "Scripts\run_win.py", line 44, in main
    file_prep_test.update_all_datasets(Configuration.read(), update_global = True, hrp= True, hrp_list=HRP_23, country = "Malta")
  File "Scripts\file_prep_test.py", line 249, in update_all_datasets
    update_country_dataset(dataset,country,last_friday, dict_for_hrp, hrp=True)
  File "Scripts\file_prep_test.py", line 347, in update_country_dataset
    dataset.update_in_hdx()
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 916, in update_in_hdx
    self._dataset_merge_hdx_update(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 862, in _dataset_merge_hdx_update
    return self._save_dataset_add_filestore_resources(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 702, in _save_dataset_add_filestore_resources
    self.hxl_update()
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\dataset.py", line 1020, in hxl_update
    self._read_from_hdx(
  File "Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\hdx\data\hdxobject.py", line 117, in _read_from_hdx
    raise HDXError(
hdx.data.hdxobject.HDXError: Failed when trying to read: id=a58b1b59-47ab-45ca-b23f-268b75c9b83c! (POST)
As for the script, it is complicated because it is split across multiple files, but here are the functions that handle networking with HDX:
import logging
import os
import pandas as pd
import polars as pl
from datetime import date, timedelta, datetime
from slugify import slugify
import shutil
import openpyxl
import hrp_file_generation
import non_hrp_file_generation
from hdx.data.dataset import Dataset
from hdx.data.resource import Resource
import copy
import platform
def update_country_dataset(dataset, country, last_friday, df_dict, hrp=False):
    """Update existing country dataset with updated resources

    Args:
        dataset: HDXObject
            HDX dataset
        country: str
            Country being updated
        last_friday: str
            Last Friday date, the result of get_last_friday
        df_dict: dict
            Dict with file, HRP or non-HRP.
        hrp: bool
            Option whether to update HRP files.

    Returns:
        Updated country dataset
    """
    # Get existing resources
    resources = dataset.get_resources()
    # Change to windows
    if platform.system() == 'Windows':
        acled_countries = pl.read_csv('acledcountries.csv')
    else:
        acled_countries = pl.read_csv('acledcountries.csv')
    min_year = acled_countries.filter(pl.col("country") == country)["start_year"].item()
    for resource in resources:
        if "month-year" in resource['name']:
            if hrp is True:
                if "_HRP_" in resource['name']:
                    name = resource['name'].replace(f'{country.lower()}_HRP_', '')
                else:
                    name = resource['name'].replace(f'{country.lower()}_hrp_', '')
            else:
                name = resource['name'].replace(f'{country.lower()}_', '')
        else:
            continue
        as_of = date.today().strftime('%d%b%Y')
        if hrp is True:
            new_file_path = f'./save_temp_resources/{country.lower()}_HRP_{name}_as-of-{as_of}.xlsx'
        else:
            new_file_path = f'./save_temp_resources/{country.lower()}_{name}_as-of-{as_of}.xlsx'
        # Convert dataframes to XLSX with ACLED's template
        save_stuff(df=df_dict[name].filter(pl.col("Country") == country), new_file_path=new_file_path)
        resource.set_file_to_upload(file_to_upload=new_file_path)
    dataset.set_reference_period(datetime.strptime(f'{min_year}-01-01', '%Y-%m-%d'), datetime.combine(last_friday, datetime.min.time()), False)
    dataset.update_in_hdx()
    print(f'{country} dataset successfully updated.')
def update_all_datasets(config, update_global=True, hrp=True, country=None, hrp_list=None):
    """Update country datasets, or create them if they don't exist; update global datasets

    country is the next country to upload. For that, go to data.humdata.org and
    check which datasets need to be updated. Workaround for the POST id problem.

    Args:
        config: path, config
            HDX config path, where we have stored the configuration of our datasets.
        update_global: bool
            Option on whether to upload the global files.
        hrp: bool
            Option on whether to also include HRP countries in the upload.
        country: str
            Country from which to continue the upload.
        hrp_list: list
            List of HRP countries

    Returns:
        Updated files in HDX
    """
    HRP_23 = hrp_list
    # follow is needed for both HRP and non-HRP file generation
    if country is not None:
        follow = True
    else:
        follow = False
    if hrp == True:
        dict_for_hrp, global_hrp = hrp_file_generation.generate_hrp_files(hrp_list=HRP_23, follow_up=follow)
    dict_for_countries, global_dict = non_hrp_file_generation.generate_non_hrp_files(hrp_list=HRP_23, follow_up=follow)
    # Get date of previous Friday for updating dataset date ranges
    last_friday = get_last_friday()
    # Get country list from ACLED's master country list.
    country_list = get_countries('credential1', 'credential2')
    # Use this to continue from where it failed (POST id problem)
    if country is not None:
        country_list = country_list[country_list.index(country):]
    # Create folder for files
    if not os.path.exists('./save_temp_resources'):
        os.mkdir('./save_temp_resources')
    if hrp == True:
        for country in country_list:
            # Check if the country is an HRP country; if so, upload the special dataset.
            if country.lower() in HRP_23:
                dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')
                update_country_dataset(dataset, country, last_friday, dict_for_hrp, hrp=True)
            else:
                dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')
                if dataset:
                    update_country_dataset(dataset, country, last_friday, dict_for_countries)
                else:
                    create_country_dataset(country, last_friday, dict_for_countries)
        if update_global == True:
            global_datasets = ["demonstration-events", "political-violence-events-and-fatalities", "civilian-targeting-events-and-fatalities"]
            for dataset in global_datasets:
                update_global_dataset(hrp_global=global_hrp, dict_global=global_dict, type=dataset, last_friday=last_friday)
    else:
        for country in country_list:
            # if country.lower() in HRP_23:
            #     continue
            # else:
            dataset = Dataset.read_from_hdx(f'{slugify(country.lower())}-acled-conflict-data')
            if dataset:
                update_country_dataset(dataset, country, last_friday, dict_for_countries)
            else:
                create_country_dataset(country, last_friday, dict_for_countries)
        if update_global == True:
            global_datasets = ["demonstration-events", "political-violence-events-and-fatalities", "civilian-targeting-events-and-fatalities"]
            for dataset in global_datasets:
                update_global_dataset(hrp_global=global_hrp, dict_global=global_dict, type=dataset, last_friday=last_friday)
    if os.path.exists('./save_temp_resources'):
        shutil.rmtree('./save_temp_resources')
HRP_23 = [
"afghanistan",
"burkina faso",
"burundi",
"cameroon",
"central african republic",
"chad",
"colombia",
"democratic republic of congo",
"ethiopia",
"haiti",
"mali",
"mozambique",
"myanmar",
"niger",
"nigeria",
"palestine",
"somalia",
"south sudan",
"sudan",
"syria",
"ukraine",
"venezuela",
"yemen"]
def main():
    # Update all datasets (global last)
    file_prep.update_all_datasets(Configuration.read(), update_global=True, hrp=True, hrp_list=HRP_23, country="Niue")

if __name__ == '__main__':
    facade(main, user_agent_config_yaml='./config/.user_agents.yml',
           hdx_config_yaml='./config/.hdx_config_yaml.yml',
           project_config_yaml='./config/.project_configuration.yml')
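The script calls a get_last_friday helper that isn't shown in the paste. Its behavior is not confirmed by the source, but a minimal sketch of what such a function might look like, given how its result is used with datetime.combine, is:

```python
from datetime import date, timedelta

def get_last_friday(today=None):
    """Return the most recent Friday on or before the given date (hypothetical sketch)."""
    if today is None:
        today = date.today()
    # date.weekday(): Monday is 0, Friday is 4
    offset = (today.weekday() - 4) % 7
    return today - timedelta(days=offset)
```

For example, for Wednesday 2023-06-14 this returns Friday 2023-06-09, and for a Friday it returns the same day.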
Would this be of any help?
@lfagliano It may be necessary to add a small delay between dataset creates/updates, but there are some other things to try first.
I don't think you need separate flows for create and update; you can combine them into one. After setting up the dataset, including giving it the appropriate name, e.g. f'{slugify(country.lower())}-acled-conflict-data', you can call dataset.create_in_hdx. That reads any existing dataset and updates it, or creates a new dataset if none exists. You can use the create call with some additional parameters:

dataset.create_in_hdx(
    remove_additional_resources=True,
    hxl_update=False,
)

remove_additional_resources=True ensures that if you change the resources in a dataset, old ones that aren't being updated are not left behind (assuming that's what you want). hxl_update=False turns off updating QuickCharts. I looked at a couple of ACLED datasets on HDX and there are no QuickCharts, so turning this off removes one extra call to HDX.
These steps should hopefully reduce the number of calls to HDX per country so that the script completes successfully; if not, adding delays may be necessary.
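The advice above can be sketched as a small wrapper. Note that upsert_country_dataset is a hypothetical name, not part of the original script, and the delay is only the optional fallback mentioned above:

```python
import time

def upsert_country_dataset(dataset, delay_seconds=1.0):
    """Create-or-update a prepared dataset in one call (sketch, per the advice above)."""
    # create_in_hdx reads any existing dataset with this name and updates it,
    # or creates a new dataset if none exists
    dataset.create_in_hdx(
        remove_additional_resources=True,  # drop old resources not re-uploaded
        hxl_update=False,                  # skip the QuickCharts update call
    )
    # optional pause between datasets, in case delays turn out to be needed
    time.sleep(delay_seconds)
```

This replaces the separate Dataset.read_from_hdx / update_in_hdx / create branch in the original script with a single call per country.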
@lfagliano Have you been able to resolve the problem? If so, then please close this issue.
Hi @mcarans
Sorry for not answering before. The changes you proposed greatly reduced the errors. However, I have sporadically seen the same errors appear (though not in the same countries as before), which I assume is a matter of connection stability at the moment.
I was hoping to see how the issue evolved to come back with an answer. In short, it looks like the problem was fixed. If the errors become more constant, I can reopen the issue.
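For the remaining sporadic 500s, one option (not part of the original script, and the names here are illustrative) is to wrap the per-country update in a simple retry with a delay, since the original report notes that waiting about a minute and re-running usually succeeds:

```python
import time

def with_retries(fn, attempts=3, base_delay=60):
    """Call fn(), retrying on failure with a linearly growing delay (sketch)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # wait before retrying; 60s matches the "wait a minute" observation
            time.sleep(base_delay * attempt)
```

Usage would be something like with_retries(lambda: update_country_dataset(dataset, country, last_friday, dict_for_countries)). Catching a narrower exception type such as HDXError would be preferable in practice.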
@lfagliano HDX Python API 6.2.0 now automatically rate limits calls to HDX (which now has rate limits).
Hello!
Thank you for developing the package, it aids quite well in uploading data to HDX.
Every week, I upload around 200 datasets to HDX. However, each week I encounter the same error, and I don't know where it comes from unless there are rate limits.
The error is this:
hdx.data.hdxobject.HDXError: Failed when trying to read: id=e1df3a45-5052-4ef0-bc68-12d887286d35! (POST)
(the id number varies). It appears at different points in the script, just before a country is processed by get_resources(). This has been an issue for some weeks (even months), and I have started to notice that it happens after a specific number of countries are uploaded, which strengthens my impression that rate limits are the culprit. Furthermore, waiting a minute and then re-running the script usually solves the problem.
Yet, I couldn't find any documentation about rate limits. Hence my question: are there rate limits?