OCHA-DAP / hdx-cli-toolkit

A commandline tool for interacting with HDX with a view to doing bulk updates
MIT License

HDX-10119 Add "scan" functionality for analytical and maintenance operations on all of HDX #39

Closed IanHopkinson closed 4 days ago

IanHopkinson commented 3 weeks ago

Purpose

Version for this PR: 2024.8.2

The aim of this PR is to provide functionality for actions on all datasets in HDX. It calls the package_search endpoint in CKAN directly to download the whole catalogue. Four actions are supported: survey, which counts occurrences of a key; distribution, which calculates the distribution of values for a key; list, which lists the values of a key for each dataset (like the existing list command); and delete_key, which deletes a selected key. This final action was the original purpose of the update, namely to remove resources._csrf_token keys, and it is limited to accepting only the _csrf_token and extras keys as arguments.
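As a rough illustration of the survey and distribution actions, the sketch below pages through package_search-style results and tallies a key. It is not the code in ckan_utilities.py: the function names, the injected fetch_page callable, and the handling of "resources."-prefixed keys are all assumptions made for this example. The only detail taken from CKAN itself is that a package_search result carries "count" and "results" fields and is paged with start and rows.

```python
from collections import Counter
from typing import Any, Callable, Iterable, Iterator


def fetch_all_datasets(
    fetch_page: Callable[[int, int], dict], rows: int = 1000
) -> Iterator[dict]:
    """Yield every dataset by paging through package_search-style results.

    fetch_page(start, rows) is a hypothetical callable that returns the
    "result" object of a package_search response, i.e. a dict with
    "count" (total matches) and "results" (one page of datasets).
    """
    start = 0
    while True:
        page = fetch_page(start, rows)
        results = page["results"]
        yield from results
        start += len(results)
        # Stop on an empty page or once the reported total has been read.
        if not results or start >= page["count"]:
            break


def get_values(dataset: dict, key: str) -> list[Any]:
    """Return the values found for key in one dataset.

    Assumption: a key written as "resources.x" is looked up in each
    resource of the dataset; any other key is looked up at top level.
    """
    if key.startswith("resources."):
        sub_key = key.split(".", 1)[1]
        return [r[sub_key] for r in dataset.get("resources", []) if sub_key in r]
    return [dataset[key]] if key in dataset else []


def survey(datasets: Iterable[dict], key: str) -> int:
    """Count occurrences of key across all datasets."""
    return sum(len(get_values(d, key)) for d in datasets)


def distribution(datasets: Iterable[dict], key: str) -> Counter:
    """Count how often each value of key occurs across all datasets."""
    return Counter(str(v) for d in datasets for v in get_values(d, key))
```

Passing the page fetcher in as a callable keeps the counting logic testable against a small in-memory catalogue, with no request to the live site.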

This PR also investigated a false report of an "extras" key error. The check for the extras key when parsing the traceback is not specific enough and can mask other errors, and the test does not actually exercise the error on the live site.

Major file changes

This PR adds the ckan_utilities.py and test_ckan_utilities.py files which implement the scan functionality. The DEMO.md file is renamed to USERGUIDE.md and updated to include the new command, and reorganised to reflect the more mature status of the project.

Minor file changes

tests/test_hdx_utilities_integration.py appears to have changed a lot but this is just a result of a change in line endings configuration. The new content in this file is the test function test_get_hdx_url_and_key, which tests the new get_hdx_url_and_key function.

Versioning

hdx-cli-toolkit uses the CalVer versioning scheme with format YYYY.MM.Micro, e.g. 2022.12.1, which is updated manually in pyproject.toml. The "Micro" component is simply an integer increased by 1 at each version, starting from 0.
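A minimal sketch of how a YYYY.MM.Micro string could be parsed and compared is shown below. The CalVer class and parse_calver function are hypothetical names for this illustration and are not part of hdx-cli-toolkit. Parsing into integers matters because plain string comparison orders month 10 before month 8.

```python
import re
from typing import NamedTuple


class CalVer(NamedTuple):
    """A YYYY.MM.Micro version as integers, so tuples compare correctly."""
    year: int
    month: int
    micro: int


# Four-digit year, one- or two-digit month, non-negative integer micro.
_PATTERN = re.compile(r"^(\d{4})\.(\d{1,2})\.(\d+)$")


def parse_calver(version: str) -> CalVer:
    """Parse a version such as "2024.8.2"; raise ValueError if malformed."""
    match = _PATTERN.match(version)
    if match is None:
        raise ValueError(f"not a YYYY.MM.Micro version: {version!r}")
    year, month, micro = (int(group) for group in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in version: {version!r}")
    return CalVer(year, month, micro)
```

For example, parse_calver("2024.10.0") > parse_calver("2024.8.2") holds, whereas comparing the raw strings gives the opposite order.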

IanHopkinson commented 2 weeks ago

@turnerm I added a list action to scan which replicates the list command but applies by default to all of HDX. It is more performant than I expected: <10s to show the values of 2 keys for all datasets (based on a cached version of the data). I expect it will run in >1s for you ;-)