epogrebnyak / weo-reader

Python client to read IMF World Economic Outlook (WEO) dataset as pandas dataframe.

may use SDMX interface #39

Open epogrebnyak opened 6 months ago

epogrebnyak commented 6 months ago

See the tools developed by the @ONEcampaign, @jm-rivera, @lpicci96 team:

https://github.com/ONEcampaign/bblocks/blob/main/bblocks/import_tools/imf_weo.py

lpicci96 commented 6 months ago

Thanks for opening this issue, @epogrebnyak.

The SDMX approach has some advantages, most importantly the standardisation of data and metadata. It also brings national and regional data together, which the package currently lacks. There are some limitations. SDMX data is not available before 2017, which can be a limiting factor because a big advantage of the package is that it allows access to historical releases. To keep this advantage you would need to integrate the SDMX and xls data. I also came across an issue with one of the 2021 releases (corrupted files), and I'm not sure whether the IMF team has fixed the files.
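
One way to combine the two sources, sketched under the assumption stated above that SDMX files exist only for releases from 2017 onwards, is to route each requested release to a source by year. The function name `pick_source` and the labels `"sdmx"` and `"xls"` are hypothetical and are not part of weo-reader or bblocks.

```python
# Assumption taken from this comment, not checked against the IMF website:
# SDMX files are published for releases from 2017 onwards.
SDMX_FIRST_YEAR = 2017


def pick_source(year: int) -> str:
    """Return the (hypothetical) source label for a WEO release year."""
    return "sdmx" if year >= SDMX_FIRST_YEAR else "xls"
```

A dispatcher like this would keep historical releases reachable through the xls files while newer releases use SDMX.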

In terms of implementation, this could be problematic because the field names differ slightly between the SDMX data and the xls data. The package also does some renaming and reformatting that would need to be refactored. I'm not sure of everything that would need to change, but I imagine there would be some breaking changes to the UI. One example is the country functionality. The data downloaded by the package also has a Country column, which isn't suitable for holding country and regional data together. Generating and handling iso3 codes would also need to be amended to handle regions.

In terms of the priorities for our work at the ONE Campaign, we are most interested in the data extraction step, which becomes a component of our ETL. We also need to react quickly to new releases, so we would still likely rely on some of our own tooling in that process in case the tool breaks or needs maintenance. The advantages of weo-reader are 1. access to historical releases 2. user-friendly interaction with the data 3. the potential for more advanced analysis tools (which could make the package eligible for JOSS).

To benefit both of our purposes, I propose that I repackage the tool we created into a thin API for the SDMX data, which weo-reader can wrap. This would make it easier to start integrating the SDMX data while keeping all the existing functionality.

There are some other enhancements to weo-reader that I think could be interesting to pursue. One is of course the addition of regional data, even through the xls files. The other is the handling of downloaded data. Having all the raw data saved to disk is useful, but at times users don't need the raw data file and may prefer to skip the download step. There are ways to bypass saving the data to disk and to cache it to prevent redundant downloads. There could be a save_to_disk method of some kind to save the raw data. I think this could be a useful feature and I'm happy to help.
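
A minimal sketch of that idea, assuming an in-memory cache keyed by URL and an explicit `save_to_disk` step. The names `RawDataCache` and `fetch_bytes` are hypothetical and only `save_to_disk` echoes the name suggested above; none of this is existing weo-reader code.

```python
from pathlib import Path
from typing import Callable, Dict
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Download a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


class RawDataCache:
    """Keep downloaded payloads in memory; write to disk only on request."""

    def __init__(self, fetch: Callable[[str], bytes] = fetch_bytes):
        self._fetch = fetch  # injected so it can be replaced in tests
        self._store: Dict[str, bytes] = {}

    def get(self, url: str, refresh: bool = False) -> bytes:
        # Download only if the URL is not cached yet or a refresh is requested.
        if refresh or url not in self._store:
            self._store[url] = self._fetch(url)
        return self._store[url]

    def save_to_disk(self, url: str, path: str) -> Path:
        # Reuses the cached payload, so saving does not trigger a second download.
        target = Path(path)
        target.write_bytes(self.get(url))
        return target
```

Users who never call `save_to_disk` would not get a raw file on disk, and repeated `get` calls for the same URL would reuse the first download.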

Let me know your thoughts on my proposal and whether you have other ideas.

epogrebnyak commented 6 months ago

All good ideas. What is the entry point for SDMX and how is it documented?

lpicci96 commented 6 months ago

The releases come with an SDMX Data Structure Definition. I would start there. This helper class we created parses the data into a dataframe. You can look at our implementation there.

epogrebnyak commented 6 months ago

> You can look at our implementation there

Is the main action happening here? The SDMX is a zip file and then you process it into a dataframe? Is it roughly URL -> ZipFile -> pd.DataFrame?

https://github.com/ONEcampaign/bblocks/blob/93da6b0175c0efdf9530826b64f747d5d6085d8e/bblocks/import_tools/imf_weo.py#L142-L149
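
The middle of that chain can be sketched with the standard library alone: downloaded bytes are wrapped in a `ZipFile` in memory and the XML members are read out. This is an illustration, not the bblocks code; the helper names `zip_from_bytes` and `read_members` are hypothetical, and the `.xml` suffix is an assumption about what the archive contains. Turning the XML into a `pd.DataFrame` is left to a parser and is not shown.

```python
import io
import zipfile
from typing import Dict


def zip_from_bytes(payload: bytes) -> zipfile.ZipFile:
    """Wrap downloaded bytes in a ZipFile without touching the disk."""
    return zipfile.ZipFile(io.BytesIO(payload))


def read_members(archive: zipfile.ZipFile, suffix: str = ".xml") -> Dict[str, bytes]:
    """Return the contents of every archive member whose name ends with `suffix`."""
    return {
        name: archive.read(name)
        for name in archive.namelist()
        if name.lower().endswith(suffix)
    }
```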

lpicci96 commented 6 months ago

Partly yes. The full extraction pipeline is run by the function extract_data, which takes a WEO version as a parameter. First it finds the href for the SDMX data. We use some web scraping instead of hardcoding the URL in case it changes. If the href is found, it makes a request and stores the response content as a ZipFile object. The ZipFile object is then parsed using the Parser helper class to get the data as a DataFrame. extract_data is injected into the WEO class, which is the main UI. When the data needs to be downloaded (either because it has never been downloaded or because the user wants to refresh), the extraction function is called.
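
Two parts of this description can be sketched without network access: scraping a page for the SDMX link, and injecting the extraction function into the front-end class. This is a rough illustration under stated assumptions, not the bblocks implementation. `LinkCollector`, `find_href`, and `WEOSketch` are hypothetical names, and matching links on the keyword "SDMX" in the link text is an assumption about the page layout.

```python
from html.parser import HTMLParser
from typing import Any, Callable, List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_href(html: str, keyword: str = "SDMX") -> Optional[str]:
    """Return the first href whose link text contains `keyword`, else None."""
    collector = LinkCollector()
    collector.feed(html)
    for href, text in collector.links:
        if keyword.lower() in text.lower():
            return href
    return None


class WEOSketch:
    """Front end that receives its extraction function from outside."""

    def __init__(self, version: str, extract: Callable[[str], Any]) -> None:
        self.version = version
        self._extract = extract
        self._data: Any = None

    def data(self, refresh: bool = False) -> Any:
        # Call the injected function when nothing is stored yet or on refresh.
        if self._data is None or refresh:
            self._data = self._extract(self.version)
        return self._data
```

Passing the extraction function in as an argument means the front end can be exercised with a stub, and the scraping logic can change without touching the class.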