StatSocAus / oceaniaR-hack

OceaniaR Hackathon 2024

Improved Package for R SDMX Integration for Pacific Data #12

Open deanmarchiori opened 1 month ago

deanmarchiori commented 1 month ago

{rsdmx} is an existing library for parsing/reading SDMX data and metadata in R. See here: https://github.com/opensdmx/rsdmx
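For orientation, a minimal sketch of the current workflow with {rsdmx} against the Pacific Data Hub .Stat API. The dataflow ID and key below are illustrative assumptions, not verified endpoints:

```r
# Minimal sketch: read a PDH .Stat dataflow with {rsdmx} and flatten it.
# NOTE: the dataflow ID (DF_POP_PROJ) and the key/query are illustrative
# assumptions about the PDH endpoints, not tested values.
library(rsdmx)

url <- paste0(
  "https://stats-sdmx-disseminate.pacificdata.org/rest/data/",
  "SPC,DF_POP_PROJ,1.0/A..?startPeriod=2020"
)
sdmx <- readSDMX(url)      # parse the SDMX-ML response
df   <- as.data.frame(sdmx)  # flatten to a data.frame for analysis
head(df)
```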

This package is extensive but not tailored for use by Pacific Statistical Agencies and experts.

Could a fork or a domain-specific implementation of this tool be built, specifically to improve usability for Pacific statistical data?

Further Context

Existing Help docs: https://docs.pacificdata.org/dotstat/plugins/r

deanmarchiori commented 1 month ago

Could the {validate} package be integrated for further data validation checks?
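For instance, a hedged sketch of what that integration could look like. The column names are assumptions about a typical .Stat extract; real rules would ideally be derived from the DSD and its codelists:

```r
# Sketch: run declarative checks over an SDMX extract with {validate}.
# Column names (OBS_VALUE, GEO_PICT) are assumed, not guaranteed.
library(validate)

rules <- validator(
  value_is_numeric = is.numeric(OBS_VALUE),
  value_nonneg     = OBS_VALUE >= 0,
  geo_not_missing  = !is.na(GEO_PICT)
)
out <- confront(df, rules)  # df: a data.frame from rsdmx::as.data.frame()
summary(out)
```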

deanmarchiori commented 1 month ago

Comment from The Pacific Community staff members:

Another (and probably better) idea is a more robust set of wrapper functions around the rsdmx tools for .Stat. In particular I find myself using this complete hack of a function: https://github.com/PacificCommunity/sdd-analysis-misc/blob/main/R/pdh_get_codelists.R to get the codelists from the metadata. It's brittle and not really using the API as intended, but I've not had time to do a better version. Ideally I'd want to specify a PDH data flow and get it and its code lists as one.
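One hedged sketch of the "dataflow plus its codelists in one request" idea, assuming the standard SDMX REST structure endpoints that .Stat exposes. The agency, DSD, and codelist IDs are illustrative placeholders:

```r
# Sketch: fetch a data structure definition with its child references, then
# pull codelists out of it, instead of scraping metadata ad hoc.
# IDs (SPC, DSD_POP_PROJ, CL_COM_GEO_PICT) are illustrative assumptions.
library(rsdmx)

pdh <- "https://stats-sdmx-disseminate.pacificdata.org/rest"
dsd <- readSDMX(paste0(pdh, "/datastructure/SPC/DSD_POP_PROJ/latest",
                       "?references=children"))

cls <- slot(dsd, "codelists")   # codelists shipped alongside the DSD
as.data.frame(cls, codelistId = "CL_COM_GEO_PICT")
```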

gvdr commented 4 weeks ago

Bringing some detail and love to this issue :-)

We at the Pacific Community | Communauté du Pacifique disseminate a treasure trove of Pacific data through our SPC .Stat portal. The data is available in a very robust model, SDMX, adopted globally by many other international organizations (UN, WHO, ILO, Eurostat, ...); the datasets are accessible through a well-documented REST(ish) API and a Data Explorer web interface.

SDKs for SDMX exist in various languages, including a couple in R. However, they tend to be very general and geared toward the SDMX-savvy user: some simple tasks are harder than one would like (e.g., collecting all the data for a specific country or set of countries, extracting metadata in a tidy format, ...). A package covering the gap left between {rsdmx} and the users would be great.
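As a concrete (hypothetical) example of that gap: a helper like the one below does not exist yet, and its name, key layout, and flow ID are assumptions, but it is the kind of one-call interface users keep rebuilding by hand:

```r
# Hypothetical wrapper: all observations for a set of Pacific countries from
# one dataflow. The key layout ("A.<GEO>.") is an assumption about the DSD,
# and pdh_get() is a sketch, not an existing function.
pdh_get <- function(flow, countries, start = NULL) {
  key <- paste0("A.", paste(countries, collapse = "+"), ".")
  url <- sprintf(
    "https://stats-sdmx-disseminate.pacificdata.org/rest/data/SPC,%s,1.0/%s%s",
    flow, key,
    if (is.null(start)) "" else paste0("?startPeriod=", start)
  )
  as.data.frame(rsdmx::readSDMX(url))
}

# e.g. pdh_get("DF_POP_PROJ", c("FJ", "WS"), start = 2015)
```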

Such a package could then become the building ground for many other projects (Shiny integration, Quarto templates, ...) and would have an incredible impact on the Pacific community of users.

As for what might be needed to build it (think in terms of things you might learn, rather than prerequisites for the project):

MilesMcBain commented 3 weeks ago

This sounds like my kind of fun!

gvdr commented 2 weeks ago

Thanks to the invaluable work of @shandiya @MilesMcBain and @lawremi the project is now pushing forward here: https://github.com/PacificCommunity/pdh-stat-pecheuse

gvdr commented 2 weeks ago

PS: in the morning-after state of mind, I realised that what we were trying to build here is a "tidysdmx" kind of package. Which I believe would be something fantastic to have! It would be a huge contribution to the whole official statistics community.

MilesMcBain commented 2 weeks ago

Yeah I agree that there’s a lot of work to get to the kind of interface we discussed that would be immediately useful outside the Pacific. Hooray for standards!

I'm still unable to reconcile myself to choosing between going all in on something like BioC's DataFrame with a separate metadata data.frame: https://carpentries-incubator.github.io/bioc-project/05-s4.html#metadata-columns

Or trying to pack metadata into a regular tibble somehow.

One thought I had is that if you veer into custom compound objects you may create challenges on the serialisation / deserialisation side.
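A small sketch of that trade-off: metadata tucked into tibble attributes survives R-native serialisation but silently falls off on round-trips through flat formats. Everything below is plain base R / {tibble} behaviour, not a proposed interface:

```r
# Attributes survive saveRDS()/readRDS() but not a CSV round-trip.
library(tibble)

dat <- tibble(GEO_PICT = c("FJ", "WS"), OBS_VALUE = c(1, 2))
attr(dat, "codelist_GEO_PICT") <- c(FJ = "Fiji", WS = "Samoa")

rds <- tempfile(fileext = ".rds")
saveRDS(dat, rds)
attr(readRDS(rds), "codelist_GEO_PICT")   # still there

csv <- tempfile(fileext = ".csv")
write.csv(dat, csv, row.names = FALSE)
attr(read.csv(csv), "codelist_GEO_PICT")  # NULL: metadata lost
```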

gvdr commented 2 weeks ago

> I'm still unable to reconcile myself to choosing between going all in on something like BioC's DataFrame with a separate metadata data.frame: https://carpentries-incubator.github.io/bioc-project/05-s4.html#metadata-columns
>
> Or trying to pack metadata into a regular tibble somehow.

I agree with the way you lean toward a simple tibble (or tibble+, but keeping full compatibility with tibble). If the idea is to open this up as much as possible to tidy-data workflows, that is the way to go, I think.

Which suggests splitting this endeavour into two main trunks:

upstream:

downstream: