GSA / sdg-indicators-usa

U.S. National Reporting Platform for the Sustainable Development Goals
https://sdg.data.gov

SDMX output of SDG data #852

Open brockfanning opened 6 years ago

brockfanning commented 6 years ago

Hi @philipashlock and @Moris-OMB, I wanted to run an SDMX proof-of-concept past you. The goal here is to convert the SDG data from the CSV source files into SDMX-compliant output. Here is an example of some generic timeseries output.

Steps and files involved in accomplishing this

  1. Implement a naming convention for the source CSV files' column names (in the case of the example linked above, that source data would be this). The naming convention is the same as what I’ve detailed separately in the “wide vs tidy data format” issue, here. This allows the platform to identify the relationships between the various columns in the CSV source.
  2. A build script that converts this source CSV file into a “tidy” format. This is not strictly necessary, but doing this allows the next steps to be re-used by any other fork of the platform that might use tidy data, like the UK platform. For reference, the build script is here and the “tidied” results, in the case of the example above, are here.
  3. A second build script that converts the tidy CSV data into JSON data that’s structured for the easiest output as SDMX: a header, and a dataset containing a list of series, each containing attributes and observations. That build script is here and an example of the JSON results, in the case of the example above, is here. This is where most of the work is done, so some more notes about this script:
    1. The JSON output is not intended to be SDMX-JSON format. It’s intended to be data that a Jekyll layout can easily use to produce SDMX output - whether SDMX-ML or SDMX-JSON.
    2. The “attributes” section of each “series” is intended to have SDMX-ready keys, like FREQ, COLL_METHOD, and UNIT_MEASURE. To accomplish this, the script needs to know which CSV column names map to which SDMX “concepts”, and which data values map to which SDMX "codes". An example of where that works successfully is here. Some details about this:
      1. The CSV column name “sex” is getting mapped to “CL_SEX”.
      2. The data value “male” is getting mapped to “M”.
      3. The mappings live in a YAML file.
      4. These mappings can also be extended/overridden for each individual indicator, by adding the same YAML to the appropriate indicator file in the _indicators folder. So, if a particular data provider for an indicator must use non-standard column names or data values, that can be accommodated.
    3. An example of where that doesn’t happen successfully, because there is no mapping for a particular CSV column name and data value, is here. So a big part of this will be figuring out the necessary SDMX identifiers and then mapping columns/values to them.
  4. Finally, the Jekyll build generates the SDMX output based on a layout. So far I’ve only added a rough-draft generic timeseries XML layout. Some notes about it:
    1. I based it roughly off some generic sample XML here.
    2. I’m sure it has plenty of issues and missing info. A few of the many uncertainties I have:
      1. Whether I've got the namespacing right
      2. If this generic timeseries approach is even appropriate/sufficient for the SDGs
      3. What else I’m missing in the header and/or series attributes
    3. This one doesn’t involve a DSD (data structure definition), which is why this is only a first step proof-of-concept.
    4. The hope is that more complex schemas, and eventually SDMX-JSON, could also be implemented using a Jekyll layout in the same way.
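To make step 3 more concrete, here is a minimal Python sketch of the mapping and grouping described above. It is not the actual build script. It assumes the tidy rows use `year` and `value` columns, that the YAML mappings have already been loaded into plain dictionaries, and that the per-indicator override (`women` to `F`) is a made-up example. The only mappings taken from this issue are `sex` to `CL_SEX` and `male` to `M`; the `concepts`/`codes` layout, the function names, and the header contents are all hypothetical.

```python
import json
from collections import defaultdict

# Stand-in for the global YAML mappings (structure is an assumption).
GLOBAL_MAPPINGS = {
    "concepts": {"sex": "CL_SEX"},
    "codes": {"CL_SEX": {"male": "M"}},
}


def merge_mappings(base, override):
    """Return a copy of the base mappings with per-indicator overrides on top."""
    merged = {
        "concepts": dict(base["concepts"]),
        "codes": {concept: dict(codes) for concept, codes in base["codes"].items()},
    }
    merged["concepts"].update(override.get("concepts", {}))
    for concept, codes in override.get("codes", {}).items():
        merged["codes"].setdefault(concept, {}).update(codes)
    return merged


def map_attributes(row, mappings, skip=("year", "value")):
    """Translate CSV column names and values into SDMX-style keys and codes.

    Unmapped columns and values pass through unchanged, so missing
    mappings stay visible in the output instead of being dropped.
    """
    attributes = {}
    for column, value in row.items():
        if column in skip or value in ("", None):
            continue
        concept = mappings["concepts"].get(column, column)
        attributes[concept] = mappings["codes"].get(concept, {}).get(value, value)
    return attributes


def build_dataset(tidy_rows, mappings):
    """Group tidy rows into series that share the same mapped attributes."""
    series = defaultdict(list)
    for row in tidy_rows:
        key = tuple(sorted(map_attributes(row, mappings).items()))
        series[key].append({"time": row["year"], "value": row["value"]})
    return {
        "header": {"id": "example"},  # hypothetical header content
        "dataset": {
            "series": [
                {"attributes": dict(key), "observations": observations}
                for key, observations in series.items()
            ]
        },
    }


if __name__ == "__main__":
    rows = [
        {"year": "2015", "sex": "male", "value": "1.5"},
        {"year": "2016", "sex": "male", "value": "1.7"},
        {"year": "2015", "sex": "women", "value": "2.0"},
    ]
    indicator_override = {"codes": {"CL_SEX": {"women": "F"}}}
    mappings = merge_mappings(GLOBAL_MAPPINGS, indicator_override)
    print(json.dumps(build_dataset(rows, mappings), indent=2))
```

Letting unmapped columns pass through untouched mirrors the "doesn't happen successfully" case above: the gap shows up in the output rather than silently disappearing.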

Next steps?

  1. Decide if a different SDMX schema (as opposed to the "generic timeseries" that I've used) should be used for the SDGs.
  2. And if so, which schema should we use?
  3. If we decide that a DSD is needed, can the DSD be consistent across all NRPs? Or does it need to be...
    1. Tailored to each particular platform?
    2. Tailored to each particular country?

Longer term tasks if we move forward

  1. Implement output for SDMX-JSON
  2. Work this into the multi-lingual capabilities of the platform

AnnCorp commented 6 years ago

@brockfanning really interesting to see your progress on this, as it's something the UK needs to look at too (probably in spring, when we get our in-house developer in post). I was recently trying to find out more about the UN SDMX project and found this update from the SDG Working Group on SDMX from November. Not sure if you've already seen this? https://unstats.un.org/sdgs/files/meetings/iaeg-sdgs-meeting-06/3.%20Update%20SDMX%20Working%20Group.pdf It mentions draft DSDs being available in Nov 2017?

brockfanning commented 6 years ago

@AnnCorp I haven't heard anything new about the availability of DSDs either. But I'll keep you posted if I do. I'm still wrapping my head around SDMX, and am very curious if the DSD will be shared across all NRPs, or will need to be customized per country.

soho501 commented 6 years ago

Hi @brockfanning and @AnnCorp, we are hoping to have the pilot DSD by the end of January and our aim is that it will be valid for all NRPs, but realistically speaking it might need some customization per country.

I find this thread very interesting, and there are a couple of ideas that I wanted to share with you. I was hoping to discuss this at the January conference, but I guess we can start here. At UNSD we have been working on our own data processing and dissemination system for the Global SDG data reported by agencies. We are experimenting with Data Packages for dissemination. This is what we have generated so far; it needs a bit of fine-tuning since the Data Packages specification changed recently, but you can get the idea. We also have a JSON REST API that allows you to query the data in different ways. Please note that the API is a test version.

Our next step is to generate SDMX messages either from the API or from the Data Packages, which is pretty much the same thing you are trying to do with the CSV/JSON, so it would be nice to exchange ideas.
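As a starting point for that exchange, here is a rough Python sketch of rendering series-shaped data (a header plus a list of series, each with attributes and observations, as described in the opening comment) into generic SDMX-ML style XML. The element names and namespace URIs follow my reading of SDMX-ML 2.1 generic data messages, but the output has not been validated against the schema, it carries no DSD reference, and the function name and input keys are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as I understand them for SDMX-ML 2.1 (not validated here).
MESSAGE_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
GENERIC_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic"


def render_generic_data(dataset):
    """Render a {'header': ..., 'dataset': {'series': [...]}} dict as XML text."""
    ET.register_namespace("message", MESSAGE_NS)
    ET.register_namespace("generic", GENERIC_NS)

    root = ET.Element(f"{{{MESSAGE_NS}}}GenericData")
    header = ET.SubElement(root, f"{{{MESSAGE_NS}}}Header")
    ET.SubElement(header, f"{{{MESSAGE_NS}}}ID").text = dataset["header"]["id"]

    data_set = ET.SubElement(root, f"{{{MESSAGE_NS}}}DataSet")
    for series in dataset["dataset"]["series"]:
        series_el = ET.SubElement(data_set, f"{{{GENERIC_NS}}}Series")
        # Each attribute becomes one key/value entry in the series key.
        key_el = ET.SubElement(series_el, f"{{{GENERIC_NS}}}SeriesKey")
        for concept, code in series["attributes"].items():
            ET.SubElement(
                key_el,
                f"{{{GENERIC_NS}}}Value",
                {"id": concept, "value": str(code)},
            )
        # Each observation pairs a time period with a value.
        for obs in series["observations"]:
            obs_el = ET.SubElement(series_el, f"{{{GENERIC_NS}}}Obs")
            ET.SubElement(
                obs_el, f"{{{GENERIC_NS}}}ObsDimension", {"value": str(obs["time"])}
            )
            ET.SubElement(
                obs_el, f"{{{GENERIC_NS}}}ObsValue", {"value": str(obs["value"])}
            )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example = {
        "header": {"id": "example"},
        "dataset": {
            "series": [
                {
                    "attributes": {"FREQ": "A", "CL_SEX": "M"},
                    "observations": [{"time": "2015", "value": "1.5"}],
                }
            ]
        },
    }
    print(render_generic_data(example))
```

The same function would work whether the series come from CSV files, Data Packages, or an API, as long as they are first reshaped into this intermediate structure.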

Apart from that, I was wondering how difficult it might be to adapt this platform to consume Data Packages or the API directly. I haven't looked in much depth at how the platform is built, so my apologies if this question is completely ridiculous. :)

brockfanning commented 6 years ago

Hi @soho501 thank you for that info. I'm looking forward to diving into that DSD to see how it can be used in this platform.

The Data Packages repo and the SDG API both look very interesting!

Currently this platform is only capable of using the data that's stored in the repo as static CSV files. But one of our development priorities is to implement some abstraction to allow the platform to get data dynamically from other sources. It sounds like the SDG API might be a great use-case to develop for.
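To illustrate the kind of abstraction I have in mind, here is a small Python sketch. Nothing here exists in the platform yet: the class names are hypothetical, the CSV source takes text in memory as a stand-in for the files in the repo, and the API source assumes the endpoint returns a JSON array of flat records, which may not match the real SDG API.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Common interface: return an indicator's data as a list of row dicts."""

    @abstractmethod
    def rows(self, indicator_id):
        ...


class StaticCsvSource(DataSource):
    """Reads CSV text keyed by indicator id (stand-in for CSV files in the repo)."""

    def __init__(self, csv_by_indicator):
        self.csv_by_indicator = csv_by_indicator

    def rows(self, indicator_id):
        text = self.csv_by_indicator[indicator_id]
        return list(csv.DictReader(io.StringIO(text)))


class JsonApiSource(DataSource):
    """Gets rows from an API through an injected fetch callable.

    `fetch` takes an indicator id and returns JSON text; injecting it
    keeps this sketch independent of any particular HTTP client or URL.
    """

    def __init__(self, fetch):
        self.fetch = fetch

    def rows(self, indicator_id):
        return json.loads(self.fetch(indicator_id))


if __name__ == "__main__":
    csv_source = StaticCsvSource({"1-1-1": "year,value\n2015,1.5\n2016,1.7\n"})
    api_source = JsonApiSource(
        lambda indicator_id: '[{"year": "2015", "value": "1.5"}]'
    )
    for source in (csv_source, api_source):
        print(type(source).__name__, source.rows("1-1-1"))
```

The rest of the build would only depend on `rows()`, so swapping the static CSV files for the SDG API (or Data Packages) would not require changes downstream.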