GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International

Add a metadata package schema #200

Closed duncandewhurst closed 12 months ago

duncandewhurst commented 1 year ago

Presently, the top-level object in the RDLS schema is a single dataset which means that an RDLS JSON file can only contain RDLS metadata for one dataset, e.g.

{
  "id": "1",
  "title": "My dataset"
}

However, tools that process RDLS metadata (like Flatten Tool and CoVE) need to be able to accept and produce JSON files containing RDLS metadata for multiple datasets. If we leave it up to publishers to package data however they like, writing those tools becomes very complicated because they need to accept arbitrary JSON data. Therefore, we need to decide on a standard format for 'packaging' RDLS metadata for multiple datasets and include this in the RDLS documentation.

The approach used in other standards (OCDS, BODS, OFDS) is to wrap the JSON objects described by the main schema in an array. In RDLS that would be a datasets array, e.g.

{
  "datasets": [
    {
      "id": "1",
      "title": "My first dataset"
    },
    {
      "id": "2",
      "title": "My second dataset"
    }
  ]
}

This structure is analogous to the datasets sheet in the spreadsheet template, and Flatten Tool already produces JSON output that conforms to it, so this approach is already in use in the tools. However, we still need to document the packaging format for the benefit of implementers that produce JSON files directly and for the benefit of developers of tools that consume RDLS data.

The downside of using an array as the package format is that common tools and libraries for parsing JSON data sometimes lack support for streaming data from JSON arrays, i.e. loading the items in the JSON array into memory one at a time. That is only an issue if the whole array is too large to fit in memory, which given the nature of RDLS metadata would likely only happen for very large packages containing many thousands of datasets. My understanding is that the use cases for RDLS are mostly about sharing metadata for individual datasets, so I think that using an array is fine here. Even if there were a need to share the metadata for all of the datasets in the Risk Data Library Collection as a single data file, the data should fit in memory as the quantity of datasets is small. If we later find out about use cases for sharing very large packages that cannot fit in memory, we can consider defining a secondary bulk data format with better streaming support (e.g. JSON Lines).
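To illustrate why JSON Lines would suit a hypothetical bulk format: because each dataset is a complete JSON object on its own line, a consumer can process datasets one at a time with only standard-library tools, rather than loading a whole `datasets` array into memory. This is just a sketch; the inline data is illustrative, not real RDLS metadata.

```python
import io
import json

# Hypothetical bulk format: one RDLS dataset object per line (JSON Lines).
# A file object would normally be used here; StringIO stands in for one.
bulk = io.StringIO(
    '{"id": "1", "title": "My first dataset"}\n'
    '{"id": "2", "title": "My second dataset"}\n'
)

ids = []
for line in bulk:  # iterates line by line; memory use is per-dataset
    dataset = json.loads(line)
    ids.append(dataset["id"])

print(ids)  # ['1', '2']
```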

Proposal

{
  "$id": "https://raw.githubusercontent.com/GFDRR/rdl-standard/0__2__0/schema/rdls_package_schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "RDLS metadata package",
  "description": "A container for publishing Risk Data Library Standard Metadata.",
  "type": "object",
  "required": [
    "datasets"
  ],
  "properties": {
    "datasets": {
      "title": "Datasets",
      "description": "RDLS metadata describing one or more datasets.",
      "type": "array",
      "minItems": 1,
      "items": {
        "$ref": "https://raw.githubusercontent.com/GFDRR/rdl-standard/0__2__0/schema/rdls_schema.json"
      }
    }
  },
  "minProperties": 1
}
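For implementers who want a sense of what the proposed package schema enforces without pulling in a JSON Schema validator, here is a minimal plain-Python sketch of its core constraints (a required, non-empty `datasets` array at the top level). It deliberately does not validate the datasets themselves against the RDLS schema, and the function name is made up for illustration.

```python
import json

def check_package(raw: str) -> list:
    """Apply the core constraints from the proposed package schema:
    a top-level object with a required, non-empty 'datasets' array."""
    errors = []
    package = json.loads(raw)
    if not isinstance(package, dict):
        errors.append("top-level value must be an object")
    elif "datasets" not in package:
        errors.append("'datasets' is required")
    elif not isinstance(package["datasets"], list):
        errors.append("'datasets' must be an array")
    elif len(package["datasets"]) < 1:
        errors.append("'datasets' must contain at least one item")
    return errors

ok = check_package('{"datasets": [{"id": "1", "title": "My first dataset"}]}')
print(ok)  # [] -- no errors
bad = check_package('{"datasets": []}')
print(bad)  # ["'datasets' must contain at least one item"]
```

In practice a tool would validate with a JSON Schema library against the published schema so that the `$ref` to `rdls_schema.json` is also resolved and each dataset is checked in full.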

@matamadio @stufraser1 let me know if this sounds good and I can make the above changes.

stufraser1 commented 1 year ago

Sounds like this is needed, please go ahead

duncandewhurst commented 1 year ago

Agreed to go ahead in check-in call, I'll draft some initial guidance for @stufraser1 and @matamadio to review. Guidance should focus on publishing one dataset at a time.

matamadio commented 1 year ago

Agree, for the sake of the guidance examples. Multiple datasets in one file is kind of a pro feature that we will need to test deeply. Good to have it, but let's not mention it in the docs until fully tested.