This is a toolkit developed as part of an ongoing research collaboration with Yale University and Buro Happold, aimed at quantifying Embodied Slavery in building elements and construction techniques.
A couple of questions came up in our KT roundtable today about how the LabourExploitationRisk BHoM class and its coupled dataset(s) can be designed to effectively manage updates to the data (e.g. as new import/export and GSI data become available each year) and to the schema (e.g. as we incorporate new indicators that were not present in previous versions).
The first, more prosaic, question concerns how we establish a pipeline to ensure new datasets can be compiled and pushed into BHoM efficiently and with minimal error. When we need to compile or process external data sources that adhere to consistent schemas (e.g. EPW files, EIA data, etc.), we generally develop scripts as part of a version-controlled repo. This way, as new data become available, we can simply download the new data, re-run those scripts, and diff the output (ideally as both a beautified and a minified version, so that we can easily spot-check changes). Such scripts will often provide summary statistics and warnings to guide users to where further checks are needed, often leading to modification of the script (e.g. because the schema of the underlying data source has changed slightly). If we find ourselves having to replace bits of text on a regular basis for consistency between datasets, we'll write a mapping function to make that explicit and to save ourselves the trouble of subsequent manual rewrites (e.g. "U.K." maps to "United Kingdom"). We think codifying and version-controlling how we process external data sources (such as the import/export data) would really benefit the LabourExploitationRisk dataset pipeline.
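As a point of reference, here is a minimal sketch of the kind of compilation script we have in mind. It is illustrative only: the file names, the raw CSV columns ("country", "value"), and the entries in the name map are all assumptions, not the actual import/export or GSI sources.

```python
# Minimal sketch of a dataset-compilation script. File names, CSV columns,
# and the country-name mapping are hypothetical placeholders.
import csv
import json

# Explicit mapping for country-name consistency between datasets,
# so renames are codified rather than applied by hand each year.
COUNTRY_NAME_MAP = {
    "U.K.": "United Kingdom",
    "U.S.A.": "United States",
}


def normalise_country(name: str) -> str:
    """Return the canonical country name, falling back to the input."""
    return COUNTRY_NAME_MAP.get(name.strip(), name.strip())


def compile_dataset(raw_csv_path: str, out_stem: str) -> None:
    records = []
    with open(raw_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            records.append({
                "country": normalise_country(row["country"]),
                "value": float(row["value"]),
            })

    # Beautified output for human-readable diffs, minified output for a quick
    # "did anything change at all?" check between releases.
    with open(f"{out_stem}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, sort_keys=True)
    with open(f"{out_stem}.min.json", "w", encoding="utf-8") as f:
        json.dump(records, f, separators=(",", ":"), sort_keys=True)

    # Summary statistics to flag where further manual checks are needed.
    values = [r["value"] for r in records]
    print(f"{len(records)} records; min={min(values)}, max={max(values)}")


if __name__ == "__main__":
    compile_dataset("raw_imports_2023.csv", "imports_2023")
```

Re-running this against a new annual release and diffing the two committed JSON outputs would make each year's changes explicit and reviewable.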
The second question concerns how periodic changes to the LabourExploitationRisk data that get pushed to BHoM will affect users' past models. Assuming we get new data in 2023 for imports/exports and GSI, should that data automatically replace what was there previously, potentially causing users' data to change when they next fire up BHoM in Gh/Excel? Or should users be allowed to select which version of the data (semantic version? release year?) they are using?
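One possible shape for the "let users pick" option, sketched below purely to make the question concrete (this is not current BHoM behaviour, and the directory layout and function names are assumptions): datasets are archived per release year, and a model either pins a year or explicitly opts in to the latest.

```python
# Sketch of version-pinned dataset selection, assuming one JSON file per
# release year under a hypothetical datasets/ directory.
import json
from pathlib import Path

DATASET_DIR = Path("datasets/labour_exploitation_risk")  # hypothetical layout


def load_dataset(release_year: int | None = None) -> list:
    """Load the dataset for a given release year, or the latest if None."""
    available = sorted(int(p.stem) for p in DATASET_DIR.glob("*.json"))
    if not available:
        raise FileNotFoundError("no LabourExploitationRisk datasets found")
    year = release_year if release_year is not None else available[-1]
    if year not in available:
        raise ValueError(f"no dataset for {year}; available: {available}")
    return json.loads((DATASET_DIR / f"{year}.json").read_text(encoding="utf-8"))


# A model that pins load_dataset(2022) keeps its results when 2023 data land;
# a model that calls load_dataset() accepts always tracking the latest release.
```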
Related to the second question, what if the data schema for the LabourExploitationRisk class needs to change (e.g. due to a new indicator we want to use)? If the answer to the second question is to maintain an archive of past datasets, we'll need to ensure that the LabourExploitationRisk class can effectively consume them. Based on our last conversation, it seemed that there must be absolute agreement between the class and the JSON schema of the associated data, which suggests that we would have to either:
- back-port the old datasets (filling in lots of nulls or double.NaN values where information was missing), or
- adopt a class structure that abstracts the particular indicators of interest, so that each country object has an array of indicators, each with a key, value, and description, where the key would be e.g. "VictimsOfModernSlavery" (see the sketch after this list). That way, adding a new indicator would not force a change to the class definition, just to the data, and old data would still be valid. Of course, there would be more work downstream to properly roll up and consume the data when calculating averages, since all the indicators would be implicitly defined rather than being part of the class. We don't have a strong feeling one way or the other on this one, but we wanted to raise it in light of the likelihood that both the schema and the data will change over time as the tool evolves.
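For clarity, here is a rough sketch of that abstracted structure, written in Python rather than as the actual C# BHoM class; the names Indicator, CountryRisk, and mean_indicator are illustrative only.

```python
# Sketch of an indicator-array structure where indicators are data, not
# class members, so old datasets stay valid when new indicators are added.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Indicator:
    key: str          # e.g. "VictimsOfModernSlavery"
    value: float
    description: str = ""


@dataclass
class CountryRisk:
    country: str
    indicators: list = field(default_factory=list)

    def get(self, key: str) -> Optional[float]:
        """Look up an indicator by key; None if this dataset predates it."""
        return next((i.value for i in self.indicators if i.key == key), None)


def mean_indicator(countries: list, key: str) -> float:
    """Roll up across countries, skipping datasets that lack the indicator."""
    values = [c.get(key) for c in countries if c.get(key) is not None]
    if not values:
        raise ValueError(f"indicator {key!r} not present in any dataset")
    return sum(values) / len(values)
```

The trade-off is visible in mean_indicator: because indicators are only implicitly defined, every roll-up has to decide how to handle countries or years where an indicator is simply absent.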