The-Academic-Observatory / observatory-platform

Observatory Platform Package
https://docs.observatory.academy
Apache License 2.0

Design Proposal: Dataset Organisation in BigQuery for the Academic Observatory #196

Closed rhosking closed 4 years ago

rhosking commented 4 years ago

For everything below, I will use 'academic-observatory' as the GCP project name. However, as the observatory platform is designed to be redeployed by anyone, the project name will change between deployments; the table and dataset organisation remains applicable regardless. This also applies to different deployments, such as:

The following principles will guide the data organisation:

An example of this in practice could look like:

(edited) In the last 3 examples, 'academic_observatory' represents a specific hosted version of the observatory platform. Others who host this platform would create namespaces that represent their own organisation, and possibly import data into their system to replicate the 'academic_observatory' data source. For example
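To make the naming convention concrete, here is a minimal sketch of a helper that assembles a sharded table ID in the proposed `<project>:<dataset>.<table><YYYYMMDD>` form. The function name is illustrative and not part of the observatory-platform codebase:

```python
from datetime import date


def sharded_table_id(project: str, dataset: str, table: str, release: date) -> str:
    """Build a sharded BigQuery table ID following the proposed convention:
    <project>:<dataset>.<table><YYYYMMDD>. Hypothetical helper for illustration."""
    return f"{project}:{dataset}.{table}{release.strftime('%Y%m%d')}"


# Reproduces the crossref example from the proposal:
print(sharded_table_id("academic-observatory", "crossref", "crossref_metadata", date(2020, 3, 5)))
```

The same helper covers the derived-dataset examples, e.g. `sharded_table_id("academic-observatory", "academic_observatory", "countries", date(2020, 4, 10))`.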

@cameronneylon @jdddog @aroelo @bechandcock

rhosking commented 4 years ago

After a discussion with @cameronneylon this morning, I need to add two additional principles to the list:

aroelo commented 4 years ago
  • academic-observatory:crossref.crossref_metadata20200305 (crossref is the provider, metadata is the dataset, 20200305 is the date)

It might be a little confusing that BigQuery's terminology uses 'dataset' while we also refer to a 'dataset'. In your example, 'crossref', the provider, would be the BigQuery dataset. Probably everyone understands what you mean, but I thought I would note it.

Anyway, sounds like a good design proposal overall!

I like the idea of using labels as well, didn't know this was possible.

jdddog commented 4 years ago

This proposal sounds good. I have a few comments.

  • (edited) The use of labels (https://cloud.google.com/bigquery/docs/labels-intro) for all datasets, from raw data coming in from telescopes to any derived datasets. This will become one of our key pillars of tracking data provenance, particularly for derived datasets, where it is essential to track which snapshots or date partitions were used as input sources. Equally, capturing the software version used to create the data, alongside any execution IDs, links back to logs/audit trails.

We should be able to add these automatically to the datasets with the DAGs. At the moment the DAGs add a description for the dataset as well.
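As a sketch of what the DAGs might attach automatically, the helper below builds a provenance labels dict that respects BigQuery's label constraints (lowercase letters, digits, underscores and hyphens; at most 63 characters per value). The key names are assumptions for illustration, not the platform's actual schema; applying the result would use the google-cloud-bigquery client, e.g. `dataset.labels = labels` followed by `client.update_dataset(dataset, ["labels"])`:

```python
import re


def provenance_labels(software_version: str, execution_id: str, input_shard: str) -> dict:
    """Build a labels dict recording data provenance for a derived dataset.
    BigQuery label values may only contain lowercase letters, digits,
    underscores and hyphens (max 63 chars), so inputs are normalised first.
    Key names here are illustrative, not part of observatory-platform."""

    def norm(value: str) -> str:
        # Lowercase, then replace any disallowed character with a hyphen.
        value = re.sub(r"[^a-z0-9_-]", "-", value.lower())
        return value[:63]

    return {
        "software_version": norm(software_version),
        "execution_id": norm(execution_id),
        "input_shard": norm(input_shard),
    }


print(provenance_labels("1.2.0", "run:ABC/123", "crossref_metadata20200305"))
```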

  • academic-observatory:academic_observatory.countries20200410 (the observatory itself is the source of the derived dataset, countries is the dataset, and 20200410 is the date)
  • academic-observatory:academic_observatory.institutions20200410 (the observatory itself is the source of the derived dataset, institutions is the dataset, and 20200410 is the date)

Could we use shorter names, e.g. 'observatory' instead of 'academic_observatory'?

  • academic-observatory:academic_observatory_workflows.unpaywall_processed20200410 (the observatory itself is the source of the derived dataset, unpaywall_processed is the dataset, and 20200410 is the date)

What about calling the dataset 'processed' and removing '_processed' from each table name? Or calling it observatory_processed and removing '_processed' from each table name to make it more succinct?

bechandcock commented 4 years ago

This looks like a good design, especially the concept of the telescopes doing minimal processing. One small addition: be explicit about the date naming, i.e. YYYYMMDD, as there are variants used globally.

jdddog commented 4 years ago

@rhosking can we close this one?

rhosking commented 4 years ago

yes, closing, thanks