The-Academic-Observatory / observatory-platform

Observatory Platform Package
https://docs.observatory.academy
Apache License 2.0

Design Proposal: Dataset Organisation in BigQuery for the Academic Observatory #196

Closed rhosking closed 4 years ago

rhosking commented 4 years ago

For everything below, I will use 'academic-observatory' as the GCP project name. However, as the observatory platform is designed to be redeployed by anyone, the project name will change between deployments; the table and dataset organisation remains applicable regardless. This also applies to different deployments, such as:

The following principles will guide the data organisation:

An example of this in practice could look like:

(edited) In the last 3 examples, 'academic_observatory' represents a specific hosted version of the observatory platform. Others who host this platform would create namespaces that represent their own organisation, and possibly import data into their system to replicate the 'academic_observatory' data source. For example
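To make the naming convention concrete, here is a minimal sketch of a helper that assembles a sharded table ID in the proposed `<project>:<dataset>.<table><YYYYMMDD>` form. The function name is illustrative and not part of the observatory-platform codebase:

```python
from datetime import date


def sharded_table_id(project: str, dataset: str, table: str, release: date) -> str:
    """Build a sharded BigQuery table ID following the proposed convention:
    <project>:<dataset>.<table><YYYYMMDD>. Hypothetical helper for illustration."""
    return f"{project}:{dataset}.{table}{release.strftime('%Y%m%d')}"


# Reproduces the crossref example from the proposal:
print(sharded_table_id("academic-observatory", "crossref", "crossref_metadata", date(2020, 3, 5)))
```

The same helper covers the derived-dataset examples, e.g. `sharded_table_id("academic-observatory", "academic_observatory", "countries", date(2020, 4, 10))`.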

@cameronneylon @jdddog @aroelo @bechandcock

rhosking commented 4 years ago

After a discussion with @cameronneylon this morning, I need to add two additional principles to the list:

aroelo commented 4 years ago
  • academic-observatory:crossref.crossref_metadata20200305 (crossref is the provider, metadata is the dataset, 20200305 is the date)

It might be a little confusing that BigQuery's terminology uses 'dataset' while we also refer to a 'dataset'. In your example, 'crossref', the provider, would be the BigQuery dataset. Probably everyone understands what you mean, but I thought I would note it.

Anyway, sounds like a good design proposal overall!

I like the idea of using labels as well, didn't know this was possible.

jdddog commented 4 years ago

This proposal sounds good. I have a few comments.

  • (edited) The use of labels (https://cloud.google.com/bigquery/docs/labels-intro) for all datasets, from raw data coming in from telescopes to any derived datasets. This will become one of our key pillars of tracking data provenance, particularly for derived datasets, where it is essential to track which snapshots or date partitions were used as input sources. Equally, capturing the software version used to create the data, alongside any execution IDs, links back to logs/audit trails.

We should be able to add these automatically to the datasets with the DAGs. At the moment the DAGs add a description for the dataset as well.
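As a sketch of what the DAGs might attach automatically, the helper below builds a provenance labels dict that respects BigQuery's label constraints (lowercase letters, digits, underscores and hyphens; at most 63 characters per value). The key names are assumptions for illustration, not the platform's actual schema; applying the result would use the google-cloud-bigquery client, e.g. `dataset.labels = labels` followed by `client.update_dataset(dataset, ["labels"])`:

```python
import re


def provenance_labels(software_version: str, execution_id: str, input_shard: str) -> dict:
    """Build a labels dict recording data provenance for a derived dataset.
    BigQuery label values may only contain lowercase letters, digits,
    underscores and hyphens (max 63 chars), so inputs are normalised first.
    Key names here are illustrative, not part of observatory-platform."""

    def norm(value: str) -> str:
        # Lowercase, then replace any disallowed character with a hyphen.
        value = re.sub(r"[^a-z0-9_-]", "-", value.lower())
        return value[:63]

    return {
        "software_version": norm(software_version),
        "execution_id": norm(execution_id),
        "input_shard": norm(input_shard),
    }


print(provenance_labels("1.2.0", "run:ABC/123", "crossref_metadata20200305"))
```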

  • academic-observatory:academic_observatory.countries20200410 (the observatory itself is the source of the derived dataset, countries is the dataset, and 20200410 is the date)
  • academic-observatory:academic_observatory.institutions20200410 (the observatory itself is the source of the derived dataset, institutions is the dataset, and 20200410 is the date)

Could we use shorter names, e.g. 'observatory' instead of 'academic_observatory'?

  • academic-observatory:academic_observatory_workflows.unpaywall_processed20200410 (the observatory itself is the source of the derived dataset, unpaywall_processed is the dataset, and 20200410 is the date)

What about calling the dataset 'processed' and removing '_processed' from each table name? Or calling it observatory_processed and removing '_processed' from each table name to make it more succinct?

bechandcock commented 4 years ago

This looks like a good design, especially the concept of the telescopes doing minimal processing. One small addition: be explicit about the date naming, i.e. YYYYMMDD, as there are variants used globally.

jdddog commented 4 years ago

@rhosking can we close this one?

rhosking commented 4 years ago

yes, closing, thanks