The-Academic-Observatory / observatory-platform

Observatory Platform Package
https://docs.observatory.academy
Apache License 2.0

Design requirement: Assurance of Data Segregation between workflows #352

Closed rhosking closed 3 years ago

rhosking commented 3 years ago

Capturing the overarching design requirement, though the work may be spread across a number of other issues. The requirement concerns a single Airflow environment supporting multiple workflows which, unlike the existing cases that use shared buckets and a single GCP BigQuery project, allow DAGs/SubDAGs to explicitly segregate data into multiple projects.

Roughly speaking, the following should be easily supported when writing new Telescopes:

rhosking commented 3 years ago

Digging into the design goals a little further: they are motivated by a few use cases that we should broadly support. Consider the following situation:

There is an Airflow instance running; as it stands it can support a large number of DAGs running on a schedule. Collectively, these DAGs currently all use the same bucket, VM (if required) and GCP project for loading BigQuery datasets. For many public datasets this is perfectly fine. However, this same Airflow instance might be required to fetch both public and private datasets, with the private datasets subject to a number of restrictions (commercial, ethical, and arising from a range of data protection legislation).

The requirement is for DAG/Telescope authors to have a clearly defined set of mechanisms that support writing data ingestion workflows that can optionally:

The overarching purpose behind these requirements is to support data segregation, and thus the ability to define access controls across each set of private data independently.

EDIT: Additionally, it might be worth considering building in functionality to specify data locations for the buckets/VMs/BigQuery datasets. It gets complicated if you are looking to join datasets across regions though, which is why we have been trying to work in a single region/multi-region thus far.
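One way to make these segregation options concrete is a small per-telescope environment description that a workflow reads instead of assuming the shared defaults. A minimal sketch; the class name, field names, and example values below are hypothetical, not part of the observatory-platform API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TelescopeEnvironment:
    """Hypothetical per-telescope settings for segregating data.

    A private telescope can point at its own GCP project, buckets and
    (optionally) VM rather than the shared defaults, and pin a data
    location so all resources land in the same region/multi-region.
    """
    project_id: str                 # GCP project to load BigQuery datasets into
    download_bucket: str            # bucket for raw downloads
    transform_bucket: str           # bucket for transformed files
    data_location: str = "us"       # region or multi-region for buckets/datasets
    vm_name: Optional[str] = None   # dedicated VM, if the telescope needs one


# A shared default alongside a segregated private environment:
DEFAULT_ENV = TelescopeEnvironment(
    project_id="shared-project",
    download_bucket="shared-download",
    transform_bucket="shared-transform",
)

private_env = TelescopeEnvironment(
    project_id="publisher-x-project",
    download_bucket="publisher-x-download",
    transform_bucket="publisher-x-transform",
    data_location="europe-west2",
)
```

Because access controls in GCP attach to projects and buckets, giving each private dataset its own `TelescopeEnvironment` is what makes independent access control possible.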

jdddog commented 3 years ago

To enable this we would need to create a REST API and database to store all of the information specific to a particular telescope and workflow, which allows:
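As a rough illustration of the database side, the per-telescope information could be a single table keyed by telescope name, with the REST API doing little more than reads and writes over it. A minimal sketch using `sqlite3`; the table, columns, and endpoint shapes are hypothetical, not the prototype API's actual design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE telescope (
        name TEXT PRIMARY KEY,        -- e.g. 'crossref_metadata'
        project_id TEXT NOT NULL,     -- GCP project the telescope loads into
        download_bucket TEXT NOT NULL,
        transform_bucket TEXT NOT NULL,
        data_location TEXT NOT NULL   -- region/multi-region for resources
    )"""
)


def upsert_telescope(name, project_id, download_bucket, transform_bucket, data_location):
    """What a PUT /telescopes/<name> endpoint might write."""
    conn.execute(
        "INSERT OR REPLACE INTO telescope VALUES (?, ?, ?, ?, ?)",
        (name, project_id, download_bucket, transform_bucket, data_location),
    )


def get_telescope(name):
    """What a GET /telescopes/<name> endpoint might read; None if unknown."""
    return conn.execute(
        "SELECT project_id, download_bucket, transform_bucket, data_location "
        "FROM telescope WHERE name = ?",
        (name,),
    ).fetchone()


upsert_telescope("private_publisher", "publisher-project",
                 "publisher-download", "publisher-transform", "europe-west2")
```

At DAG-parse or task-run time, a telescope would look itself up by name and use the returned project/bucket/location instead of hard-coded shared values.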

We may also want to update to Airflow 2.0, so that we can use TaskGroups instead of SubDAGs, because they have a better UI and overcome some of the limitations of SubDAGs: https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#taskgroup. We might, for instance, want to have a TaskGroup for each project that gets processed, like what Tuan did with the WoS and Scopus telescopes.
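The TaskGroup-per-project idea can be sketched as the Airflow 2.0 DAG definition fragment below. It is illustrative only, not the project's actual telescope code: the DAG id, project list, and callables are hypothetical placeholders for per-project download/load steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

# Hypothetical per-project settings; in practice these would come from the
# telescope/workflow API discussed above.
PROJECTS = ["shared-public-project", "publisher-x-project"]


def make_callable(step: str, project: str):
    def _run():
        print(f"{step} for {project}")
    return _run


with DAG(dag_id="segregated_telescope", start_date=datetime(2021, 1, 1),
         schedule_interval="@weekly", catchup=False) as dag:
    for project in PROJECTS:
        # One TaskGroup per GCP project, replacing the SubDAG-per-project
        # pattern, with the tasks in each group grouped together in the UI.
        with TaskGroup(group_id=f"process_{project.replace('-', '_')}"):
            download = PythonOperator(
                task_id="download",
                python_callable=make_callable("download", project),
            )
            load = PythonOperator(
                task_id="load",
                python_callable=make_callable("load", project),
            )
            download >> load
```

Unlike SubDAGs, TaskGroups are purely a UI/namespacing construct, so the per-project tasks run in the parent DAG's scheduler context rather than as nested DAG runs.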

rhosking commented 3 years ago

I think that sounds sensible, subject to working through the details. The other two things that need considering are the work on telescope templates (#326) and the prototype work being done for an API (#331).

rhosking commented 3 years ago

The following telescopes are currently impacted by, and are waiting for decisions made on this issue: