Closed: rhosking closed this issue 3 years ago.
Digging into the design goals a little further, motivated by a few use cases that we should broadly support. Consider the following situation:
There is an Airflow instance running; as it stands it can support a large number of DAGs running on a schedule. Collectively, these DAGs all currently use the same bucket, VM (if required) and GCP project for loading BigQuery datasets. For many public datasets this is perfectly fine. However, this same Airflow instance might be required to fetch both public and private datasets, with the private datasets subject to a number of restrictions (commercial, ethical, and those arising from a range of data protection legislation).
The requirement is for DAG/Telescope authors to have a clearly defined set of mechanisms that support writing data ingestion workflows that can optionally:
The overarching purpose behind these requirements is to support data segregation, and thus the ability to define access controls across each set of private data independently.
EDIT: Additionally, it might be worth considering building in functionality to specify data locations for the buckets/VMs/BigQuery datasets. It gets complicated if you are looking to join datasets across regions though, which is why we have been working within a single region/multi-region thus far.
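To make the segregation idea concrete, here is a minimal sketch of what per-telescope environment settings could look like, including an optional data location. All names here (`TelescopeEnvironment`, the project and bucket ids) are hypothetical illustrations, not the actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: per-telescope environment settings. A telescope that
# handles private data can override the shared defaults with its own project
# and buckets, and optionally pin a data location (region or multi-region).
@dataclass(frozen=True)
class TelescopeEnvironment:
    project_id: str            # GCP project that owns the BigQuery datasets
    download_bucket: str       # GCS bucket for downloaded files
    transform_bucket: str      # GCS bucket for transformed files
    data_location: str = "US"  # BigQuery/GCS location, e.g. "US" or "europe-west2"

# Default shared environment used by public-dataset telescopes.
SHARED = TelescopeEnvironment(
    project_id="shared-project",
    download_bucket="shared-download",
    transform_bucket="shared-transform",
)

def environment_for(telescope_id: str, overrides: dict) -> TelescopeEnvironment:
    """Return the segregated environment for a telescope, or the shared default."""
    return overrides.get(telescope_id, SHARED)

# A private telescope gets its own project, buckets and location, so access
# controls can be defined on its data independently of the shared resources.
PRIVATE_OVERRIDES = {
    "private_telescope": TelescopeEnvironment(
        project_id="private-project",
        download_bucket="private-download",
        transform_bucket="private-transform",
        data_location="europe-west2",
    ),
}
```

Keeping the location alongside the project/bucket settings means a telescope's buckets and BigQuery datasets can be created in the same region, avoiding the cross-region join problem mentioned above.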
To enable this we would need to create a REST API and database to store all of the information specific to a particular telescope and workflow, which allows:
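As a rough illustration of the records such an API might store, here is an in-memory sketch of the storage layer. The field names and functions are hypothetical; a real implementation would back this with a database and expose it over REST endpoints:

```python
from typing import Dict, Optional

# Hypothetical in-memory store of per-telescope settings. A real version
# would live behind a REST API (e.g. GET/PUT on a /telescopes resource)
# with a database underneath.
_telescopes: Dict[str, dict] = {}

def upsert_telescope(telescope_id: str, record: dict) -> dict:
    """Create or update the stored settings for a telescope."""
    _telescopes[telescope_id] = {"id": telescope_id, **record}
    return _telescopes[telescope_id]

def get_telescope(telescope_id: str) -> Optional[dict]:
    """Fetch the stored settings for a telescope, or None if unknown."""
    return _telescopes.get(telescope_id)

# Example: a private telescope registered with its own project.
upsert_telescope("scopus", {"project_id": "private-project", "schedule": "@weekly"})
```

DAGs could then look up their project/bucket settings from this store at parse or run time, rather than hard-coding the shared environment.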
We may also want to update to Airflow 2.0, so that we can use TaskGroups instead of SubDags, because they have a better UI and overcome some of the limitations of SubDags: https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#taskgroup. We might, for instance, want to have a TaskGroup for each project that gets processed, like what Tuan did with the WoS and Scopus telescopes.
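To illustrate the per-project grouping without requiring an Airflow install: Airflow 2.0 TaskGroups prefix each contained task's id with the group id, so a TaskGroup per project yields ids like `project_a.download`. A plain-Python sketch of that naming scheme, with hypothetical project and task names:

```python
def task_ids_per_project(projects, tasks):
    """Mimic Airflow 2.0 TaskGroup naming: a task inside a TaskGroup gets its
    id prefixed with the group id, e.g. "project_a.download". One group per
    project keeps each project's pipeline visually and logically separate."""
    return {
        project: [f"{project}.{task}" for task in tasks]
        for project in projects
    }

# Two hypothetical projects, each running the same ingestion steps.
ids = task_ids_per_project(
    ["project_a", "project_b"],
    ["download", "transform", "load"],
)
```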
I think that sounds sensible, subject to working through the details. The other two things that need considering are the work around telescope templates (#326) and the prototype work that is being done for an API (#331).
The following telescopes are currently impacted by, and are waiting for decisions made on this issue:
Capturing the overarching design requirement here, though the work may be spread across a number of other issues. The requirement is for a single Airflow environment to support multiple workflows that, unlike the existing cases (which use shared buckets and a single GCP BigQuery project), allow DAGs/SubDags to explicitly segregate data into multiple projects.
Roughly speaking, the following should be easily supported when writing new Telescopes: