The-Academic-Observatory / oaebu-workflows

Telescopes, Workflows and Data Services for the 'Book Analytics Dashboard Project (2022-2025)', building upon the project 'Developing a Pilot Data Trust for Open Access eBook Usage (2020-2022)'
https://documentation.book-analytics.org/
Apache License 2.0

Feature/workflow config file #142

Closed keegansmith21 closed 1 year ago

keegansmith21 commented 1 year ago

OAEBU Refactor

This PR refactors the repository so that it is compatible with the upcoming change to the Observatory Platform.

Purpose for Refactor

The primary purpose is to greatly reduce the workflows' reliance on the Observatory API. The API's only function now is to store dataset releases. We also wanted to reduce the reliance on inheritance for telescope and release creation, and to remove much of the release functionality. Each telescope/workflow now inherits directly from the base Workflow class. Additional tools and utilities fill the roles that the Stream/Snapshot/Organisation telescopes previously held. This change was necessary because having to adhere to one of these three templates was restrictive. The concept of an Organisation has also been dropped, as it was found to be largely unnecessary.

Release Class

The release class for each telescope now inherits from either PartitionRelease or SnapshotRelease, depending on whether the workflow creates a partitioned table or a snapshot. The differences between these two releases are entirely cosmetic. The release class has been simplified so that it concerns itself only with run-specific parameters/variables. It is needed only because some aspects of a workflow cannot be determined until runtime, so it should reflect only those aspects. The workflow release classes therefore tend to have few or no methods and contain only file paths and dates.
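As a rough illustration of a release that carries only run-specific values, here is a minimal sketch. The base class below is a hypothetical stand-in defined locally, not the platform's actual SnapshotRelease, and the field names, directory and file naming are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class SnapshotRelease:
    """Hypothetical stand-in for the platform's snapshot release base class."""

    dag_id: str
    run_id: str
    snapshot_date: date


@dataclass
class ExampleTelescopeRelease(SnapshotRelease):
    """Holds only values known at runtime: a date and the paths derived from it."""

    data_dir: Path = Path("/tmp/example")  # hypothetical base directory

    @property
    def download_path(self) -> Path:
        # The file name depends on the snapshot date, which is only known per run.
        file_name = f"download_{self.snapshot_date:%Y_%m_%d}.csv"
        return self.data_dir / self.dag_id / file_name


release = ExampleTelescopeRelease(
    dag_id="example_telescope",
    run_id="scheduled__2023-01-01",
    snapshot_date=date(2023, 1, 1),
)
print(release.download_path)
```

The workflow itself would own the tasks; the release is passed between them purely as a container for the date and file paths.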

DAG Creation

DAG creation is now done from the observatory-platform's load_workflows.py. Because this loads the DAGs in a standard way, there is no room to customise individual DAG loads. Instead, a .yaml config file describes each individual workflow (DAG), which makes workflow loading much more transparent. DAG IDs are supplied explicitly in the config file and passed to the Workflow constructor, again for transparency. Prior to this update, DAG IDs were constructed from the DAG prefix and the Organisation, which is now deprecated.
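The config-driven loading described above could look roughly like the sketch below. It is not the code in load_workflows.py: the Workflow stand-in, the class registry, the config keys (dag_id, class_name, kwargs) and the example DAG IDs are all assumptions for illustration. The config is shown as the list of dictionaries a YAML parser would typically return, so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Type


@dataclass
class Workflow:
    """Hypothetical stand-in for the base Workflow class."""

    dag_id: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


class OnixWorkflow(Workflow):
    """Illustrative subclass; carries no behaviour in this sketch."""


class JstorTelescope(Workflow):
    """Illustrative subclass; carries no behaviour in this sketch."""


# Hypothetical registry of workflow classes that a config entry may name.
WORKFLOW_CLASSES: Dict[str, Type[Workflow]] = {
    "OnixWorkflow": OnixWorkflow,
    "JstorTelescope": JstorTelescope,
}

# Illustrative parsed config: one entry per workflow (DAG).
# The keys and values are assumptions, not the project's real schema.
CONFIG: List[Dict[str, Any]] = [
    {"dag_id": "jstor_example_press", "class_name": "JstorTelescope"},
    {
        "dag_id": "onix_workflow_example_press",
        "class_name": "OnixWorkflow",
        "kwargs": {"data_partners": ["jstor_country"]},
    },
]


def build_workflows(config: List[Dict[str, Any]]) -> Dict[str, Workflow]:
    """Create one workflow per config entry, keyed by its explicit DAG ID."""
    workflows: Dict[str, Workflow] = {}
    for entry in config:
        dag_id = entry["dag_id"]
        if dag_id in workflows:
            raise ValueError(f"Duplicate DAG ID in config: {dag_id}")
        workflow_class = WORKFLOW_CLASSES[entry["class_name"]]
        workflows[dag_id] = workflow_class(dag_id=dag_id, kwargs=entry.get("kwargs", {}))
    return workflows


workflows = build_workflows(CONFIG)
print(sorted(workflows))
```

Because the DAG ID comes straight from the config entry, the set of DAGs that will exist can be read off the file without tracing any naming logic.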

Onix Workflow

The Onix Workflow is by far the most complex of the workflows. At present, we have only a handful of data sources. Because every data source must be accounted for and aggregated in the Onix Workflow, any growth in our operations and data sources will quickly inflate its length and complexity. This PR makes some attempts to reduce the complexity and the amount of hard-coding, but hard-coding is still rampant throughout the workflow and more effort is needed. Since this was not the primary purpose of the refactor, I limited the time I spent on it and made no fundamental changes to the workflow. Even so, it should now be much easier to read and follow. The same applies to the Onix Workflow test, which had a large number of mock tests. I have removed these, as they appeared to test the implementation of the class functions rather than their functionality.

OaebuPartner

The oaebu_partners.py file has been moved from the workflow directory to its parent directory. The OaebuPartner class has been slightly changed and the file now contains a hard-coded list of partners. This list is used as a lookup: the partner names are supplied in the new config file and passed to the Onix Workflow upon DAG creation.
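A name-based partner lookup of this kind might be sketched as follows. The fields on the class, the partner entries and the helper name are hypothetical; the real OaebuPartner class and partner list may differ. A dictionary keyed by name is used here in place of a list, purely to keep the lookup short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OaebuPartner:
    """Illustrative partner record; the real class's fields may differ."""

    type_id: str
    bq_dataset_id: str
    bq_table_name: str


# Hypothetical hard-coded partners, keyed by the name used in the config file.
OAEBU_PARTNERS: Dict[str, OaebuPartner] = {
    "jstor_country": OaebuPartner("jstor_country", "jstor", "jstor_country"),
    "google_books_sales": OaebuPartner("google_books_sales", "google", "google_books_sales"),
}


def partners_from_names(names: List[str]) -> List[OaebuPartner]:
    """Resolve partner names from the config into partner objects."""
    unknown = [name for name in names if name not in OAEBU_PARTNERS]
    if unknown:
        raise KeyError(f"Unknown partner name(s): {', '.join(unknown)}")
    return [OAEBU_PARTNERS[name] for name in names]


partners = partners_from_names(["jstor_country"])
print(partners[0].bq_table_name)
```

Failing fast on an unknown name means a typo in the config surfaces at DAG creation rather than partway through a run.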

SQL and Table ID

Some of the SQL (Jinja templated) has been updated to accommodate the new table format. The fully qualified table ID is now passed, rather than the separate project ID, dataset and table name. This makes both the SQL templates and the Python code much cleaner and easier to read. If a template needs access to specific parts of the fully qualified table ID, these can be supplied separately.
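To make the idea concrete, the sketch below builds and splits a fully qualified table ID and passes it as a single value into a query template. The helper names and the query are hypothetical, and plain str.format stands in for the Jinja templating used by the workflows. It also assumes that none of the three components contains a dot.

```python
from typing import NamedTuple


class TableParts(NamedTuple):
    project_id: str
    dataset_id: str
    table_name: str


def make_table_id(project_id: str, dataset_id: str, table_name: str) -> str:
    """Join the three components into one fully qualified table ID."""
    return f"{project_id}.{dataset_id}.{table_name}"


def split_table_id(table_id: str) -> TableParts:
    """Recover the components when a template needs one of them separately."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected 'project.dataset.table', got: {table_id!r}")
    return TableParts(*parts)


def render_query(table_id: str) -> str:
    # str.format is a stand-in for Jinja; only one variable is needed.
    template = "SELECT * FROM `{table_id}`"
    return template.format(table_id=table_id)


table_id = make_table_id("my-project", "onix", "onix20230101")
print(render_query(table_id))
print(split_table_id(table_id).dataset_id)
```

With a single table_id variable, each template takes one parameter per table instead of three.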

Release Date Terminology

We have updated the terminology used for the release date. In most cases, we now refer to either the snapshot date or the partition date. These can be thought of as more specific cases of a release date. In some cases, we may still refer to them as a release date. We should try to be as specific as possible, but for all intents and purposes, a snapshot/partition date is a release date.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 95.87% and project coverage change: +0.61 :tada:

Comparison is base (2343f90) 94.34% compared to head (bec5419) 94.95%.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop     #142      +/-   ##
===========================================
+ Coverage    94.34%   94.95%   +0.61%
===========================================
  Files           24       15       -9
  Lines         2812     2379     -433
  Branches       363      315      -48
===========================================
- Hits          2653     2259     -394
+ Misses          75       74       -1
+ Partials        84       46      -38
```

| [Impacted Files](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory) | Coverage Δ | |
|---|---|---|
| [oaebu\_workflows/workflows/onix\_workflow.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vbml4X3dvcmtmbG93LnB5) | `96.38% <ø> (+2.81%)` | :arrow_up: |
| [oaebu\_workflows/airflow\_pools.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL2FpcmZsb3dfcG9vbHMucHk=) | `73.68% <73.68%> (ø)` | |
| [oaebu\_workflows/config.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL2NvbmZpZy5weQ==) | `89.74% <86.20%> (-10.26%)` | :arrow_down: |
| [oaebu\_workflows/workflows/onix\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vbml4X3RlbGVzY29wZS5weQ==) | `94.26% <93.24%> (+1.00%)` | :arrow_up: |
| [...aebu\_workflows/workflows/google\_books\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9nb29nbGVfYm9va3NfdGVsZXNjb3BlLnB5) | `94.26% <93.85%> (+0.05%)` | :arrow_up: |
| [...\_workflows/workflows/google\_analytics\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9nb29nbGVfYW5hbHl0aWNzX3RlbGVzY29wZS5weQ==) | `90.76% <94.73%> (+4.21%)` | :arrow_up: |
| [...ebu\_workflows/workflows/oapen\_irus\_uk\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vYXBlbl9pcnVzX3VrX3RlbGVzY29wZS5weQ==) | `97.63% <95.57%> (+1.61%)` | :arrow_up: |
| [...bu\_workflows/workflows/oapen\_metadata\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vYXBlbl9tZXRhZGF0YV90ZWxlc2NvcGUucHk=) | `95.87% <96.55%> (+2.98%)` | :arrow_up: |
| [oaebu\_workflows/workflows/fulcrum\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9mdWxjcnVtX3RlbGVzY29wZS5weQ==) | `99.17% <98.46%> (+12.09%)` | :arrow_up: |
| [oaebu\_workflows/workflows/jstor\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9qc3Rvcl90ZWxlc2NvcGUucHk=) | `94.87% <98.49%> (+4.03%)` | :arrow_up: |
| ... and [3 more](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory) | | |

... and [1 file with indirect coverage changes](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/142/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory)
