Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (c34894d) 93.88% compared to head (69af56f) 94.33%.

Files | Patch % | Lines
---|---|---
oaebu_workflows/onix_workflow/onix_workflow.py | 98.94% | 0 missing and 2 partials :warning:

:umbrella: View full report in Codecov by Sentry.
Onix Workflow Refactor
This refactor addresses a critical issue with the Onix Workflow that prevents it from scaling to additional data partners: the current data partners are hardcoded throughout the workflow.
The approach that this PR takes is to split each of the SQL queries and JSON schemas into fragments that are unique to each data partner. Since the data partners are supplied to the workflow at runtime, it can decide which fragments to include and can determine what the table schemas should look like. This approach naturally results in significantly more files. The aforementioned strategy is applied to the book product table and several of the export tables.
The data partner functionality has been extended such that each partner knows which export files it contributes to. The filenames of the component (.sql and .json) files are stored in the new DataPartnerFiles class.
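As a rough illustration, DataPartnerFiles can be thought of as a small container that maps a partner to its SQL and schema fragment filenames. The field names and the derivation convention below are assumptions for the sketch, not the actual attributes of the class:

```python
from dataclasses import dataclass, field


@dataclass
class DataPartnerFiles:
    """Holds the names of the SQL and schema fragments a data partner contributes.

    Illustrative sketch only: the real class in onix_workflow may store a
    different set of files under different attribute names.
    """

    partner_name: str
    book_product_sql: str = ""  # e.g. "book_product_body_<partner>.sql.jinja2"
    book_product_schema: str = ""  # e.g. "book_product_<partner>.json"
    export_sql: dict = field(default_factory=dict)  # export table name -> SQL fragment

    @classmethod
    def for_partner(cls, name: str) -> "DataPartnerFiles":
        # Hypothetical convention: fragment file names are derived from the partner name
        return cls(
            partner_name=name,
            book_product_sql=f"book_product_body_{name}.sql.jinja2",
            book_product_schema=f"book_product_{name}.json",
            export_sql={"book_metrics": f"book_metrics_{name}.sql.jinja2"},
        )
```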
Queries are built at runtime by using Jinja's include statement to inject the required components. The template is rendered in Python and supplied with the list of data partners that the workflow is initiated with. Since many of the table schemas are no longer predetermined, they too are generated (in Python) at runtime.
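A minimal sketch of the runtime rendering, assuming hypothetical template names, fragment paths and variable names (the workflow's actual files differ):

```python
from jinja2 import Environment, FileSystemLoader

# book_product.sql.jinja2 might contain something along these lines, pulling in
# one fragment per data partner via Jinja's include statement:
#
#   SELECT ...
#   {% for partner in data_partners %}
#   {% include partner ~ "_fragment.sql.jinja2" %}
#   {% endfor %}

env = Environment(loader=FileSystemLoader("path/to/sql/fragments"))
template = env.get_template("book_product.sql.jinja2")

# The list of data partners is supplied when the workflow is initialised
data_partners = ["google_analytics3", "jstor_country"]
rendered_sql = template.render(data_partners=data_partners)
print(rendered_sql)
```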
Workflow cleanup
The ONIX workflow is by far the most complicated workflow in the repository. This complexity is necessary, as there are many tasks to perform and each depends on the others. Some steps have been taken to make the workflow simpler and easier to work with.
Task grouping
Tasks have been grouped according to their functionality. The intermediate table creation and export table creation were prime use cases for Airflow's grouping capability.
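For illustration, a minimal sketch of the pattern using Airflow's TaskGroup; the DAG, task and group IDs below are placeholders, not the workflow's actual task names:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.utils.task_group import TaskGroup


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def onix_workflow_sketch():
    @task
    def create_intermediate_table():
        ...  # e.g. build an intermediate BigQuery table

    @task
    def create_export_table():
        ...  # e.g. build a data export table

    # Related tasks are grouped so the DAG graph stays readable
    with TaskGroup(group_id="intermediate_tables") as intermediate_group:
        create_intermediate_table()

    with TaskGroup(group_id="export_tables") as export_group:
        create_export_table()

    intermediate_group >> export_group


onix_workflow_sketch()
```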
Export table removals
Some of the data export tables have been removed as they're not being used:
Data QA removal
As per @kathrynnapier, the data QA tasks have been removed. This should alleviate some of the complexity of the workflow.
Table updates
The book product table and several of the export tables (country, author, book_metrics, subject) have updated schemas. Their schemas now reflect only the data partners that the publisher uses.
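A minimal sketch of how a table schema can be assembled at runtime from per-partner JSON fragments; the function name, file names and directory layout are assumptions for illustration:

```python
import json
from pathlib import Path


def build_schema(base_schema: Path, partner_fragments: list[Path]) -> list[dict]:
    """Concatenate the base schema fields with the fields each data partner contributes."""
    fields = json.loads(base_schema.read_text())
    for fragment in partner_fragments:
        fields.extend(json.loads(fragment.read_text()))
    return fields


# Only the partners the publisher actually uses contribute fields (paths are hypothetical)
schema = build_schema(
    Path("schema/book_product_base.json"),
    [Path("schema/book_product_google_analytics3.json")],
)
```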
Table naming
The export tables and their respective SQL files have been renamed for consistency.
Constituent table files
The data partner-specific SQL and schema files require a specific naming convention to support consistency and coherence. This was particularly difficult because, in some cases, the SQL files are further broken down into sections. I have landed on the following naming conventions:
SQL files
Schema files