Onix Workflow Refactor

This refactor addresses a critical issue with the Onix Workflow that prevents it from scaling to additional data partners: the current data partners are hardcoded throughout the workflow.

This PR splits each of the SQL queries and JSON schemas into fragments that are specific to a single data partner. Because the data partners are supplied to the workflow at runtime, the workflow can decide which fragments to include and determine what the table schemas should look like. This approach naturally results in significantly more files. The strategy is applied to the book product table and several of the export tables.

The data partner functionality has been extended such that each partner knows which export files it contributes to. The filenames of the component (.sql and .json) files are stored in the new DataPartnerFiles class.
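To make the idea concrete, here is a minimal sketch of how a class like this could record which exports a partner contributes to. It is not the class from this PR: the `files` field, the `contributes_to` method, the `partners_for_export` helper, and the partner names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataPartnerFiles:
    """Hypothetical sketch: fragment file names grouped by the export they feed."""

    partner: str
    # Maps an export table name to the fragment files this partner supplies for it.
    files: Dict[str, List[str]] = field(default_factory=dict)

    def contributes_to(self, export: str) -> bool:
        """Return True if this partner supplies at least one fragment for the export."""
        return bool(self.files.get(export))


def partners_for_export(partners: List[DataPartnerFiles], export: str) -> List[str]:
    """Names of the partners whose fragments must be included in the given export."""
    return [p.partner for p in partners if p.contributes_to(export)]


# Illustrative partners only; the real partner names are not taken from the PR.
partner_a = DataPartnerFiles(
    partner="partner_a",
    files={"book_metrics_country": ["book_metrics_country_struct_partner_a.sql"]},
)
partner_b = DataPartnerFiles(
    partner="partner_b",
    files={"book_metrics_author": ["book_metrics_author_partner_b.json"]},
)
```

With this shape, the workflow can ask each partner whether it is relevant to an export before assembling that export's query.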

Queries are built at runtime by utilising Jinja's include statement to inject the required components. The template is rendered in Python and is supplied with the list of data partners that the workflow is initiated with. Since many of the table schemas are no longer predetermined, they too are generated in Python at runtime.
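The runtime assembly described above can be sketched with Jinja's `include` statement and an in-memory loader. This is only an illustration: the fragment contents, the `struct_file` key, the `render_query` function, and the partner names are assumptions, and the real workflow would presumably load the fragments from files.

```python
from jinja2 import DictLoader, Environment

# Hypothetical fragments keyed by file name; in the workflow these would be files on disk.
FRAGMENTS = {
    "book_metrics_country_body.sql.jinja2": (
        "SELECT\n"
        "  isbn\n"
        "{% for partner in partners %}"
        "  , {% include partner.struct_file %}\n"
        "{% endfor %}"
        "FROM book_product\n"
    ),
    "book_metrics_country_struct_partner_a.sql": "STRUCT(a.views AS views) AS partner_a",
    "book_metrics_country_struct_partner_b.sql": "STRUCT(b.downloads AS downloads) AS partner_b",
}


def render_query(partners):
    """Render the body template, including one struct fragment per supplied partner."""
    env = Environment(loader=DictLoader(FRAGMENTS))
    template = env.get_template("book_metrics_country_body.sql.jinja2")
    return template.render(partners=partners)


# Only partner_a is supplied, so only its fragment is injected into the query.
partners = [{"name": "partner_a", "struct_file": "book_metrics_country_struct_partner_a.sql"}]
sql = render_query(partners)
print(sql)
```

Because the name passed to `include` is an expression, the set of injected fragments follows directly from the partner list given at render time.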

Workflow cleanup

The ONIX workflow is by far the most complicated workflow in the repository. This is a necessity, as there are many tasks that need to be performed and many of them depend on one another. Some steps have been taken to make the workflow a little simpler and easier to work with.

Task grouping

Tasks have been grouped according to their functionality. The intermediate table creation and export table creation were prime use cases for Airflow's grouping capability.

Export table removals

Some of the data export tables have been removed as they're not being used:

book_product_publisher_metrics
book_product_subject_year_metrics
book_product_year_metrics
unmatched_book_metrics

Data QA removal

As per @kathrynnapier, the data QA tasks have been removed. This should reduce some of the workflow's complexity.

Table updates

The book product table and several of the export tables (country, author, book_metrics, subject) have updated schemas. Their schemas will now reflect only the data partners that the publisher uses.
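As a rough sketch of how a schema could come to reflect only the publisher's partners, the snippet below concatenates per-partner field fragments onto a shared base. The field names, the partner names, the `build_schema` function, and the BigQuery-style field dictionaries are all assumptions made for illustration.

```python
import json
from typing import Dict, List

# Fields assumed to be common to every publisher (hypothetical).
BASE_FIELDS: List[Dict[str, str]] = [
    {"name": "ISBN13", "type": "STRING", "mode": "REQUIRED"},
]

# Hypothetical per-partner schema fragments; in the workflow these would be .json files.
PARTNER_FRAGMENTS: Dict[str, List[Dict[str, str]]] = {
    "partner_a": [{"name": "partner_a_views", "type": "INTEGER", "mode": "NULLABLE"}],
    "partner_b": [{"name": "partner_b_downloads", "type": "INTEGER", "mode": "NULLABLE"}],
}


def build_schema(partners: List[str]) -> List[Dict[str, str]]:
    """Combine the base fields with the fragments of the supplied partners, in order."""
    schema = list(BASE_FIELDS)  # copy so the shared base is never mutated
    for partner in partners:
        schema.extend(PARTNER_FRAGMENTS[partner])
    return schema


# A publisher that only uses partner_b gets no partner_a fields in its schema.
schema = build_schema(["partner_b"])
print(json.dumps(schema, indent=2))
```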

Table naming

The export tables and their respective SQL files have been renamed for consistency.

Old file names (.sql.jinja)	New file names (.sql.jinja)	Old table names (oaebu_{publisher}_{name})	New table names ({publisher}_{name})
export_book_author_metrics	book_metrics_author	book_product_author_metrics	book_metrics_author
export_book_list	book_list	book_product_list	book_list
export_book_metrics	book_metrics	book_product_metrics	book_metrics
export_book_metrics_city	book_metrics_city	book_product_metrics_city	book_metrics_city
export_book_metrics_country	book_metrics_country	book_product_metrics_country	book_metrics_country
export_book_metrics_event	book_metrics_events	book_product_metrics_events	book_metrics_events
export_book_metrics_institution	book_metrics_institution	book_product_metrics_institution	book_metrics_institution
export_book_publisher_metrics	deleted	book_product_publisher_metrics	deleted
export_book_subject_bic_metrics	book_metrics_subject_bic	book_product_subject_bic_metrics	book_metrics_subject_bic
export_book_subject_bisac_metrics	book_metrics_subject_bisac	book_product_subject_bisac_metrics	book_metrics_subject_bisac
export_book_subect_thema_metrics	book_metrics_subject_thema	book_product_subject_thema_metrics	book_metrics_subject_thema
export_book_subject_year_metrics	deleted	book_product_year_metrics	deleted
export_insitution_list	book_institution_list	institution_list	book_institution_list
export_unmatched_metrics	deleted	unmatched_book_metrics	deleted

Constituent table files

The data partner-specific SQL and schema files require a specific naming convention to keep them consistent and coherent. This was particularly difficult because the SQL files are, in some cases, broken down further into sections. I have settled on the following naming conventions:

SQL files

Purpose	File name ({name}_{partner})	Extension
country	book_metrics_country_body	.sql.jinja2
country	book_metrics_country_join	.sql
country	book_metrics_country_struct	.sql
country	book_metrics_country_null	.sql
book product	book_product_body	.sql.jinja2
book product	book_product_functions	.sql
month metrics	month_metrics_sum	.sql
month null assertion	month_null	.sql
book metrics	book_metrics	.sql

Schema files

Purpose	File name ({name}_{partner})
book product	book_product_metrics
book product	book_product_metadata
book metrics export	book_metrics
author export	book_metrics_author
country export	book_metrics_country
subject export	book_metrics_subject
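One reading of the `{name}_{partner}` convention in the tables above is that the partner identifier is appended to the fragment name, followed by the extension. The helper below sketches that reading; the `fragment_filename` function and the partner name are hypothetical, and the `.json` extension for schema files is taken from the description earlier in this PR.

```python
def fragment_filename(name: str, partner: str, extension: str = ".sql") -> str:
    """Build a fragment file name as {name}_{partner} plus the extension (assumed layout)."""
    return f"{name}_{partner}{extension}"


# SQL fragment and schema fragment for an illustrative partner.
join_file = fragment_filename("book_metrics_country_join", "partner_a")
schema_file = fragment_filename("book_metrics_country", "partner_a", ".json")
print(join_file)
print(schema_file)
```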

The-Academic-Observatory / oaebu-workflows

ONIX WF Refactor - Modular Product Tables #207