The-Academic-Observatory / oaebu-workflows

Telescopes, Workflows and Data Services for the 'Book Analytics Dashboard Project (2022-2025)', building upon the project 'Developing a Pilot Data Trust for Open Access eBook Usage (2020-2022)'
https://documentation.book-analytics.org/
Apache License 2.0

ONIX WF Refactor - Modular Product Tables #207

Closed keegansmith21 closed 8 months ago

keegansmith21 commented 10 months ago

Onix Workflow Refactor

This refactor addresses a critical issue with the Onix Workflow that prevents it from scaling to additional data partners: the current data partners are hardcoded throughout the workflow.

The approach that this PR takes is to split each of the SQL queries and JSON schemas into fragments that are unique to each data partner. Since the data partners are supplied to the workflow at runtime, it can decide which fragments to include and can determine what the table schemas should look like. This approach naturally results in significantly more files. The aforementioned strategy is applied to the book product table and several of the export tables.

The data partner functionality has been extended such that each partner knows which export files it contributes to. The filenames of the component (.sql and .json) files are stored in the new DataPartnerFiles class.
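As a rough illustration of the idea (not the actual DataPartnerFiles implementation), a partner could carry a mapping from export table to its component files. The class name `PartnerFiles`, its fields, the partner names and the file names below are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PartnerFiles:
    """Hypothetical stand-in for DataPartnerFiles: one partner's component files."""

    partner: str
    # Export table name -> SQL fragment file names this partner contributes.
    sql_files: dict = field(default_factory=dict)
    # Export table name -> JSON schema fragment file name.
    schema_files: dict = field(default_factory=dict)

    def contributes_to(self, table: str) -> bool:
        """True if this partner supplies fragments for the given export table."""
        return table in self.sql_files


def partners_for_table(partners, table):
    """Return the names of the partners that contribute to an export table."""
    return [p.partner for p in partners if p.contributes_to(table)]


# Illustrative partners only; the names and files are made up for this sketch.
jstor = PartnerFiles(
    partner="jstor",
    sql_files={"book_metrics_country": ["book_metrics_country_body_jstor.sql.jinja2"]},
    schema_files={"book_metrics_country": "book_metrics_country_jstor.json"},
)
google = PartnerFiles(
    partner="google",
    sql_files={"book_metrics": ["book_metrics_google.sql"]},
    schema_files={"book_metrics": "book_metrics_google.json"},
)
```

With this shape, `partners_for_table([jstor, google], "book_metrics_country")` selects only the partners whose fragments belong in that export table.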

Queries are built at runtime by utilising Jinja's include statement to inject the required components. The template is rendered in Python and is supplied with the list of data partners that the workflow is initiated with. Since many of the table schemas are no longer predetermined, they too are generated in Python at runtime.
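The include mechanism can be sketched as follows. This is a minimal example, assuming the jinja2 package is installed; the template names, partner names and SQL are hypothetical, and a `DictLoader` stands in for the files on disk so the sketch is self-contained.

```python
from jinja2 import DictLoader, Environment

# Hypothetical templates keyed by file name. In the PR these are separate
# files in the repository; the names and SQL here are made up.
TEMPLATES = {
    "example_body.sql.jinja2": (
        "SELECT onix.ISBN13\n"
        "{% for partner in data_partners %}"
        "{% include 'example_metrics_' ~ partner ~ '.sql' %}\n"
        "{% endfor %}"
        "FROM onix"
    ),
    "example_metrics_jstor.sql": "  , jstor.metrics AS jstor",
    "example_metrics_google.sql": "  , google.metrics AS google",
}


def render_query(data_partners):
    """Render the body template, including one fragment per data partner."""
    env = Environment(loader=DictLoader(TEMPLATES))
    template = env.get_template("example_body.sql.jinja2")
    return template.render(data_partners=data_partners)


if __name__ == "__main__":
    print(render_query(["jstor", "google"]))
```

Because the include target is built from the partner name, adding a partner only requires adding its fragment file; the body template itself does not change.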

Workflow cleanup

The ONIX workflow is by far the most complicated workflow in the repository. This is necessary because there are many tasks to perform, and the tasks depend on one another. Some steps have been taken to make the workflow a little simpler and easier to work with.

Task grouping

Tasks have been grouped according to their functionality. The intermediate table creation and export table creation were prime use cases for Airflow's grouping capability.

Export table removals

Some of the data export tables have been removed as they are not being used. They are marked as deleted in the table naming section.

Data QA removal

As per @kathrynnapier, the data QA tasks have been removed. This should reduce some of the complexity of the workflow.

Table updates

The book product table and several of the export tables (country, author, book_metrics, subject) have updated schemas. Their schemas will now reflect only the data partners that the publisher uses.
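One way to picture a schema that reflects only the selected partners is to concatenate a shared base schema with one fragment per partner. The sketch below assumes BigQuery-style JSON schema fields; the field names, partner names and the `build_schema` helper are illustrative and are not taken from the repository.

```python
import json

# Fields shared by every publisher (hypothetical).
BASE_SCHEMA = '[{"name": "ISBN13", "type": "STRING", "mode": "REQUIRED"}]'

# One schema fragment per data partner (hypothetical contents). In the PR
# these fragments live in separate .json files.
PARTNER_SCHEMAS = {
    "jstor": '[{"name": "jstor", "type": "INTEGER", "mode": "NULLABLE"}]',
    "google": '[{"name": "google", "type": "INTEGER", "mode": "NULLABLE"}]',
}


def build_schema(data_partners):
    """Return the base fields followed by the fields of each selected partner."""
    schema = json.loads(BASE_SCHEMA)
    for partner in data_partners:
        schema.extend(json.loads(PARTNER_SCHEMAS[partner]))
    return schema


if __name__ == "__main__":
    # A publisher that only uses the hypothetical "jstor" partner.
    print(json.dumps(build_schema(["jstor"]), indent=2))
```

A publisher with no `google` data therefore gets a table without the `google` field, rather than a permanently empty column.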

Table naming

The export tables and their respective SQL files have been renamed for consistency.

| Old file names (.sql.jinja) | New file names (.sql.jinja) | Old table names (oaebu{publisher}{name}) | New table names ({publisher}_{name}) |
| --- | --- | --- | --- |
| export_book_author_metrics | book_metrics_author | book_product_author_metrics | book_metrics_author |
| export_book_list | book_list | book_product_list | book_list |
| export_book_metrics | book_metrics | book_product_metrics | book_metrics |
| export_book_metrics_city | book_metrics_city | book_product_metrics_city | book_metrics_city |
| export_book_metrics_country | book_metrics_country | book_product_metrics_country | book_metrics_country |
| export_book_metrics_event | book_metrics_events | book_product_metrics_events | book_metrics_events |
| export_book_metrics_institution | book_metrics_institution | book_product_metrics_institution | book_metrics_institution |
| export_book_publisher_metrics | deleted | book_product_publisher_metrics | deleted |
| export_book_subject_bic_metrics | book_metrics_subject_bic | book_product_subject_bic_metrics | book_metrics_subject_bic |
| export_book_subject_bisac_metrics | book_metrics_subject_bisac | book_product_subject_bisac_metrics | book_metrics_subject_bisac |
| export_book_subect_thema_metrics | book_metrics_subject_thema | book_product_subject_thema_metrics | book_metrics_subject_thema |
| export_book_subject_year_metrics | deleted | book_product_year_metrics | deleted |
| export_insitution_list | book_institution_list | institution_list | book_institution_list |
| export_unmatched_metrics | deleted | unmatched_book_metrics | deleted |

Constituent table files

The data partner-specific SQL and schema files need a consistent naming convention. This was particularly difficult because, in some cases, the SQL files are further broken down into sections. I have landed on the following naming conventions:

SQL files

| Purpose | File name ({name}_{partner}) | Extension |
| --- | --- | --- |
| country | book_metrics_country_body | .sql.jinja2 |
| country | book_metrics_country_join | .sql |
| country | book_metrics_country_struct | .sql |
| country | book_metrics_country_null | .sql |
| book product | book_product_body | .sql.jinja2 |
| book product | book_product_functions | .sql |
| month metrics | month_metrics_sum | .sql |
| month null assertion | month_null | .sql |
| book metrics | book_metrics | .sql |
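To make the convention concrete, the sketch below builds the country fragment file names for one partner. It assumes that `{name}_{partner}` means the name from the table, an underscore, the partner identifier, and then the extension; the helper name and the partner `jstor` are illustrative only.

```python
# Section name -> extension, copied from the country rows of the SQL table.
COUNTRY_SECTIONS = {
    "book_metrics_country_body": ".sql.jinja2",
    "book_metrics_country_join": ".sql",
    "book_metrics_country_struct": ".sql",
    "book_metrics_country_null": ".sql",
}


def country_fragment_files(partner):
    """Build the country fragment file names for one (hypothetical) partner."""
    return [f"{name}_{partner}{ext}" for name, ext in COUNTRY_SECTIONS.items()]


if __name__ == "__main__":
    for filename in country_fragment_files("jstor"):
        print(filename)
```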

Schema files

| Purpose | File name ({name}_{partner}) |
| --- | --- |
| book product | book_product_metrics |
| book product | book_product_metadata |
| book metrics export | book_metrics |
| author export | book_metrics_author |
| country export | book_metrics_country |
| subject export | book_metrics_subject |

codecov[bot] commented 9 months ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (c34894d) 93.88% compared to head (69af56f) 94.33%.

| Files | Patch % | Lines |
| --- | --- | --- |
| oaebu_workflows/onix_workflow/onix_workflow.py | 98.94% | 0 Missing and 2 partials :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #207      +/-   ##
==========================================
+ Coverage   93.88%   94.33%   +0.45%
==========================================
  Files          15       18       +3
  Lines        2729     2771      +42
  Branches      396      399       +3
==========================================
+ Hits         2562     2614      +52
- Misses         79       80       +1
+ Partials       88       77      -11
```
