The-Academic-Observatory / academic-observatory-workflows

Telescopes, Workflows and Data Services for the Academic Observatory
https://academic-observatory-workflows.readthedocs.io
Apache License 2.0
16 stars 0 forks source link

Use more SQL for OA Dashboard Workflow #196

Closed jdddog closed 8 months ago

jdddog commented 8 months ago

This PR refactors the OA Dashboard workflow to use SQL rather than Pandas to produce the data for the dashboard, as this is a much easier and more efficient way to create the datasets and it uses less memory. Pandas was consuming too much memory.

The workflow now operates like this:

I thought that putting the files associated with each workflow in the same place made it easier to see and find the SQL templates, schemas and tests used by the workflow. I created a module for the workflow, put the code for the workflow in this module and created folders for the SQL templates and BigQuery schemas. See below for an example. What do you think @alexmassen-hane and @keegansmith21?

image

Some of the helper functions such as schema_folder and sql_folder need updating, your suggestions would be welcome.

We would need to update the other workflows as well, although that could just be done over time.

codecov[bot] commented 8 months ago

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (31bc1e2) 94.51% compared to head (b260abd) 94.17%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #196 +/- ## ========================================== - Coverage 94.51% 94.17% -0.34% ========================================== Files 22 22 Lines 5707 5411 -296 Branches 787 749 -38 ========================================== - Hits 5394 5096 -298 Misses 181 181 - Partials 132 134 +2 ``` | [Files](https://app.codecov.io/gh/The-Academic-Observatory/academic-observatory-workflows/pull/196?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory) | Coverage Δ | | |---|---|---| | [...workflows/oa\_dashboard\_workflow/institution\_ids.py](https://app.codecov.io/gh/The-Academic-Observatory/academic-observatory-workflows/pull/196?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-YWNhZGVtaWNfb2JzZXJ2YXRvcnlfd29ya2Zsb3dzL3dvcmtmbG93cy9vYV9kYXNoYm9hcmRfd29ya2Zsb3cvaW5zdGl0dXRpb25faWRzLnB5) | `100.00% <ø> (ø)` | | | [academic\_observatory\_workflows/config.py](https://app.codecov.io/gh/The-Academic-Observatory/academic-observatory-workflows/pull/196?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-YWNhZGVtaWNfb2JzZXJ2YXRvcnlfd29ya2Zsb3dzL2NvbmZpZy5weQ==) | `92.00% <90.00%> (-8.00%)` | :arrow_down: | | [academic\_observatory\_workflows/wikipedia.py](https://app.codecov.io/gh/The-Academic-Observatory/academic-observatory-workflows/pull/196?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-YWNhZGVtaWNfb2JzZXJ2YXRvcnlfd29ya2Zsb3dzL3dpa2lwZWRpYS5weQ==) | `87.50% <79.16%> (-3.27%)` | :arrow_down: | | [...ows/oa\_dashboard\_workflow/oa\_dashboard\_workflow.py](https://app.codecov.io/gh/The-Academic-Observatory/academic-observatory-workflows/pull/196?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-YWNhZGVtaWNfb2JzZXJ2YXRvcnlfd29ya2Zsb3dzL3dvcmtmbG93cy9vYV9kYXNoYm9hcmRfd29ya2Zsb3cvb2FfZGFzaGJvYXJkX3dvcmtmbG93LnB5) | `93.58% <93.58%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jdddog commented 8 months ago

Thanks Jamie. I like the change to the directory structure for the workflow. Just minor nit comments on the PR.

Perhaps we could change the schema and sql functions to something like this? Or perhaps just do a check if the workflow folder exists and explicitly use that.

def schema_folder(workflow_name: Optional[str] = None) -> str:
    """Return the path to the schema folder.

    :param workflow_name: Optional, name of the workflow. Only to be included if the schema for the workflow is in
        the directory academic_observatory_workflows.workflows.{workflow_name}.schema
    :return: the path.
    """

    if workflow_name:
        module_path = f"academic_observatory_workflows.workflows.{workflow_name}.schema"
        assert os.path.exists(
            module_file_path(module_path)
        ), f"Workflow name {workflow_name} given but schema folder within the workflow path does not exist!"
        return module_file_path(module_path)

    return module_file_path("academic_observatory_workflows.database.schema")

Thanks @alexmassen-hane, I've made those changes and updated the functions in config.py: https://github.com/The-Academic-Observatory/academic-observatory-workflows/blob/0e73b20780b1587ee62046b386c951b5432191e5/academic_observatory_workflows/config.py

Also added a function in this PR that is required: https://github.com/The-Academic-Observatory/observatory-platform/pull/647