The-Academic-Observatory / oaebu-workflows

Telescopes, Workflows and Data Services for the 'Book Analytics Dashboard Project (2022-2025)', building upon the project 'Developing a Pilot Data Trust for Open Access eBook Usage (2020-2022)'
https://documentation.book-analytics.org/
Apache License 2.0

Update of UCL discovery telescope #164

Closed keegansmith21 closed 1 year ago

keegansmith21 commented 1 year ago

UCL Discovery

PR Necessity

This PR is a major update of the UCL Discovery telescope. The update is needed because harvesting the whole of the Discovery repository's data has required manual intervention: the metadata returned by Discovery's API is incomplete. Since there is no way to retrieve the complete metadata from the API, we switch to a semi-manual approach.

Google Sheet

The metadata of each title has been, and continues to be, provided to us by email from a contact at UCL. This is how we have maintained a complete metadata collection, and it relies heavily on manual intervention. As such, we have moved our current metadata collection to a Google Sheet. This lets us both structure the data and restrict access to those who need it. Since Google Sheets has a simple interface, no substantial technical knowledge is required to enter the data (which should be done monthly). This is in contrast to alternatives such as the existing SFTP server. The telescope uses the Python library for the Google Sheets API to extract the data. It relies on a specific sheet format, so any major change to the sheet's structure will likely require an update to the telescope.
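A minimal sketch of how rows read from the sheet could be turned into records. It assumes the data arrives as a list of rows with the header first, which is the shape of the `values` field returned by the Sheets API `spreadsheets.values.get` call. The column names (`eprint_id`, `isbn13`, `title`) and the sample rows are hypothetical and not taken from the actual sheet or telescope code.

```python
from typing import Dict, List


def rows_to_records(values: List[List[str]]) -> List[Dict[str, str]]:
    """Convert a Sheets-style `values` payload (header row first) into dicts."""
    if not values:
        return []
    header = [name.strip() for name in values[0]]
    records = []
    for row in values[1:]:
        # The Sheets API omits trailing empty cells, so pad short rows
        # to the header length before pairing cells with column names.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append({key: cell.strip() for key, cell in zip(header, padded)})
    return records


# Hypothetical payload; column names and values are assumptions for illustration.
sample_values = [
    ["eprint_id", "isbn13", "title"],
    ["10001", "9781787350001", "Example Title A"],
    ["10002"],  # trailing cells missing, as the API would return them
]
records = rows_to_records(sample_values)
```

Keying each record by header name rather than column position means reordering columns in the sheet would not break the parsing, although renaming a column still would.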

Sheet Metadata

The metadata stored in the Google Sheet is minimal; only a few fields are required for the telescope to function. The most important is the relationship between UCL's proprietary identifier, the eprint ID, and the universal book identifier, ISBN13. We require the eprint ID to query the UCL Discovery usage API and collect the monthly usage data. We then map this to the correct ISBN13 for future data processing by the onix_workflow.
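The mapping step described above could look roughly like the following. This is a sketch under stated assumptions, not the telescope's implementation: the function name, the output field names, and the sample numbers are all hypothetical, and the usage figures stand in for whatever the usage API returns per eprint ID.

```python
from typing import Dict, List


def map_usage_to_isbn(
    usage_by_eprint: Dict[str, int],
    eprint_to_isbn: Dict[str, str],
    month: str,
) -> List[Dict[str, object]]:
    """Re-key monthly usage counts from eprint ID to ISBN13."""
    rows = []
    for eprint_id, downloads in usage_by_eprint.items():
        isbn13 = eprint_to_isbn.get(eprint_id)
        if isbn13 is None:
            # No ISBN13 known for this eprint ID, so it cannot be matched downstream.
            continue
        rows.append(
            {
                "isbn13": isbn13,
                "eprint_id": eprint_id,
                "release_month": month,
                "downloads": downloads,
            }
        )
    return rows


# Hypothetical inputs for illustration only.
eprint_to_isbn = {"10001": "9781787350001", "10002": "9781787350002"}
usage_by_eprint = {"10001": 42, "10002": 7, "10003": 5}
usage_rows = map_usage_to_isbn(usage_by_eprint, eprint_to_isbn, "2023-05")
```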

Caveats

It should be noted that the telescope's logic does present a possible issue. If the metadata stored in the sheet is incomplete at the time the telescope runs, the incomplete titles will be ignored and the telescope will run without collecting their data. I have elected not to raise an error, as there are some titles that we wish to store on the sheet that are not mapped to an ISBN, or that have not yet been published, and these should be ignored. If the metadata is not updated and the telescope runs without all titles, we will need to manually remove the partition from the table and rerun the DAG.
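To illustrate the skip-rather-than-fail behaviour, here is a small sketch that separates complete rows from incomplete ones and logs a warning instead of raising. The record keys (`eprint_id`, `isbn13`) and the function name are assumptions for illustration; the actual telescope may decide completeness differently.

```python
import logging
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_complete(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate rows that have both identifiers from rows that do not."""
    complete, skipped = [], []
    for record in records:
        if record.get("eprint_id") and record.get("isbn13"):
            complete.append(record)
        else:
            skipped.append(record)
    if skipped:
        # Warn rather than raise: some titles are intentionally left unmapped.
        logging.warning(
            "Skipping %d sheet rows with a missing eprint ID or ISBN13", len(skipped)
        )
    return complete, skipped


# Hypothetical sheet rows for illustration only.
sheet_rows = [
    {"eprint_id": "10001", "isbn13": "9781787350001"},
    {"eprint_id": "10002", "isbn13": ""},  # not yet mapped to an ISBN
    {"eprint_id": "", "isbn13": "9781787350003"},
]
complete_rows, skipped_rows = split_complete(sheet_rows)
```

Returning the skipped rows alongside the complete ones keeps the warning count visible in the logs, which may help spot a run that went ahead before the sheet was updated.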

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.07% :tada:

Comparison is base (48ee6e4) 95.06% compared to head (4342bb6) 95.13%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop     #164      +/-   ##
===========================================
+ Coverage    95.06%   95.13%   +0.07%
===========================================
  Files           16       16
  Lines         2409     2426      +17
  Branches       318      316       -2
===========================================
+ Hits          2290     2308      +18
+ Misses          73       72       -1
  Partials        46       46
```

| [Files Changed](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/164?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory) | Coverage Δ | |
|---|---|---|
| [oaebu\_workflows/oaebu\_partners.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/164?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL29hZWJ1X3BhcnRuZXJzLnB5) | `100.00% <ø> (ø)` | |
| [oaebu\_workflows/workflows/onix\_workflow.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/164?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vbml4X3dvcmtmbG93LnB5) | `96.60% <100.00%> (+0.23%)` | :arrow_up: |
| [...ebu\_workflows/workflows/ucl\_discovery\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/164?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy91Y2xfZGlzY292ZXJ5X3RlbGVzY29wZS5weQ==) | `100.00% <100.00%> (ø)` | |


keegansmith21 commented 1 year ago

@jdddog Please review when you've got some spare time. A few things to note: