The-Academic-Observatory / oaebu-workflows

Telescopes, Workflows and Data Services for the 'Book Analytics Dashboard Project (2022-2025)', building upon the project 'Developing a Pilot Data Trust for Open Access eBook Usage (2020-2022)'
https://documentation.book-analytics.org/
Apache License 2.0

Related Products Elevation & Normalisation #194

Closed keegansmith21 closed 11 months ago

keegansmith21 commented 1 year ago

This PR was necessitated by two related quirks of the OAPEN metadata feed:

  1. A product's main identifier is often that of the print version of the product rather than the open access version. The open access version may instead be stored as a Related Product of the main product.
  2. The Related Products are stored in a way that is malformed according to the ONIX 3.0 specification. As a result, all but one of the related products were being dropped during the parsing stage of the transformation.

To fix these two abnormalities I have implemented the following solutions:

1. Related Product Elevation

For every RelatedProduct in the metadata feed, a new product is created that uses the related product's identifier as its main identifier and replaces the RelatedProduct identifier with that of the original main product. This step technically introduces fake metadata, so it should only be applied when necessary and definitely not to all publishers.
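The elevation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes each product has already been parsed into a plain dict with a single `ProductIdentifier` and a list of `RelatedProduct` entries, and the function name `elevate_related_products` is hypothetical.

```python
import copy


def elevate_related_products(products):
    """Return the original products plus one elevated copy per related product.

    Assumed (hypothetical) record shape:
    {"ProductIdentifier": {...}, "RelatedProduct": [{"ProductIdentifier": {...}}, ...]}
    """
    elevated = []
    for product in products:
        main_id = product["ProductIdentifier"]
        for index, related in enumerate(product.get("RelatedProduct", [])):
            new_product = copy.deepcopy(product)
            # The related product's identifier becomes the main identifier.
            new_product["ProductIdentifier"] = copy.deepcopy(related["ProductIdentifier"])
            # The original main identifier takes the related product's place.
            new_product["RelatedProduct"][index]["ProductIdentifier"] = copy.deepcopy(main_id)
            elevated.append(new_product)
    # Originals are kept untouched; elevated copies are appended after them.
    return list(products) + elevated
```

Because every elevated record is synthetic, a caller would presumably run this only for the publishers or sources that need it.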

2. Related Product Normalisation

ONIX 3.0 specifies that each RelatedProduct should have only one ProductIdentifier. The exception is when several ProductIdentifier elements are otherwise identical but carry different ProductIDType values. This step moves any surplus ProductIdentifier elements into their own RelatedProduct elements so that the feed conforms to the specification. Like elevation, it should only be needed for select sources and should not be built into every ONIX transformation.
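A rough sketch of the normalisation step, under stated assumptions: each RelatedProduct is a dict whose `ProductIdentifier` is a list of dicts with `ProductIDType` and `IDValue` keys, and identifiers sharing an `IDValue` are treated as the permitted "same identifier, different type" case. The function name `normalise_related_products` is hypothetical and this is not the implementation in this PR.

```python
def normalise_related_products(related_products):
    """Split each RelatedProduct so that it carries only one distinct identifier value."""
    normalised = []
    for related in related_products:
        # Group identifiers by value, keeping first-seen order.
        # Identifiers that share an IDValue but differ in ProductIDType stay together.
        groups = {}
        for identifier in related["ProductIdentifier"]:
            groups.setdefault(identifier["IDValue"], []).append(identifier)

        for identifiers in groups.values():
            # Copy every other field (e.g. ProductRelationCode) onto each new element.
            new_related = {k: v for k, v in related.items() if k != "ProductIdentifier"}
            new_related["ProductIdentifier"] = identifiers
            normalised.append(new_related)
    return normalised
```

A RelatedProduct that already conforms passes through unchanged apart from being rebuilt as a new dict, so the step is safe to run on well-formed input.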

ONIX Transform Consolidation

With several new functions and steps added to the oapen_metadata_telescope's transform step, the transform code had become messy. I have renamed onix.py to onix_utils.py; its purpose is to hold the functions that handle metadata for the telescopes. I have also moved many of the functions that were in oapen_metadata_telescope.py into this file.

To deal with the growing number of transform steps for OAPEN metadata, I have created the OnixTransformer class (in onix_utils). The class exposes a single function intended for external use: transform. When called, it executes the transform stage according to the configuration supplied when the object was initialised. The class could also be used for the other ONIX transform steps (onix_telescope.py and thoth.py). I have not made that change because the scope of this PR is already quite extensive, but it would be good to make use of it in future.
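The general pattern (configure at initialisation, then call one public method) might look something like the sketch below. The class name `ConfigurableTransformer`, its constructor arguments, and the step registry are all hypothetical; the actual OnixTransformer signature is not shown in this description and may differ.

```python
from typing import Callable, Dict, Iterable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


class ConfigurableTransformer:
    """Hypothetical illustration: optional steps are chosen once, at construction."""

    def __init__(self, steps: Dict[str, Step], enabled: Iterable[str]):
        enabled_names = set(enabled)
        unknown = enabled_names - set(steps)
        if unknown:
            raise ValueError(f"Unknown transform steps: {sorted(unknown)}")
        # Steps run in the order they appear in the registry, not in `enabled`.
        self._pipeline = [step for name, step in steps.items() if name in enabled_names]

    def transform(self, records: List[Record]) -> List[Record]:
        """The single public entry point: run every enabled step in order."""
        for step in self._pipeline:
            records = step(records)
        return records
```

With this shape, a source-specific step such as elevation or normalisation would be switched on only for the telescopes that need it, while the other ONIX telescopes could share the same class with those steps disabled.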

codecov[bot] commented 12 months ago

Codecov Report

Attention: 51 lines in your changes are missing coverage. Please review.

Comparison is base (fe25376) 94.44% compared to head (b513765) 93.29%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #194      +/-   ##
==========================================
- Coverage   94.44%   93.29%   -1.16%
==========================================
  Files          16       16
  Lines        2378     2641     +263
  Branches      311      376      +65
==========================================
+ Hits         2246     2464     +218
- Misses         71       95      +24
- Partials       61       82      +21
```

| [Files](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/194?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory) | Coverage Δ | |
|---|---|---|
| [oaebu\_workflows/workflows/onix\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/194?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vbml4X3RlbGVzY29wZS5weQ==) | `93.22% <100.00%> (ø)` | |
| [oaebu\_workflows/workflows/thoth\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/194?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy90aG90aF90ZWxlc2NvcGUucHk=) | `97.67% <100.00%> (ø)` | |
| [...bu\_workflows/workflows/oapen\_metadata\_telescope.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/194?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL3dvcmtmbG93cy9vYXBlbl9tZXRhZGF0YV90ZWxlc2NvcGUucHk=) | `98.00% <84.61%> (+0.87%)` | :arrow_up: |
| [oaebu\_workflows/onix\_utils.py](https://app.codecov.io/gh/The-Academic-Observatory/oaebu-workflows/pull/194?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The-Academic-Observatory#diff-b2FlYnVfd29ya2Zsb3dzL29uaXhfdXRpbHMucHk=) | `87.90% <87.90%> (ø)` | |
