cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Retrieve sub providers within Smithsonian #455

Closed ChariniNana closed 4 years ago

ChariniNana commented 4 years ago

Fixes

Fixes #454 by @ChariniNana, Related to #392, Related to #451

Description

This PR addresses the requirement of retrieving all sub-providers within Smithsonian. The requirement has two aspects:

  1. Retrieve sub-providers at the API level, as and when data is pulled from the Smithsonian API.
  2. Update the existing Smithsonian-related information in the database to reflect the sub-provider information.

Technical details

The 'unit_code' field of the Smithsonian API response identifies each sub-provider uniquely. We maintain a mapping from each sub-provider name to its 'unit_code' value(s) to support sub-provider retrieval. The 'unit_code' value is stored as metadata in the image store.
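As a rough illustration of the mapping described above, the sketch below inverts a name-to-codes mapping into a direct 'unit_code' lookup. The sub-provider names, unit codes, and the helper `build_unit_code_index` are hypothetical placeholders, not the mapping or functions in this PR.

```python
# Hypothetical mapping: sub-provider name -> set of 'unit_code' values.
# Names and codes are illustrative placeholders, not the project's real data.
SUB_PROVIDERS = {
    "smithsonian_example_museum_a": {"UNIT_A"},
    "smithsonian_example_museum_b": {"UNIT_B1", "UNIT_B2"},
}


def build_unit_code_index(sub_providers):
    """Invert the mapping so each unit code points at exactly one sub-provider."""
    index = {}
    for name, codes in sub_providers.items():
        for code in codes:
            if code in index:
                # A unit code shared by two sub-providers would break uniqueness.
                raise ValueError(
                    f"unit_code {code!r} is mapped to both "
                    f"{index[code]!r} and {name!r}"
                )
            index[code] = name
    return index


UNIT_CODE_INDEX = build_unit_code_index(SUB_PROVIDERS)
```

Inverting the mapping once keeps the per-image lookup a single dictionary access, and the duplicate check surfaces a unit code that is accidentally listed under two sub-providers.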

Since we need to categorise every image under a unique sub-provider, we expect the 'unit_code' value of each image to correspond to some sub-provider in our mapping. If we encounter an unknown 'unit_code', we throw an error and terminate program execution. Because the 'unit_code' values supported by Smithsonian can change over time, we need a mechanism for frequently checking whether our known set of unit codes is up to date. With such a mechanism, we can update the mapping of unit codes to sub-providers before executing Smithsonian sub-provider retrieval, and so avoid raising errors. This is tracked in a separate ticket, #451.
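The check tracked in #451 is not part of this PR. Purely as a sketch of the idea, a pre-flight comparison could look like the following, assuming the currently observed unit codes have already been collected somehow. The function names and the sample codes are hypothetical.

```python
def find_unknown_unit_codes(known_unit_codes, observed_unit_codes):
    """Return the observed unit codes that are missing from the known set, sorted."""
    return sorted(set(observed_unit_codes) - set(known_unit_codes))


def assert_mapping_is_current(known_unit_codes, observed_unit_codes):
    """Raise before retrieval starts if any observed unit code is unmapped."""
    unknown = find_unknown_unit_codes(known_unit_codes, observed_unit_codes)
    if unknown:
        raise ValueError(f"Unknown unit_code value(s): {', '.join(unknown)}")


# Placeholder values for illustration only.
known = {"UNIT_A", "UNIT_B"}
observed = ["UNIT_A", "UNIT_C", "UNIT_B", "UNIT_C"]
unknown_codes = find_unknown_unit_codes(known, observed)
```

Running such a check before retrieval reports every missing code at once, rather than failing on the first unknown code partway through a run.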

  1. At the API script level, when an image is processed, we look up the sub-provider corresponding to its 'unit_code' value and set the source field in the Image Store to that sub-provider. If the 'unit_code' is unknown, we throw an error.
  2. At the DB level, we first execute a select query to retrieve the foreign identifier and the 'unit_code' value for every Smithsonian image whose source value has not yet been updated. Next, we process the output row by row, and if the 'unit_code' value is known, we set that row's source value in the DB to the relevant sub-provider. If the 'unit_code' value is unknown, we throw an error.
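The two steps above can be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database. The table layout, the plain `unit_code` column (the PR stores it in the image metadata), the placeholder unit codes, and the function names are all assumptions, not the project's schema or code.

```python
import sqlite3

# Hypothetical unit_code -> sub-provider lookup; values are placeholders.
UNIT_CODE_TO_SUB_PROVIDER = {
    "UNIT_A": "smithsonian_example_museum_a",
    "UNIT_B": "smithsonian_example_museum_b",
}


def get_sub_provider(unit_code):
    """Step 1: resolve a unit code to its sub-provider, raising if unknown."""
    if unit_code not in UNIT_CODE_TO_SUB_PROVIDER:
        raise ValueError(f"Unknown unit_code: {unit_code!r}")
    return UNIT_CODE_TO_SUB_PROVIDER[unit_code]


def update_sub_providers(conn, default_source="smithsonian"):
    """Step 2: set source to the sub-provider for rows not yet updated."""
    rows = conn.execute(
        "SELECT foreign_identifier, unit_code FROM image "
        "WHERE provider = ? AND source = ?",
        ("smithsonian", default_source),
    ).fetchall()
    for foreign_id, unit_code in rows:
        sub_provider = get_sub_provider(unit_code)  # raises on an unknown code
        conn.execute(
            "UPDATE image SET source = ? "
            "WHERE provider = ? AND foreign_identifier = ?",
            (sub_provider, "smithsonian", foreign_id),
        )
    conn.commit()
    return len(rows)


def demo():
    """Build a tiny illustrative table, run the update, and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (foreign_identifier TEXT, provider TEXT, "
        "source TEXT, unit_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?, ?)",
        [
            ("id-1", "smithsonian", "smithsonian", "UNIT_A"),
            ("id-2", "smithsonian", "smithsonian", "UNIT_B"),
            ("id-3", "other_provider", "other_provider", "UNIT_A"),
        ],
    )
    update_sub_providers(conn)
    return conn.execute(
        "SELECT foreign_identifier, source FROM image ORDER BY foreign_identifier"
    ).fetchall()
```

In this sketch, rows from other providers are left untouched, and rows whose source already names a sub-provider are skipped by the select, so rerunning the update does not rewrite them.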

The workflow smithsonian_sub_provider_update_workflow triggers the DB update for Smithsonian sub-provider retrieval.

Tests

  1. API script level sub-provider retrieval: The function test_process_image_data_with_sub_provider within the test_smithsonian test suite checks whether the source is set properly when a sub-provider from our mapping is encountered.
  2. DB level sub-provider update: The function test_update_smithsonian_sub_providers within test_sql checks that the image table is updated successfully.
  3. Workflow for the DB sub-provider update: The function test_smithsonian_dag_loads_with_no_errors within the test_sub_provider_update_workflow test suite covers the workflow created for the DB update.

Checklist

- [x] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [x] My pull request targets the `master` branch of the repository.
- [x] My commit messages follow [best practices][best_practices].
- [x] My code follows the established code style of the repository.
- [x] I added tests for the changes I made (if applicable).
- [ ] ~~I added or updated documentation (if applicable).~~
- [x] I tried running the project locally and verified that there are no visible errors.

[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

## Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```