cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Smithsonian discrepancy fix - improve creator and description availability #476

Closed ChariniNana closed 4 years ago

ChariniNana commented 4 years ago

Fixes

Related to #397 by @annatuma

Description

With this fix, we reduce the number of missing creators and descriptions for Smithsonian data.

Technical details

The contents of the fields available at the freetext -> name path were analysed to decide which of them could be used as a source for the creator value. The CREATOR_TYPES dictionary (in the Smithsonian script) was extended with the selected fields, which improves the completeness of the creator field for Smithsonian data. Unlike the previous implementation, we now concatenate all 'creator' values with equal importance (as indicated in the CREATOR_TYPES dictionary) to obtain the full creator value to be stored.
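The concatenation step can be sketched roughly as follows. This is a minimal illustration, not the code from the Smithsonian script: the labels and ranks in `CREATOR_TYPES`, the `label`/`content` keys of each entry, the `content` wrapper around `freetext`, and the function name `extract_creator` are all assumptions. It also takes one possible reading of "equal importance": among the entries found, keep those sharing the best (lowest) rank and join them.

```python
from typing import Any, Dict, Optional

# Hypothetical labels and ranks (lower rank = more important).
# The real CREATOR_TYPES dictionary in the Smithsonian script differs.
CREATOR_TYPES: Dict[str, int] = {
    "artist": 0,
    "maker": 0,
    "photographer": 1,
    "collector": 2,
}


def extract_creator(
    row: Dict[str, Any],
    creator_types: Dict[str, int] = CREATOR_TYPES,
) -> Optional[str]:
    """Join all creator names that share the best rank found in the row."""
    # Assumed record layout: row["content"]["freetext"]["name"] is a list of
    # {"label": ..., "content": ...} dictionaries.
    names = ((row.get("content") or {}).get("freetext") or {}).get("name") or []
    best_rank = None
    selected = []
    for entry in names:
        if not isinstance(entry, dict):
            continue
        label = (entry.get("label") or "").strip().lower()
        content = (entry.get("content") or "").strip()
        rank = creator_types.get(label)
        if rank is None or not content:
            continue  # unknown label or empty value
        if best_rank is None or rank < best_rank:
            # A more important label was found: restart the selection.
            best_rank = rank
            selected = [content]
        elif rank == best_rank and content not in selected:
            # Same importance: concatenate rather than keep only the first.
            selected.append(content)
    return ", ".join(selected) or None


sample_row = {
    "content": {
        "freetext": {
            "name": [
                {"label": "Photographer", "content": "C"},
                {"label": "Artist", "content": "A"},
                {"label": "Maker", "content": "B"},
                {"label": "Donor", "content": "D"},
            ]
        }
    }
}
print(extract_creator(sample_row))  # A, B
```

Here "Artist" and "Maker" share the best rank, so both values are kept, while the lower-ranked "Photographer" and the unlisted "Donor" are dropped.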

Similarly, the contents of the fields available at the freetext -> notes path were analysed to decide which of them could be used as a source for the description value. The DESCRIPTION_TYPES set (in the Smithsonian script) was extended with the selected fields, which improves the completeness of the descriptions (stored within the meta data field) for Smithsonian data.
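A comparable sketch for descriptions, under the same caveats: the members of `DESCRIPTION_TYPES`, the `label`/`content` keys, the `content` wrapper around `freetext`, and both function names are hypothetical and only illustrate filtering notes by label and joining the matches.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical labels; the real DESCRIPTION_TYPES set in the
# Smithsonian script contains different entries.
DESCRIPTION_TYPES: Set[str] = {"description", "summary", "caption"}


def extract_description(
    row: Dict[str, Any],
    description_types: Set[str] = DESCRIPTION_TYPES,
) -> Optional[str]:
    """Join the text of every note whose label is an accepted description type."""
    # Assumed record layout: row["content"]["freetext"]["notes"] is a list of
    # {"label": ..., "content": ...} dictionaries.
    notes = ((row.get("content") or {}).get("freetext") or {}).get("notes") or []
    parts = []
    for entry in notes:
        if not isinstance(entry, dict):
            continue
        label = (entry.get("label") or "").strip().lower()
        content = (entry.get("content") or "").strip()
        if label in description_types and content:
            parts.append(content)
    return " ".join(parts) or None


def build_meta_data(row: Dict[str, Any]) -> Dict[str, str]:
    """Place the description inside a metadata dictionary, omitting it if absent."""
    description = extract_description(row)
    return {"description": description} if description else {}
```

Returning an empty dictionary when nothing matches keeps "no description" distinguishable from an empty string when counting missing values later.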

Checklist

- [x] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [x] My pull request targets the *default* branch of the repository (`main` or `master`).
- [x] My commit messages follow [best practices][best_practices].
- [x] My code follows the established code style of the repository.
- [ ] ~~I added tests for the changes I made (if applicable).~~
- [ ] ~~I added or updated documentation (if applicable).~~
- [x] I tried running the project locally and verified that there are no visible errors.

[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

## Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```

ChariniNana commented 4 years ago

The numbers of missing creator and description metadata look as follows after this implementation:

                         Sub provider | No Creator | Total Images | Missing Percentage
si_national_museum_of_natural_history |     149955 |      3325259 |  4.509573539985908
               si_american_art_museum |         22 |        11570 | 0.19014693171996544
                  si_anacostia_museum |        322 |          571 |   56.3922942206655
                         si_libraries |          0 |           55 |                0.0
              si_cooper_hewitt_museum |      35433 |        65686 |  53.94300155284231
                           si_gardens |        669 |          689 |  97.09724238026125
                     si_postal_museum |       2900 |         2951 |  98.27177228058285
                  si_portrait_gallery |         65 |        12001 | 0.5416215315390384
           si_american_history_museum |        254 |         2290 | 11.091703056768559
   si_african_american_history_museum |        680 |         7544 |  9.013785790031813
              si_freer_gallery_of_art |       2932 |         3877 |  75.62548362135672
              si_air_and_space_museum |        255 |         2501 | 10.195921631347462
                si_african_art_museum |          3 |          136 | 2.2058823529411766
            si_american_indian_museum |        168 |          248 |  67.74193548387096
                  si_hirshhorn_museum |          1 |          423 | 0.2364066193853428

                         Sub provider | No Description | Total Images | Missing Percentage
si_national_museum_of_natural_history |        3224038 |      3325259 |  96.95599651034702
               si_american_art_museum |           9767 |        11570 |  84.41659464131374
                  si_anacostia_museum |            501 |          571 |  87.74080560420315
                         si_libraries |              0 |           55 |                0.0
              si_cooper_hewitt_museum |           4168 |        65686 |  6.345339950674421
                           si_gardens |              0 |          689 |                0.0
                     si_postal_museum |              2 |         2951 | 0.06777363605557438
                  si_portrait_gallery |          11466 |        12001 |  95.54203816348638
           si_american_history_museum |            215 |         2290 |  9.388646288209607
   si_african_american_history_museum |              0 |         7544 |                0.0
              si_freer_gallery_of_art |           3877 |         3877 |              100.0
              si_air_and_space_museum |            319 |         2501 | 12.754898040783686
                si_african_art_museum |              1 |          136 | 0.7352941176470589
            si_american_indian_museum |            248 |          248 |              100.0
                  si_hirshhorn_museum |            423 |          423 |              100.0
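The Missing Percentage column in both tables is consistent with the missing count divided by the total number of images, times 100. How the figures were actually produced is not stated here, so the helper below (with the hypothetical name `missing_percentage`) is only a sketch for re-checking individual rows.

```python
from typing import Optional


def missing_percentage(missing: int, total: int) -> Optional[float]:
    """Return the share of records with a missing value, as a percentage."""
    if total <= 0:
        return None  # avoid dividing by zero for an empty sub provider
    return 100 * missing / total


# Rows taken from the creator table above.
print(f"{missing_percentage(322, 571):.2f}")      # 56.39 (si_anacostia_museum)
print(f"{missing_percentage(35433, 65686):.2f}")  # 53.94 (si_cooper_hewitt_museum)
print(f"{missing_percentage(0, 55):.2f}")         # 0.00 (si_libraries)
```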

The improvements for creator values are-

The improvements for description values are-