cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Check Wikimedia Commons objects for mediatype before storing #453

Closed mathemancer closed 4 years ago

mathemancer commented 4 years ago

Fixes

Related To #438 by @aldenstpage

Description

Most of the problems described in #438 arose because we were not checking the media types of objects from Wikimedia Commons when retrieving their metadata. This PR changes the metadata request so that it also returns each object's mediatype, and uses that value to decide whether to store metadata about the object.
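As a rough illustration of the check described above, here is a minimal sketch, not the code in this PR. It assumes the request asks the MediaWiki API for `prop=imageinfo` with `mediatype` included in `iiprop`, and that each returned image-info record carries a `mediatype` string. The names `IMAGE_MEDIATYPES`, `SKETCH_QUERY_PARAMS`, `has_image_mediatype`, and `select_image_infos` are hypothetical, and the set of accepted media types is an assumption.

```python
# Media types treated as images in this sketch (an assumption,
# not necessarily the set used by the PR).
IMAGE_MEDIATYPES = frozenset({"BITMAP", "DRAWING"})

# Assumed query parameters asking the API to include the mediatype
# alongside the URL for each object.
SKETCH_QUERY_PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url|mediatype",
}


def has_image_mediatype(image_info, image_mediatypes=IMAGE_MEDIATYPES):
    """Return True only if the record reports an accepted image mediatype."""
    # A missing mediatype is treated as "not an image" and is skipped.
    return image_info.get("mediatype") in image_mediatypes


def select_image_infos(image_infos):
    """Keep only the records whose metadata should be stored as images."""
    return [info for info in image_infos if has_image_mediatype(info)]
```

With this approach, a record such as `{"url": "b.ogg", "mediatype": "AUDIO"}` would be dropped before anything is written to the image table.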

There are also two minor cleanup changes included:

  1. Moved logging initialization so that it is skipped when wikimedia_commons is imported rather than run as a script.
  2. Changed the script to use the new total_images property of the ImageStore class, which makes the script easier to read.
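The two cleanup changes can be pictured with the following sketch. It is an illustration under stated assumptions, not the project's code: `SketchImageStore`, its `add` method, and `main` are hypothetical stand-ins, and only the `total_images` property name comes from the description above.

```python
import logging

# Creating a module-level logger is safe at import time; it does not
# configure handlers or levels.
logger = logging.getLogger(__name__)


class SketchImageStore:
    """Hypothetical stand-in for the project's ImageStore class."""

    def __init__(self):
        self._images = []

    def add(self, image_metadata):
        # Hypothetical method: the real class has its own interface.
        self._images.append(image_metadata)

    @property
    def total_images(self):
        # Read-only count of the items collected so far.
        return len(self._images)


def main(records):
    # Logging is configured here, so importing this module elsewhere
    # does not change the importer's logging setup.
    logging.basicConfig(level=logging.INFO)
    store = SketchImageStore()
    for record in records:
        store.add(record)
    logger.info("Total images: %s", store.total_images)
    return store.total_images


if __name__ == "__main__":
    main([{"foreign_id": "example"}])
```

Exposing the count as a property lets the script read `store.total_images` directly instead of tracking a running total itself.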

We cannot call #438 solved until we have:

  1. Cleaned the DB after this PR is merged and deployed
  2. Come up with a more robust, general way to try to keep non-image objects' metadata out of the image table.

Technical details

Tests

There are new tests covering the functionality.

Also, the reviewer may (should they so desire) use the README to set up the development environment, and run

python dags/provider_api_scripts/wikimedia_commons.py --date 2015-03-27

After running that, you shouldn't see any non-image objects in the local PostgreSQL DB. Running the same command from master, by contrast, puts metadata about a number of .ogg (audio) files into the image table of the local PostgreSQL DB.

Checklist

- [X] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [X] My pull request targets the `master` branch of the repository.
- [X] My commit messages follow [best practices][best_practices].
- [X] My code follows the established code style of the repository.
- [X] I added tests for the changes I made (if applicable).
- [ ] ~I added or updated documentation (if applicable).~
- [X] I tried running the project locally and verified that there are no visible errors.

[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```