cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Verify and repair URLs in ImageStore class #464

Closed mathemancer closed 4 years ago

mathemancer commented 4 years ago

Fixes

Related to #373 by @mathemancer

Description

Many of the problems with data from Common Crawl are related to faulty license URLs or license pairs ending up in our DB. This PR adds functionality to check and repair license URLs, then derive license pairs from the repaired URLs. It also adds functionality to validate license pairs by ensuring that they can be mapped to a unique, valid license URL. Once this PR is merged, we will always store a license URL that is valid at the time of storage, along with the associated (license, license_version) pair.
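To illustrate the two directions described above (URL to pair, and pair back to a unique URL), here is a minimal sketch. It is not the code from this PR: the function names, the tiny `KNOWN_LICENSE_URLS` table, and the `("cc0", "1.0")` naming are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny table of valid license URLs and their pairs.
KNOWN_LICENSE_URLS = {
    "https://creativecommons.org/licenses/by/4.0/": ("by", "4.0"),
    "https://creativecommons.org/licenses/by-sa/4.0/": ("by-sa", "4.0"),
    "https://creativecommons.org/publicdomain/zero/1.0/": ("cc0", "1.0"),
}


def normalize_license_url(url):
    """Repair common defects: plain http, 'www.' prefix, mixed case, missing trailing slash."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parsed.path.lower()
    if not path.endswith("/"):
        path += "/"
    return f"https://{host}{path}"


def derive_license_pair(url):
    """Return the (license, license_version) pair for a URL, or None if it is unknown."""
    return KNOWN_LICENSE_URLS.get(normalize_license_url(url))


def license_url_for_pair(license_, license_version):
    """Return the license URL for a pair only if exactly one known URL maps to it."""
    matches = [
        known_url
        for known_url, pair in KNOWN_LICENSE_URLS.items()
        if pair == (license_, license_version)
    ]
    return matches[0] if len(matches) == 1 else None
```

In this sketch, a pair that maps to no URL, or to more than one, is rejected instead of being stored.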

This PR also verifies all URLs that the ImageStore class stores (less stringently than the license URLs) and upgrades them to use TLS whenever possible.

Technical details

License URLs are verified by actually requesting them and storing the URL to which the request is ultimately redirected. The hope is that the end of the redirect chain is the 'canonical' license URL for a given license.
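The redirect-following idea can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's implementation: `fetch_final_url` and `resolve_license_url` are hypothetical names, and the fetcher is injectable so the logic can be exercised without network access.

```python
import urllib.request
from urllib.error import URLError


def fetch_final_url(url, timeout=10):
    """Request the URL, following redirects, and return the URL that finally answered."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def resolve_license_url(url, fetch=fetch_final_url):
    """Return the end of the redirect chain for url, or None if the request fails."""
    try:
        return fetch(url)
    except (URLError, ValueError, OSError):
        # Unreachable host, malformed URL, timeout, and so on: treat as unverifiable.
        return None
```

Passing a fake `fetch` (for example, a dictionary lookup) lets a test pin down how the result of the redirect chain is handled without calling a live server.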

TLS support is verified at subdomain granularity, that is, each subdomain is checked separately.
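One way to picture per-subdomain verification is a cache keyed by hostname, so that each subdomain is probed once and the answer is reused for every URL on it. The sketch below is an assumption about how such a check could look; `probe_https` and `TLSUpgrader` are hypothetical names and do not come from this PR.

```python
import urllib.request
from urllib.error import HTTPError
from urllib.parse import urlparse, urlunparse


def probe_https(host, timeout=5):
    """Return True if an https request to the host completes a TLS connection."""
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout):
            return True
    except HTTPError:
        # The server answered with an error status, so the TLS connection itself worked.
        return True
    except (OSError, ValueError):
        # Connection refused, certificate failure, timeout, malformed host, and so on.
        return False


class TLSUpgrader:
    """Upgrade http URLs to https, probing each hostname (subdomain) at most once."""

    def __init__(self, supports_tls=probe_https):
        self._supports_tls = supports_tls  # callable: hostname -> bool
        self._cache = {}  # hostname -> bool

    def upgrade(self, url):
        parsed = urlparse(url)
        if parsed.scheme != "http":
            return url  # already https, or a scheme this sketch does not handle
        host = parsed.netloc.lower()
        if host not in self._cache:
            self._cache[host] = self._supports_tls(host)
        if self._cache[host]:
            return urlunparse(parsed._replace(scheme="https"))
        return url
```

Because the cache key is the full hostname, `images.example.com` and `static.example.com` are checked independently, which matches the granularity described above.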

Tests

There are new tests to cover the new functionality. Also, the reviewer is welcome to use the README as always to see the new machinery in action.

Checklist

- [X] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [X] My pull request targets the *default* branch of the repository (`main` or `master`).
- [X] My commit messages follow [best practices][best_practices].
- [X] My code follows the established code style of the repository.
- [X] I added tests for the changes I made (if applicable).
- [X] I added or updated documentation (if applicable).
- [X] I tried running the project locally and verified that there are no visible errors.

[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```