Many of the problems with data from Common Crawl stem from faulty license URLs or license pairs ending up in our DB. This PR adds functionality to check and repair license URLs, and then to derive license pairs from the repaired URLs. It also validates license pairs by ensuring that each one maps to a unique, valid license URL. Once this PR is merged, we will always store a license URL that was valid at the time of storage, along with the associated `(license, license_version)` pair.
This PR also verifies all URLs that the `ImageStore` class stores (less stringently than the license URLs) and upgrades them to use TLS whenever possible.
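To illustrate the relationship between a license URL and its license pair, here is a minimal sketch. It assumes Creative Commons URLs of the form `/licenses/<license>/<version>/`; the function names `derive_license_pair` and `build_license_url` are hypothetical and are not taken from this PR, and public-domain tools and jurisdiction ports are deliberately left out.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def derive_license_pair(license_url: str) -> Optional[Tuple[str, str]]:
    """Return (license, license_version) for a CC license URL, else None.

    Hypothetical helper: only handles paths shaped like
    /licenses/<license>/<version>/ on creativecommons.org.
    """
    parts = urlsplit(license_url)
    if parts.hostname != "creativecommons.org":
        return None
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 3 or segments[0] != "licenses":
        return None
    return segments[1].lower(), segments[2]


def build_license_url(license_: str, license_version: str) -> str:
    """Map a license pair back to a single URL (assumed URL layout)."""
    return f"https://creativecommons.org/licenses/{license_}/{license_version}/"
```

Under these assumptions, a pair can be treated as valid when `derive_license_pair(build_license_url(*pair))` returns the same pair, which is one way to express "maps to a unique license URL".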
## Technical details
License URLs are verified by actually requesting each URL and storing the URL to which the request is ultimately redirected. The hope is that the end of the redirect chain is the 'canonical' license URL for a given license.
TLS support is verified at subdomain granularity.
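The redirect-following idea can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_license_url` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to return the HTTP status code and the `Location` header (if any) for a single request without following redirects itself.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher type: url -> (status_code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def resolve_license_url(url: str, fetch: Fetch, max_hops: int = 10) -> Optional[str]:
    """Follow redirects from `url` and return the final URL.

    Returns None if the chain loops, exceeds `max_hops`, or ends in a
    non-success status.
    """
    seen = set()
    current = url
    for _ in range(max_hops):
        if current in seen:
            return None  # redirect loop
        seen.add(current)
        status, location = fetch(current)
        if 200 <= status < 300:
            return current  # end of the chain: candidate canonical URL
        if 300 <= status < 400 and location:
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, location)
            continue
        return None  # client or server error, or redirect without Location
    return None
```

Injecting `fetch` keeps the sketch testable without network access. With the `requests` library, a comparable result should be obtainable from `requests.head(url, allow_redirects=True).url`, at the cost of less control over loops and error handling.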
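One way to read "subdomain granularity" is that TLS support is checked once per hostname and the result is reused for every URL on that hostname. The sketch below assumes that reading; the class `TlsUpgrader` and the injected `supports_tls` callable are hypothetical and do not come from this PR.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit, urlunsplit


class TlsUpgrader:
    """Upgrade http:// URLs to https:// when their hostname supports TLS."""

    def __init__(self, supports_tls: Callable[[str], bool]):
        # `supports_tls` is a hypothetical probe: hostname -> bool.
        self._supports_tls = supports_tls
        self._cache: Dict[str, bool] = {}

    def upgrade(self, url: str) -> str:
        parts = urlsplit(url)
        # Leave non-http URLs alone; also skip explicit ports, since the
        # TLS port for such a host cannot be inferred from the URL.
        if parts.scheme != "http" or not parts.hostname or parts.port is not None:
            return url
        host = parts.hostname
        if host not in self._cache:
            # Probe each (sub)domain only once and remember the answer.
            self._cache[host] = self._supports_tls(host)
        if self._cache[host]:
            return urlunsplit(parts._replace(scheme="https"))
        return url
```

Because results are cached per hostname, `images.example.org` and `www.example.org` are probed separately, while many URLs on the same subdomain cost a single probe.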
## Tests
New tests cover the new functionality. As always, the reviewer is also welcome to follow the README to see the new machinery in action.
## Checklist
- [X] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [X] My pull request targets the *default* branch of the repository (`main` or `master`).
- [X] My commit messages follow [best practices][best_practices].
- [X] My code follows the established code style of the repository.
- [X] I added tests for the changes I made (if applicable).
- [X] I added or updated documentation (if applicable).
- [X] I tried running the project locally and verified that there are no visible errors.
[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53
## Developer Certificate of Origin
```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```
## Fixes
Related to #373 by @mathemancer