Open obulat opened 1 year ago
API Developer Docs Preview: Ready
https://wordpress.github.io/openverse-api/_preview/1126
Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.
You can check the GitHub pages deployment action list to see the current status of the deployments.
Signed-off-by: Olga Bulat obulat@gmail.com
Fixes
Fixes #[issue number] by @[issue author]
Description
A proof of concept for saving the cleaned data during the weekly data refresh, as a preparation step for data normalization.
Data refresh image cleanup steps:
- Add the `http` or `https` protocol to URLs that don't have a scheme in the "url", "creator_url", and "foreign_landing_url" fields.
- Remove tags (`"provider": "clarifai"`) with a `confidence` level below `TAG_MIN_CONFIDENCE = 0.90`.

This PR also adds a Wikimedia title cleanup step that removes the `File:` prefix and the file extension suffix from the image title. This step was added because the Openverse Inserter PR specifically pointed out that those titles are bad for UX.

There is also a step that we need to add to the cleanup process for incorrect utf-8 tags, but I think we should add it in a later refresh (gist with the implementation) so that the cleanup step does not become much longer.
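For reviewers, here is a rough sketch of the three cleanup steps described above. It is not the code in this PR: the function names, the `tls` flag, the `KNOWN_EXTENSIONS` set, and the assumption that tags are dicts with a `confidence` key are all hypothetical and only illustrate the intent.

```python
TAG_MIN_CONFIDENCE = 0.90  # threshold quoted in this description

# Hypothetical list of suffixes treated as file extensions in titles.
KNOWN_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg", "tif", "tiff", "webp"}


def add_scheme(url: str, tls: bool = True) -> str:
    """Prefix a URL that has no scheme with https:// (or http://)."""
    if url.startswith(("http://", "https://")):
        return url
    scheme = "https" if tls else "http"
    # Also handles protocol-relative URLs such as "//example.com/a".
    return f"{scheme}://{url.lstrip('/')}"


def drop_low_confidence_tags(tags: list[dict]) -> list[dict]:
    """Keep tags whose confidence is absent or at least TAG_MIN_CONFIDENCE."""
    kept = []
    for tag in tags:
        confidence = tag.get("confidence")  # assumed key name
        if confidence is not None and confidence < TAG_MIN_CONFIDENCE:
            continue
        kept.append(tag)
    return kept


def clean_wikimedia_title(title: str) -> str:
    """Strip a leading "File:" and a trailing known file extension."""
    if title.startswith("File:"):
        title = title[len("File:"):]
    stem, dot, ext = title.rpartition(".")
    if dot and stem and ext.lower() in KNOWN_EXTENSIONS:
        return stem
    return title
```

Checking the suffix against a fixed set avoids truncating titles that merely contain a dot, at the cost of missing extensions that are not listed.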
This PR saves one file per cleaned field in a tsv format. The files contain the image identifier and the cleaned data. I don't know where the best place to save them is.
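To make the file layout concrete, a minimal sketch of writing one TSV per cleaned field could look like the following. The function name, the `<field>.tsv` naming scheme, the header row, and the output directory are assumptions for illustration, not necessarily what this PR does.

```python
import csv
from pathlib import Path


def save_cleaned_data(cleaned: dict, out_dir: str) -> dict:
    """Write one TSV per field.

    `cleaned` maps a field name to an iterable of
    (identifier, cleaned_value) pairs.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = {}
    for field, rows in cleaned.items():
        path = out_path / f"{field}.tsv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(["identifier", field])
            writer.writerows(rows)
        written[field] = path
    return written
```

Going through `csv.writer` rather than joining strings by hand means values that contain tabs or newlines are quoted instead of corrupting the row.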
Testing Instructions
Rename `sample_data/sample_images_to_clean.csv` to `sample_data/sample_images.csv` and run `just recreate` (or `just start` -> `just init`, if you haven't run the API before). You should see the `tsv` files recreated, and logging about the cleaned fields.
Checklist
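- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.

[best_practices]: https://git-scm.com/book/en/v2/Distributed-Git-Contributing-to-a-Project#_commit_guidelines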
Developer Certificate of Origin
```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```