cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Cleaner workflow parallelism #523

Closed mathemancer closed 4 years ago

mathemancer commented 4 years ago

Fixes

Fixes #522 by @mathemancer

Description

This PR changes the cleaning logic so that a row no longer fails to be cleaned when its tags or metadata are missing or defective. Instead, the DAG cleans the remaining fields, writes the result, and continues. The PR also lowers the DAG's concurrency to 8 to reduce locking problems in the DB. If the DAG does fail to clean a particular row, it now logs that row's identifier to a file for further analysis. Finally, it turns down logging for a couple of particularly noisy modules.
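The behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the field names (`tags`, `meta_data`, `identifier`), the function names, and the `write_row` callback are all assumptions made for the example.

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical field names; the real column names may differ.
OPTIONAL_JSON_FIELDS = ("tags", "meta_data")


def _parse_optional_json(raw):
    """Return parsed JSON, or None when the value is missing or defective."""
    if raw is None:
        return None
    if not isinstance(raw, str):
        return raw
    try:
        return json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def clean_row(row):
    """Clean every field; bad tags or metadata become None instead of raising."""
    cleaned = {
        key: (value.strip() if isinstance(value, str) else value)
        for key, value in row.items()
        if key not in OPTIONAL_JSON_FIELDS
    }
    for field in OPTIONAL_JSON_FIELDS:
        cleaned[field] = _parse_optional_json(row.get(field))
    return cleaned


def clean_rows(rows, write_row, failure_log_path):
    """Clean and write each row, recording identifiers of rows that fail."""
    failed_ids = []
    for row in rows:
        try:
            write_row(clean_row(row))
        except Exception:
            logger.exception("Failed to clean row %s", row.get("identifier"))
            failed_ids.append(row.get("identifier"))
    if failed_ids:
        # Append so that repeated runs accumulate failures for later analysis.
        with open(failure_log_path, "a") as failure_log:
            for identifier in failed_ids:
                failure_log.write(f"{identifier}\n")
    return failed_ids


def quiet_loggers(names, level=logging.WARNING):
    """Raise the log level of the named (noisy) loggers."""
    for name in names:
        logging.getLogger(name).setLevel(level)
```

In this sketch a defective optional field is replaced with `None` rather than aborting the row, and a failure while cleaning or writing one row does not stop the loop; the row's identifier is simply appended to the failure file.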

Tests

There are tests covering the new functionality.

Checklist

- [X] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [X] My pull request targets the *default* branch of the repository (`main` or `master`).
- [X] My commit messages follow [best practices][best_practices].
- [X] My code follows the established code style of the repository.
- [X] I added tests for the changes I made (if applicable).
- [ ] ~I added or updated documentation (if applicable).~
- [X] I tried running the project locally and verified that there are no visible errors.

[best_practices]:https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```