WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Make alter data batch size configurable by media type #5124

Closed by stacimc 3 weeks ago

stacimc commented 3 weeks ago

Fixes

Description

By default, Airflow allows at most 1024 dynamically mapped tasks to be created at a time (this limit is configurable, but raising it is not advised). The `alter_data` step fails in the image data refresh because it tries to process records in batches of 100_000 at a time, and there are so many image records that the resulting number of batches (about 9k) exceeds that task limit and overloads XComs.
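The arithmetic behind the failure can be sketched as follows. This is an illustration only: the record count is an assumption back-derived from the "about 9k batches" figure above, and the function names are hypothetical rather than taken from the catalog code.

```python
# Airflow's default cap on dynamically mapped tasks, as stated above.
MAX_DYNAMIC_TASKS = 1024

# Assumed image record count, back-derived from "about 9k batches" of 100_000.
ASSUMED_IMAGE_RECORDS = 900_000_000


def batch_count(total_records: int, batch_size: int) -> int:
    """Return how many batches are needed to cover total_records (ceiling division)."""
    return -(-total_records // batch_size)


def min_batch_size(total_records: int, max_tasks: int) -> int:
    """Return the smallest batch size that keeps the batch count within max_tasks."""
    return -(-total_records // max_tasks)


if __name__ == "__main__":
    count = batch_count(ASSUMED_IMAGE_RECORDS, 100_000)
    print(f"batches at 100_000 per batch: {count}")
    print(f"within the task limit: {count <= MAX_DYNAMIC_TASKS}")
    smallest = min_batch_size(ASSUMED_IMAGE_RECORDS, MAX_DYNAMIC_TASKS)
    print(f"smallest batch size that fits: {smallest}")
```

Under this assumed record count, any batch size below roughly 880k records would still produce more mapped tasks than the default limit allows.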

This PR updates the `alter_data` step so that the batch size can be configured per media type, and sets image to use batches of 1,000,000 records. The `DATA_REFRESH_ALTER_BATCH_SIZE` environment variable is still respected as well.
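A minimal sketch of how such a lookup could behave, assuming the environment variable takes precedence whenever it is set (which matches the behavior described in the testing instructions). The constant names, the function name, and the 100_000 fallback for other media types are hypothetical and not taken from this PR's diff.

```python
import os

# Hypothetical fallback for media types without an explicit setting (assumption).
DEFAULT_ALTER_BATCH_SIZE = 100_000

# Hypothetical per-media-type settings; image uses the larger batch from this PR.
ALTER_BATCH_SIZE_BY_MEDIA_TYPE = {"image": 1_000_000}


def resolve_alter_batch_size(media_type: str, env=None) -> int:
    """Pick the alter_data batch size for a media type.

    The DATA_REFRESH_ALTER_BATCH_SIZE variable wins when it is set and
    non-empty; otherwise the per-media-type value (or the fallback) is used.
    """
    env = os.environ if env is None else env
    override = env.get("DATA_REFRESH_ALTER_BATCH_SIZE")
    if override:
        return int(override)
    return ALTER_BATCH_SIZE_BY_MEDIA_TYPE.get(media_type, DEFAULT_ALTER_BATCH_SIZE)


if __name__ == "__main__":
    # Variable unset: the per-media-type value applies.
    print(resolve_alter_batch_size("image", env={}))
    # Variable set, as in a local environment: the override applies.
    print(resolve_alter_batch_size("image", env={"DATA_REFRESH_ALTER_BATCH_SIZE": "1000"}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.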

Testing Instructions

Run the `staging_image_data_refresh` DAG locally. You'll see that it creates batches of 1,000 records, because `DATA_REFRESH_ALTER_BATCH_SIZE` is set to 1,000 in the local environment.

Now comment that variable out in your `catalog/.env`. Because of #5099, you will also need to comment it out in `catalog/env.template` to prevent it from being added right back in. Then run `ov j down -v && ov j up`. Run the DAG again and you will see that the `alter_data` step puts all the records into a single batch (because there are fewer than 1M records locally).


Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```