Open stacimc opened 7 months ago
Adding myself as project lead at least for now. As discussed at the priority meeting, we will move forward with making the proposal and at least starting work on the implementation plan, with the understanding that implementation may not move forward if this project is discovered to be more complicated than we believe it will be.
Updated the language from "data refresh server" to "ingestion server" to reflect the current name of the service, in order to be less confusing. We had experimented with renaming the ingestion server in some places in documentation; if this project proceeds we won't have to worry about this confusion anymore regardless :)
Work on the project proposal is underway, as is some early investigation into the feasibility of the project.
Hi @stacimc, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The project proposal was approved, and the implementation plan (IP) is awaiting its second approval.
The implementation plan has been merged and approved, and issues created under the milestone.
@stacimc I've moved this to "On Hold" while we wait to determine how to prioritize it; whether we start work on this project or aim to complete others first.
After discussing with @zackkrida, we've decided to move this into In Progress and kick off implementation. I'll be picking up the first issue today 🥳
This project is underway. The first major PR is up for review, with work in progress on the indexer worker image. There is also considerable progress on the infrastructure side with preparing for the indexer workers.
Noting that https://github.com/WordPress/openverse-infrastructure/pull/871 has been merged and the indexer worker pools are now available!
Progress was delayed due to AFK. The indexer worker has been drafted here and will be up for review this week.
The indexer worker is fully implemented and we're moving on to the last few implementation issues. One additional issue was added to the milestone after discussion on #4464.
After a conversation with @sarayourfriend yesterday, I'm considering removing the use of the autoscaling group and going back to a plan very similar to my original proposal in the IP for this project, allowing Airflow to directly manage the EC2 instances. The ASG has caused a few problems with error handling and, as noted in inline comments in the PR for implementing the distributed reindex, with retrying individual workers (one of the stated goals of the project, and a situation we've run into in recent memory).
This is not yet definitive. I will make a PR to change the implementation plan, since this is a big enough change that I want to get approval. However, I'll note that this would not invalidate any work that's been done so far, apart from a few in-progress changes in the linked PR (which need work either way!) and the removal of just the ASG on the infrastructure side.
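To make the "retrying individual workers" goal concrete, here is a minimal, orchestrator-agnostic sketch in plain Python. It is not the project's implementation and uses no Airflow or AWS APIs. The names `split_range` and `run_workers` and the `RuntimeError` failure signal are all hypothetical, chosen only for illustration. The idea is that each worker owns one contiguous slice of records, so a failure only requires re-running that slice rather than the whole reindex.

```python
from typing import Callable, Dict, List, Tuple

# Half-open range [start, end) of record positions (hypothetical unit of work).
Range = Tuple[int, int]


def split_range(total: int, n_workers: int) -> List[Range]:
    """Split [0, total) into n_workers contiguous, near-equal half-open ranges."""
    base, extra = divmod(total, n_workers)
    ranges: List[Range] = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers each take one additional record.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def run_workers(
    reindex: Callable[[Range], None],
    ranges: List[Range],
    max_attempts: int = 3,
) -> Dict[Range, int]:
    """Run `reindex` once per range, retrying only the ranges that fail.

    Returns the number of attempts each range needed. Re-raises the last
    error if a range still fails after `max_attempts` tries.
    """
    attempts: Dict[Range, int] = {}
    for r in ranges:
        for attempt in range(1, max_attempts + 1):
            try:
                reindex(r)
                attempts[r] = attempt
                break
            except RuntimeError:
                # Assumption: a worker signals failure by raising RuntimeError.
                if attempt == max_attempts:
                    raise
    return attempts
```

With a managed group of interchangeable instances it is harder to say which slice a failed instance owned. When the orchestrator tracks each worker directly, the mapping from worker to slice stays explicit, which is what makes a targeted retry like the one above straightforward.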
Great! I'll look out for the IP change (please ping me there) and will get the infra side ready for you. It should be a simple enough change, with some room for a small refactor I've been wanting to do for a while now: extracting the launch template and security group creation out into a separate module, so that one-off instances like the bastion can use it instead of the old user-data approach. Anyway, it should be a quick one to implement on the infra side and unblock live testing with staging.
This project was considerably delayed due to a sequence of AFK and assigned support work/meetups. Work has now been resumed. Importantly the following have been merged:
The final large chunk of implementation, to add the remaining steps, is in progress. Afterward there will be a few cleanup PRs and small pieces, but the major piece of work left will be integration tests.
This project is code complete and has just been postponed awaiting time to rigorously test the new data refresh in staging. Any issues discovered during testing will then have to be addressed, and the project will be kept open for an extended period, until several production runs have been completed, before we retire the old DAGs.
After tackling a number of small issues encountered during testing, the staging audio data refresh has been run successfully on the production Airflow instance, and the staging image data refresh is underway!
Description
We presently orchestrate our data refresh process with Airflow, but the operations themselves occur on a bespoke ingestion server that’s difficult to maintain, troublesome to deploy, and not nearly as robust to failures.
For this project, we will move the operational pieces of the data refresh into Airflow.
Documents
Issues
Milestones
Prior Art