Both the rapid iteration IP (https://github.com/WordPress/openverse/issues/1985) and the staging database recreation DAG (#1989) are underway. Most of the implementation for the DAG is complete, with #2207 and #2211 being the final two pieces before that can be shipped. We'll also need to complete #1990 before we turn the DAG on for regular use.
The staging database restore DAG is complete! 🥳 I will enable it for the first time once I'm back from WCEU next week.
The implementation plan for the rapid iteration (#2133) has been merged and the associated issues (#2370, #2371, #2372) have been created. This work needs to be performed serially, but it can begin immediately!
The initial implementation plan for the third portion of this project has also been published by @krysal in #2358.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The staging database restore DAG work was completed and the DAG was enabled for the first time this week! While it completed successfully, @sarayourfriend encountered some trouble with the Terraform state that references the RDS instance: the instance appeared to Terraform to have been removed, even though a new instance had been spun up. Sara corrected the state file and we believe this issue is resolved, but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.
@krysal opened the final IP for this work in #2358, and we've begun discussing it there.
> we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

After the subsequent runs, the changes in Terraform work as expected; no further work is needed here.
Per the priorities meeting discussion that happened around this project, we'll be putting this project on hold for now. We'll continue the review process to merge #2358, but hold off on any ongoing development efforts in favor of other ongoing projects.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've just taken this project off hold; the Elasticsearch rapid iteration milestone issues can be worked on once we address API stability!
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Work on this project has been slower due to our current focus on https://github.com/WordPress/openverse/issues/3197. With that effort winding down, we should be able to start working on this project again this cycle.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The only update is that we now have a PR open for creating the staging indices: #3232
While working on #3232, I went back and read the implementation plans for this project and came away with a lot of questions. It's possible some of these were answered in comment threads on the IP PRs and didn't make it into the final text, so apologies if I'm rehashing anything!
Here's a really brief summary of my understanding of the three IPs involved in this project, and the DAGs they describe:
- `staging_database_restore` DAG: recreates the staging API database from a recent production snapshot.
- `create_new_es_index_staging`: given a `media_type`, creates a new staging Elasticsearch index with the given `index-suffix`, derived from the given `source_index`, optionally filtering documents from the source using a given `query`. It also takes an `index-config` which is merged with the ES configuration of the source index (or totally overwrites it, if the `override_config` param is enabled), allowing you to change the shape of the index on the fly. (A rough sketch of the Elasticsearch operations this implies follows this list.)
- `create_new_es_index_production`: the production counterpart of the above.
- `recreate_full_staging_index`: recreates the full staging `<media_type>` index and points the `<media_type>-full` alias to it. If the `point_alias` param is enabled, it will instead point it to the main `<media_type>` alias. It also has a param that's used to decide whether to delete the index being replaced.
- `create_proportional_by_provider_staging_index`: creates a new staging index derived from the given `source_index`, but with a proportional subset of records for each provider, given by the `percentage_of_prod`
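For concreteness, here's roughly the kind of Elasticsearch operations I understand that description to mean. This is only a sketch (using the Python `elasticsearch` 8.x client); the function name, the merge shortcut, and the settings handling are illustrative rather than the DAGs' actual code:

```python
from elasticsearch import Elasticsearch


def create_index_from_source(
    es: Elasticsearch,
    source_index: str,
    destination_index: str,
    index_config: dict | None = None,
    override_config: bool = False,
    query: dict | None = None,
):
    """Illustrative sketch: build a new index derived from an existing one."""
    source = es.indices.get(index=source_index)[source_index]
    settings, mappings = source["settings"], source["mappings"]

    # Settings returned by the GET index API include values that cannot be
    # supplied at creation time, so drop them before reusing the settings.
    for key in ("uuid", "creation_date", "provided_name", "version"):
        settings["index"].pop(key, None)

    if index_config and override_config:
        settings = index_config.get("settings", {})
        mappings = index_config.get("mappings", {})
    elif index_config:
        # A real implementation would deep-merge; dict.update is a stand-in.
        settings.update(index_config.get("settings", {}))
        mappings.update(index_config.get("mappings", {}))

    es.indices.create(index=destination_index, settings=settings, mappings=mappings)

    # Copy documents across, optionally filtering with the provided query.
    reindex_source = {"index": source_index}
    if query:
        reindex_source["query"] = query
    es.reindex(
        source=reindex_source,
        dest={"index": destination_index},
        wait_for_completion=False,  # long reindexes are polled asynchronously
    )
```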
param.

So here are my questions:

- We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the `create_new_es_index_production` DAGs account for this? They currently create indices from the main ES index.
- The `create_new_es_index_{environment}` DAGs create the indices, but don't promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned? For example, they could take:
  - `target_alias`: target alias to apply to the new index. If empty, no alias is pointed.
  - `delete_old_index`: whether to delete an index that is being replaced by this one, if applicable (i.e., the index previously pointed to by the target alias).
- What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
- I do think it's a good idea to separate out the `create_proportional_by_provider_staging_index` DAG (rather than add the `percentage_of_prod` param to the `create_new_es_index_{environment}` DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
- Should we add an issue to create a `staging_data_refresh` DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).

I'll try my best to answer the above; @krysal may also be able to provide some additional context!
> We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the `create_new_es_index_production` DAGs account for this? They currently create indices from the main ES index.
Part of that move was prompted by the API response time investigation project, and the motivation was, as you mention, to reduce pressure on the live indices while the data refresh is happening. Since we were able to reduce response times through a combination of related-media query simplification and the API's ASGI implementation, I think we should be safe to target live indices for this DAG in the future. With the frequency of the data refresh (and the ease with which we could change the order of operations for it!), it made sense to make that change. For the more on-the-fly index generation that this new DAG was intended to enable, I'm not sure how we'd get around using the live indices (unless we decided to create a whole new index just to build this index from that, in which case we should just create the index from the database with the new settings).

All that to say: given how infrequently this particular DAG is likely to run and how much our Elasticsearch response times have improved, I think this should be a safe thing to do now :slightly_smiling_face:
> The `create_new_es_index_{environment}` DAGs create the indices, but don't promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned?
That's a great point, I like the idea of adding those new configuration options!
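To sketch what those suggested `target_alias`/`delete_old_index` params could do under the hood (purely illustrative, again assuming the Python `elasticsearch` 8.x client; not a committed design):

```python
from elasticsearch import Elasticsearch


def promote_index(
    es: Elasticsearch,
    new_index: str,
    target_alias: str | None,
    delete_old_index: bool = False,
):
    """Illustrative sketch: atomically point an alias at a newly built index."""
    if not target_alias:
        return  # no alias requested, nothing to promote

    old_indices = []
    if es.indices.exists_alias(name=target_alias):
        old_indices = list(es.indices.get_alias(name=target_alias).keys())

    # A single update_aliases call swaps the alias atomically, so searches
    # never see a moment where the alias points at nothing.
    actions = [{"remove": {"index": idx, "alias": target_alias}} for idx in old_indices]
    actions.append({"add": {"index": new_index, "alias": target_alias}})
    es.indices.update_aliases(actions=actions)

    if delete_old_index:
        for idx in old_indices:
            es.indices.delete(index=idx)
```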
> What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
As I understand it, `create_new_es_index_*` only uses an existing index, whereas `recreate_full_staging_index` should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived, while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.
> I do think it's a good idea to separate out the `create_proportional_by_provider_staging_index` DAG (rather than add the `percentage_of_prod` param to the `create_new_es_index_{environment}` DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
@sarayourfriend also mentioned this in the third IP. I also support having separate DAGs for these but as we're developing them we should keep reuse in mind!
> Should we add an issue to create a `staging_data_refresh` DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).
That could be useful, especially for testing changes to the data refresh process! I think one thing we'd want to add at the infrastructure level is an official DNS name for the staging data refresh server, because currently we only reference it by IP, which changes with every deployment.
Hopefully that helps! I'd love some other folks to weigh in, but I think it might make sense to take some of these changes back to the IPs and update them appropriately once we've decided :smile:
@stacimc You summarized the parts of these plans very well. The questions are quite valid, and I pretty much agree with the answers that @AetherUnbound has already given; I'll just add a few comments.
> What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
>
> As I understand it, `create_new_es_index_*` only uses an existing index, whereas `recreate_full_staging_index` should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived, while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.
Exactly! The `recreate_full_staging_index` is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the `create_proportional_by_provider_staging_index` and the `create_new_es_index_{environment}` DAGs.

Given the complexity of this project at the beginning, we decided to split the different ways of creating the indexes (by type, environment, and size, with custom configurations) into different IPs, and therefore into different DAGs; it was more understandable and manageable that way. Now that the requirements are clearer and we are more familiar with ES, we can see opportunities to merge the DAGs, but I'd wait until at least some of them are built before doing so, because the many moving parts are still there and it will be easier to test them separately. Although that's my opinion! Maybe others see it differently. I also believe the same DAGs should probably apply to both environments at the end of the day.
Thanks @krysal and @AetherUnbound, that additional context makes a ton of sense :)
> The `recreate_full_staging_index` is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the `create_proportional_by_provider_staging_index` and the `create_new_es_index_{environment}` DAGs.
That's a great summary, totally clicked for me. I think the underlying IPs are great, but I was stuck on a higher-level picture of how these DAGs are practically going to be used and relate to one another. That's actually sort of split out into a separate, future project 😅 For now, it makes a lot of sense to just be really flexible in the implementation, as already described :)
Update:
I've opened a PR for the proportional index DAG which is almost ready, mostly needing a final test and then description/testing instructions.
This PR also happens to tackle much of the work needed for this issue to add a DAG for pointing aliases; once it's merged that issue can be resolved very quickly.
While implementing the proportional DAG I noticed a problem and filed a bug for it (#3761). Looking at that now, I actually think the fixed version might've been easier to implement than the original idea, so I may update the proportional index DAG PR to include that fix as well.
Summary: there are only a few issues left in the milestone, the largest of which is almost fully implemented. The remaining issues should be tackled in the next week or so.
The proportional index DAG was merged after a while in code review, including the fix for bug #3761, which was filed while working on it! All that remains is the point alias DAG, which is in progress and a much smaller issue.
A PR is open for the point alias DAG.
While working on it, I noticed that a few of the concurrency dependencies between all the related Elasticsearch DAGs had not been set up properly, nor caught in review. That is a huge pain point of having all these DAGs, and very easy to miss. I prototyped an idea for handling this in a more automated way and created an issue (https://github.com/WordPress/openverse/issues/3891).
At minimum, the dependencies need to be fixed; ideally, that issue is implemented and this problem is solved going forward. My plan is to timebox the larger issue and fall back to simply fixing the dependencies manually if there are any problems with the more complex approach.
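As an illustration of the general idea (not the prototype from #3891), a sensor that waits for related DAGs to be idle might look something like this, assuming a recent Airflow 2.x; the DAG ids, placement, and timings are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.sensors.python import PythonSensor
from airflow.utils.state import DagRunState

# Hypothetical list of related DAGs that should not run concurrently with
# this one (the enclosing DAG's own id is deliberately not in the list).
RELATED_ES_DAG_IDS = [
    "create_new_es_index_staging",
    "recreate_full_staging_index",
    "create_proportional_by_provider_staging_index",
]


def _no_related_dags_running() -> bool:
    """Return True only when none of the related DAGs has an active run."""
    return not any(
        DagRun.find(dag_id=dag_id, state=DagRunState.RUNNING)
        for dag_id in RELATED_ES_DAG_IDS
    )


with DAG(
    dag_id="point_staging_alias",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    wait_for_related_dags = PythonSensor(
        task_id="wait_for_related_es_dags",
        python_callable=_no_related_dags_running,
        poke_interval=60,      # check once a minute
        mode="reschedule",     # free the worker slot between pokes
        timeout=2 * 60 * 60,   # give up after two hours
    )
```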
The final PR for this milestone has been merged! Moving this project to Shipped.
Per the project's stated success criteria, which is:
> This project can be considered a success once any maintainer can make adjustments to the staging API database & Elasticsearch index based on the requirements described above.
I think this could be considered a success. The process docs note that a project-specific retro could be held at this stage as well, but feedback related to this project has already been discussed at the two most recent retros.
Given this, I would like to propose that this project be moved to Success and the project thread closed. @WordPress/openverse-maintainers, what do you think (and maybe specifically @AetherUnbound, as the author of the proposal)? I can also wait to raise this question at the community meeting if preferred!
I would be comfortable moving this to Success!
Excellent! Moving to Success 🎉
Description
Modify the API staging environment to include a proportional subset of media from each provider. Increase the frequency of data refreshes.
This project does not address metrics for measuring and tracking relevancy.
To implement this project, we want to read the production `/stats/` endpoints to get the media totals for each provider, then scale these numbers down to set the ingestion limit per provider in staging. This project could also include updating the Elasticsearch cluster (or setting up a new cluster and moving the staging environment there).
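A sketch of that scaling step, assuming the stats payload is a list of objects with `source_name` and `media_count` fields (the endpoint URL and the 10% figure below are only examples):

```python
import requests

PRODUCTION_STATS_URL = "https://api.openverse.org/v1/images/stats/"


def proportional_ingestion_limits(percentage_of_prod: float) -> dict[str, int]:
    """Scale each provider's production media count down to a staging limit.

    Assumes the stats endpoint returns a list of objects with `source_name`
    and `media_count` fields; adjust if the real payload differs.
    """
    stats = requests.get(PRODUCTION_STATS_URL, timeout=30).json()
    return {
        entry["source_name"]: max(1, int(entry["media_count"] * percentage_of_prod))
        for entry in stats
    }


# e.g. keep 10% of each provider's records in staging
limits = proportional_ingestion_limits(0.10)
```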