Both the rapid iteration IP (https://github.com/WordPress/openverse/issues/1985) and the staging database recreation DAG (#1989) are underway. Most of the implementation for the DAG is complete, with #2207 and #2211 being the final two pieces before that can be shipped. We'll also need to complete #1990 before we turn the DAG on for regular use.
The staging database restore DAG is complete! 🥳 I will enable it for the first time once I'm back from WCEU next week.
The implementation plan for the rapid iteration (#2133) has been merged and the associated issues (#2370, #2371, #2372) have been created. This work needs to be performed serially, but it can begin immediately!
The initial implementation plan for the third portion of this project has also been published by @krysal in #2358.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The staging database restore DAG work was completed and the DAG was enabled for the first time this week! While it completed successfully, @sarayourfriend encountered some trouble with the Terraform state that references the RDS instance: the instance appeared to Terraform to have been removed, even though a new instance had been spun up. Sara corrected the state file and we believe this issue is resolved, but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.
@krysal opened the final IP for this work in #2358, and we've begun discussing it there.
> we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

After the subsequent runs, the changes in Terraform work as expected; no further work is needed here.
Per the priorities meeting discussion that happened around this project, we'll be putting this project on hold for now. We'll continue the review process to merge #2358, but hold off on any ongoing development efforts in favor of other ongoing projects.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've just taken this project off hold; the Elasticsearch rapid iteration milestone issues can be worked on once we address API stability!
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Work on this project has been slower due to our current focus on https://github.com/WordPress/openverse/issues/3197. With that effort winding down, we should be able to start working on this project again this cycle.
Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The only update is that we now have a PR open for creating the staging indices: #3232
While working on #3232, I went back and read the implementation plans for this project and came away with a lot of questions. It's possible some of these were answered in comment threads on the IP PRs and didn't make it into the final text, so apologies if I'm rehashing anything!
Here's a really brief summary of my understanding of the three IPs involved in this project, and the DAGs they describe:
- `staging_database_restore` DAG: recreates the staging API database from a recent production snapshot.
- `create_new_es_index_staging`: given a `media_type`, creates a new staging Elasticsearch index with the given `index-suffix`, derived from the given `source_index`, optionally filtering documents from the source using a given `query`. It also takes an `index-config` which is merged with the ES configuration of the source index (or totally overwrites it, if the `override_config` param is enabled), allowing you to change the shape of the index on the fly. (A rough sketch of the Elasticsearch operations this implies follows this list.)
- `create_new_es_index_production`: the production counterpart of the above.
- `recreate_full_staging_index`: recreates the full staging `<media_type>` index and points the `<media_type>-full` alias to it. If the `point_alias` param is enabled, it will instead point it to the main `<media_type>` alias. It also has a param that's used to decide whether to delete the index being replaced.
- `create_proportional_by_provider_staging_index`: creates a new staging index derived from the given `source_index`, but with a proportional subset of records for each provider, given by the `percentage_of_prod`
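For concreteness, here's roughly the kind of Elasticsearch operations I understand that description to mean. This is only a sketch (using the Python `elasticsearch` 8.x client); the function name, the merge shortcut, and the settings handling are illustrative rather than the DAGs' actual code:

```python
from elasticsearch import Elasticsearch


def create_index_from_source(
    es: Elasticsearch,
    source_index: str,
    destination_index: str,
    index_config: dict | None = None,
    override_config: bool = False,
    query: dict | None = None,
):
    """Illustrative sketch: build a new index derived from an existing one."""
    source = es.indices.get(index=source_index)[source_index]
    settings, mappings = source["settings"], source["mappings"]

    # Settings returned by the GET index API include values that cannot be
    # supplied at creation time, so drop them before reusing the settings.
    for key in ("uuid", "creation_date", "provided_name", "version"):
        settings["index"].pop(key, None)

    if index_config and override_config:
        settings = index_config.get("settings", {})
        mappings = index_config.get("mappings", {})
    elif index_config:
        # A real implementation would deep-merge; dict.update is a stand-in.
        settings.update(index_config.get("settings", {}))
        mappings.update(index_config.get("mappings", {}))

    es.indices.create(index=destination_index, settings=settings, mappings=mappings)

    # Copy documents across, optionally filtering with the provided query.
    reindex_source = {"index": source_index}
    if query:
        reindex_source["query"] = query
    es.reindex(
        source=reindex_source,
        dest={"index": destination_index},
        wait_for_completion=False,  # long reindexes are polled asynchronously
    )
```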
param.

So here are my questions:

- We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the `create_new_es_index_production` DAGs account for this? They currently create indices from the main ES index.
- The `create_new_es_index_{environment}` DAGs create the indices, but don't promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned? For example, they could take:
  - `target_alias`: target alias to apply to the new index. If empty, no alias is pointed.
  - `delete_old_index`: whether to delete an index that is being replaced by this one, if applicable (i.e., the index previously pointed to by the target alias).
- What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
- I do think it's a good idea to separate out the `create_proportional_by_provider_staging_index` DAG (rather than add the `percentage_of_prod` param to the `create_new_es_index_{environment}` DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
- Should we add an issue to create a `staging_data_refresh` DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).

I'll try my best to answer the above; @krysal may also be able to provide some additional context!
> We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the `create_new_es_index_production` DAGs account for this? They currently create indices from the main ES index.
Part of that move was prompted by the API response time investigation project, and the motivation was, as you mention, to reduce pressure on the live indices while the data refresh is happening. Since we were able to reduce response times through a combination of related-media query simplification and the API's ASGI implementation, I think we should be safe to target live indices for this DAG in the future. With the frequency of the data refresh (and the ease with which we could change the order of operations for it!), it made sense to make that change. For the more on-the-fly index generation that this new DAG was intended to enable, I'm not sure how we'd get around using the live indices (unless we decided to create a whole new index just to build this index from that, in which case we should just create the index from the database with the new settings).

All that to say: given how infrequently this particular DAG is likely to run and how much our Elasticsearch response times have improved, I think this should be a safe thing to do now :slightly_smiling_face:
> The `create_new_es_index_{environment}` DAGs create the indices, but don't promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned?
That's a great point, I like the idea of adding those new configuration options!
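To sketch what those suggested `target_alias`/`delete_old_index` params could do under the hood (purely illustrative, again assuming the Python `elasticsearch` 8.x client; not a committed design):

```python
from elasticsearch import Elasticsearch


def promote_index(
    es: Elasticsearch,
    new_index: str,
    target_alias: str | None,
    delete_old_index: bool = False,
):
    """Illustrative sketch: atomically point an alias at a newly built index."""
    if not target_alias:
        return  # no alias requested, nothing to promote

    old_indices = []
    if es.indices.exists_alias(name=target_alias):
        old_indices = list(es.indices.get_alias(name=target_alias).keys())

    # A single update_aliases call swaps the alias atomically, so searches
    # never see a moment where the alias points at nothing.
    actions = [{"remove": {"index": idx, "alias": target_alias}} for idx in old_indices]
    actions.append({"add": {"index": new_index, "alias": target_alias}})
    es.indices.update_aliases(actions=actions)

    if delete_old_index:
        for idx in old_indices:
            es.indices.delete(index=idx)
```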
> What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
As I understand it, `create_new_es_index_*` only uses an existing index, whereas `recreate_full_staging_index` should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived, while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.
> I do think it's a good idea to separate out the `create_proportional_by_provider_staging_index` DAG (rather than add the `percentage_of_prod` param to the `create_new_es_index_{environment}` DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
@sarayourfriend also mentioned this in the third IP. I also support having separate DAGs for these but as we're developing them we should keep reuse in mind!
> Should we add an issue to create a `staging_data_refresh` DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).
That could be useful, especially for testing changes to the data refresh process! I think one thing we'd want to add at the infrastructure level is an official DNS name for the staging data refresh server, because currently we only reference it by IP, which changes with every deployment.
Hopefully that helps! I'd love some other folks to weigh in, but I think it might make sense to take some of these changes back to the IPs and update them appropriately once we've decided :smile:
@stacimc You summarized the parts of these plans very well. The questions are quite valid, and I pretty much agree with the answers that @AetherUnbound has already given; I'll just add a few comments.
> What is the purpose of the `recreate_full_staging_index` DAG? The `create_new_es_index_staging` DAG does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
>
> As I understand it, `create_new_es_index_*` only uses an existing index, whereas `recreate_full_staging_index` should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived, while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.
Exactly! The `recreate_full_staging_index` is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the `create_proportional_by_provider_staging_index` and the `create_new_es_index_{environment}` DAGs.

Given the complexity of this project at the beginning, we decided to split the different ways of creating the indexes (by type, environment, and size, with custom configurations) into different IPs, and therefore into different DAGs; it was more understandable and manageable that way. Now that the requirements are clearer and we are more familiar with ES, we can see opportunities to merge the DAGs, but I'd wait until at least some of them are built before doing so, because the many moving parts are still there and it will be easier to test them separately. Although that's my opinion! Maybe others see it differently. I also believe the same DAGs should probably apply to both environments at the end of the day.
Thanks @krysal and @AetherUnbound, that additional context makes a ton of sense :)
> The `recreate_full_staging_index` is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the `create_proportional_by_provider_staging_index` and the `create_new_es_index_{environment}` DAGs.
That's a great summary, totally clicked for me. I think the underlying IPs are great, but I was stuck on a higher-level picture of how these DAGs are practically going to be used and relate to one another. That's actually sort of split out into a separate, future project 😅 For now, it makes a lot of sense to just be really flexible in the implementation, as already described :)
Update:
I've opened a PR for the proportional index DAG which is almost ready, mostly needing a final test and then description/testing instructions.
This PR also happens to tackle much of the work needed for this issue to add a DAG for pointing aliases; once it's merged that issue can be resolved very quickly.
While implementing the proportional DAG I noticed a problem and filed a bug for it (#3761). Looking at that now, I actually think the fixed version might've been easier to implement than the original idea, so I may update the proportional index DAG PR to include that fix as well.
Summary: there are only a few issues left in the milestone, the largest of which is almost fully implemented. The remaining issues should be tackled in the next week or so.
The proportional index DAG was merged after a while in code review, including the fix for bug #3761, which was filed while working on it! All that remains is the point alias DAG, which is in progress and a much smaller issue.
A PR is open for the point alias DAG.
While working on it, I noticed that a few of the concurrency dependencies between all the related Elasticsearch DAGs had not been set up properly, nor caught in review. That is a huge pain point of having all these DAGs, and very easy to miss. I prototyped an idea for handling this in a more automated way and created an issue (https://github.com/WordPress/openverse/issues/3891).
At minimum, the dependencies need to be fixed; ideally, that issue is implemented and this problem is solved going forward. My plan is to timebox the larger issue and fall back to simply fixing the dependencies manually if there are any problems with the more complex approach.
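As an illustration of the general idea (not the prototype from #3891), a sensor that waits for related DAGs to be idle might look something like this, assuming a recent Airflow 2.x; the DAG ids, placement, and timings are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.sensors.python import PythonSensor
from airflow.utils.state import DagRunState

# Hypothetical list of related DAGs that should not run concurrently with
# this one (the enclosing DAG's own id is deliberately not in the list).
RELATED_ES_DAG_IDS = [
    "create_new_es_index_staging",
    "recreate_full_staging_index",
    "create_proportional_by_provider_staging_index",
]


def _no_related_dags_running() -> bool:
    """Return True only when none of the related DAGs has an active run."""
    return not any(
        DagRun.find(dag_id=dag_id, state=DagRunState.RUNNING)
        for dag_id in RELATED_ES_DAG_IDS
    )


with DAG(
    dag_id="point_staging_alias",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    wait_for_related_dags = PythonSensor(
        task_id="wait_for_related_es_dags",
        python_callable=_no_related_dags_running,
        poke_interval=60,      # check once a minute
        mode="reschedule",     # free the worker slot between pokes
        timeout=2 * 60 * 60,   # give up after two hours
    )
```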
The final PR for this milestone has been merged! Moving this project to Shipped.
Per the project's stated success criteria, which is:
> This project can be considered a success once any maintainer can make adjustments to the staging API database & Elasticsearch index based on the requirements described above.
I think this could be considered a success. The process docs note that a project-specific retro could be held at this stage as well, but feedback related to this project has already been discussed at the two most recent retros.
Given this, I would like to propose that this project be moved to Success and the project thread closed. @WordPress/openverse-maintainers, what do you think (and maybe specifically @AetherUnbound, as the author of the proposal)? I can also wait to raise this question at the community meeting if preferred!
I would be comfortable moving this to Success!
Excellent! Moving to Success 🎉
Description
Modify the API staging environment to include a proportional subset of media from each provider. Increase the frequency of data refreshes.
This project does not address metrics for measuring and tracking relevancy.
To implement this project, we want to read the production `/stats/` endpoints to get the media totals for each provider, then scale these numbers down to set the ingestion limit per provider in staging. This project could also include updating the Elasticsearch cluster (or setting up a new cluster and moving the staging environment there).
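A sketch of that scaling step, assuming the stats payload is a list of objects with `source_name` and `media_count` fields (the endpoint URL and the 10% figure below are only examples):

```python
import requests

PRODUCTION_STATS_URL = "https://api.openverse.org/v1/images/stats/"


def proportional_ingestion_limits(percentage_of_prod: float) -> dict[str, int]:
    """Scale each provider's production media count down to a staging limit.

    Assumes the stats endpoint returns a list of objects with `source_name`
    and `media_count` fields; adjust if the real payload differs.
    """
    stats = requests.get(PRODUCTION_STATS_URL, timeout=30).json()
    return {
        entry["source_name"]: max(1, int(entry["media_count"] * percentage_of_prod))
        for entry in stats
    }


# e.g. keep 10% of each provider's records in staging
limits = proportional_ingestion_limits(0.10)
```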