Research "Brown out" strategy for Solr to ES switch

s-taube commented 2 weeks ago

We would like to implement a "brown out" strategy to encourage API and Webhook users to migrate to the latest versions (V4 of the API and v2 of Webhooks, which utilize Elasticsearch rather than Solr). We envision doing so by giving these users more frequently-occurring errors when making API calls to previous versions.

For example, using waffles, we could make it so that 10% of API calls that use Solr do not work. And this would gradually increase over a week or two (20% the next day, 30% the day after that, etc.). Note: it is not clear if the impact would be to a certain percentage of cookies (i.e. users) or a percentage of all API calls.

Definition of Done: Decide on the best strategy for a "brown out" of Solr (v3 of APIs and v1 of Webhooks).

albertisfu commented 2 days ago

@mlissner Here are my findings and proposals regarding this issue.

Goal:

We need to display error messages to users who are currently using:

V3 of the Search API (Opinions and RECAP)
V1 of Search Alerts webhooks (Opinions)

This error message should encourage users to migrate to:

V4 of the Search API, or V3 of the Search API with the backwards-incompatible changes introduced by the switch to ES.
V2 of Webhooks for Opinions Alerts, or V1 of Webhooks for Opinions Alerts with the backwards-incompatible changes introduced by the switch to ES.

The error messages should only be displayed while these endpoints are running on Solr. Once we switch these endpoints to ES, these messages should no longer appear.

So the strategy we'll need to apply should be the following:

For the V3 Search API, while o-es-search-api-active and r-es-search-api-active are disabled for all users, we should apply the logic to show the error message to a defined percentage of users.

Django Waffle Flags work as follows:

If the Flag's "everyone" status is set to unknown and you select a percentage of users for which the flag should be active, it does the following:

First, it looks for a Cookie in the request (e.g., dwf_o-es-active=True|False)
- If the cookie exists, it returns the flag status from the cookie
- If the cookie doesn't exist, it uses a random function to determine the flag status:

if Decimal(str(random.uniform(0, 100))) <= self.percent:
    set_flag(request, self.name, True, self.rollout)
    return True
set_flag(request, self.name, False, self.rollout)

This method also sets the Cookie according to the status determined by the random function. For future requests, once the cookie is set, consecutive requests from that user will use the flag status stored in the cookie.
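The cookie-then-random decision described above can be sketched without Django. This is a simplified illustration, not Waffle's actual code: the function name `flag_is_active`, its arguments, and the returned tuple are hypothetical, and the cookie name pattern is taken from the example above.

```python
import random
from decimal import Decimal

COOKIE_TEMPLATE = "dwf_%s"  # same pattern as the dwf_o-es-active example


def flag_is_active(flag_name, percent, cookies, rng=random):
    """Return (active, cookie_to_set) for a percentage-based flag.

    cookies: dict of cookies sent with the request.
    cookie_to_set: (name, value) to send back, or None when an
    existing cookie already decided the outcome.
    """
    cookie_name = COOKIE_TEMPLATE % flag_name
    if cookie_name in cookies:
        # A previous response already fixed this client's status.
        return cookies[cookie_name] == "True", None

    # No cookie: draw a random number, then remember the outcome.
    active = Decimal(str(rng.uniform(0, 100))) <= Decimal(str(percent))
    return active, (cookie_name, str(active))
```

A browser that stores the returned cookie sends it back on the next request, so the second call takes the first branch and never reaches the random draw.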

This approach works well for the frontend. However, for API requests that don't use cookies, the flag status will be determined randomly on every request. If we create a flag called brown-out-solr to determine whether to throw an error on Solr Search API requests, it will work at the request level instead of the user level. This means some queries will succeed while others will throw errors based on the defined percentage. If this is acceptable, we can simply create this new flag and modify the V3APIPermission class to consider this new condition and display a custom error message when the flag is active for the request.
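To illustrate what "request level" means for a client that never sends cookies, here is a small simulation. It assumes every request gets an independent random draw; `request_is_browned_out` and `simulate_client` are hypothetical names used only for this sketch.

```python
import random


def request_is_browned_out(percent, rng):
    """One independent draw per request, as when no cookie is sent."""
    return rng.uniform(0, 100) <= percent


def simulate_client(num_requests, percent, seed=0):
    """Count how many of a cookie-less client's requests hit the flag."""
    rng = random.Random(seed)
    return sum(request_is_browned_out(percent, rng) for _ in range(num_requests))
```

With a 10% flag, roughly one in ten of the same client's calls would be affected, interleaved with calls that succeed.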

Alternative User-Based Approach

In case we want to display these errors to a percentage of users and ensure all their requests show the error once they fall within the defined percentage, we will need to take a different approach.

We could use logic similar to Django Waffle's Flag, based on a random function and the defined percentage. In fact, we could continue using the Flag for setting the percentage and rely on its random method to determine whether the Flag should be active.

So to display errors consistently for users within the defined percentage, we could:

Use similar logic to Django flag's random function and percentage definition
Create a Redis SET to store user flag status (replacing the cookie functionality)
For new requests:
- If the key doesn't exist in Redis, use the random method
- Store the determined flag status in Redis
- Use this stored status for subsequent requests

Roll out

A Flag includes a mode called "rollout," which is useful for gradually rolling out features. If this mode is active and the random function determines that the flag should be "Off" for a user, a session cookie is stored, which will be removed when the user closes their browser. On the other hand, if the random function determines that the flag should be "Active," the cookie is set with an expiration determined by WAFFLE_MAX_AGE, which defaults to one month. This way, when the percentage is increased, users whose flags were previously set to "Off" can be updated to "True" based on the new percentage.

We will need to consider a similar approach in our strategy if we are going to use Redis to make this flag user-based. The simplest solution would be to define a key expiration according to our rollout strategy. For instance, if we plan to increase the user percentage daily to display the error message, we should set the expiration of keys marked "Disabled" to one day. Keys marked as "Active" can have a longer expiration, possibly several months, or no expiration at all if we plan to clean them up after removing this code. This approach would allow us to achieve the same gradual rollout method when increasing the user percentage.
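A minimal sketch of the user-based idea with expirations, under several assumptions: an in-memory class stands in for Redis so the example runs on its own, the status is stored as one expiring key per user, and the key layout, TTL values, and all names (`StickyFlagStore`, `user_flag_is_active`) are hypothetical.

```python
import random
import time

DISABLED_TTL = 24 * 60 * 60     # assumed: matches a daily percentage bump
ACTIVE_TTL = 90 * 24 * 60 * 60  # assumed: "several months"


class StickyFlagStore:
    """In-memory stand-in for Redis keys that carry an expiry."""

    def __init__(self, clock=time.time):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def user_flag_is_active(store, user_id, percent, rng=random):
    """Reuse a stored per-user decision, or draw one and store it."""
    key = f"brown-out-solr:{user_id}"  # hypothetical key layout
    stored = store.get(key)
    if stored is not None:
        return stored
    active = rng.uniform(0, 100) <= percent
    # "Disabled" expires quickly so the user is re-drawn after the next bump.
    store.set(key, active, ACTIVE_TTL if active else DISABLED_TTL)
    return active
```

Because "Disabled" entries expire after a day, a user who was outside the percentage gets a fresh draw once the percentage has been raised, while "Active" users stay active.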

V1 Opinion Search Alerts Webhooks

Here the decision to display the error message would work slightly differently. To control whether to use Solr or ES for sending alerts, we use a Switch instead of a Flag because they are not session-based, and user requests are not involved when sending alerts. However, we can still apply a similar approach aligned with the one selected for V3 of the Search API. If we choose to use a random approach for sending alerts, we could display the error message instead of the regular webhook payload whenever the random method determines to do so, based on the defined percentage.

If we want this to be user-based, we will similarly need to use a Redis Set to store the "flag" for webhook users.

Replacing the webhook payload with an error message will likely cause the webhook to fail on the client side, which is intentional, so the user can take action. However, this has a side effect: if the intentional failure returns a status code other than 2xx, the webhook event will retry according to our retry policy. This means the webhook event will fail on every retry. While this is perhaps the right approach, since the user will receive multiple notifications about the webhook failure in accordance with our policy, it is something to keep in mind.
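The retry side effect can be pictured with a toy delivery loop. The actual retry policy is not described in this thread, so `max_retries` and the function `deliver_with_retries` are placeholders, and backoff delays are left out.

```python
def deliver_with_retries(send, max_retries):
    """Call send() until it returns a 2xx status or retries run out.

    Returns the list of status codes seen, one per attempt.
    """
    statuses = []
    for _ in range(1 + max_retries):
        status = send()
        statuses.append(status)
        if 200 <= status < 300:
            break  # delivered, no retry needed
    return statuses
```

If the client's endpoint rejects the replaced payload every time, it receives the initial delivery plus every retry, and each one fails the same way.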

Based on this, some final questions:

Which approach seems better to implement?
- Request-based using a random method: This will be easier to implement; however, it won't be consistent for users. Sometimes their requests/webhooks will succeed, and other times they will fail.
- User-based using Redis to store user status: This will be slightly more complex to implement but will provide a more consistent experience for users. Once they are included in the percentage to show errors, their behavior will remain consistent.
- I'd say the size of implementing either of the two approaches described is "medium", with the user-based approach taking slightly more time to develop.
What happens if users want to continue using V3 of the Search API and V1 of Webhooks, and they are prepared for the backwards-incompatible changes when we switch to ES? These error messages would still affect those users as well until the switch to ES is completed. Is that correct?

Let me know what you think.

mlissner commented 2 days ago

Thanks for all the details, Alberto. I think we might not have explained the goal here well enough. The idea is that we want to switch to Elastic for the APIs and webhooks, but when we do, a few of the fields will be backwards incompatible.

We've sent lots of emails and have warned that we'll be making this change on November 25th, but to be extra courteous, we are thinking that we will slowly deploy the change. First, we'll make 10% of API requests and Webhook events use Elastic, then 20, then 30, etc.

So no error messages are displayed to users. We just slowly start returning the new responses. The idea is that if somebody hasn't upgraded, their system will start crashing here and there, and they'll figure out that they need to upgrade. At first, the crashes won't be many, but pretty quickly all of their API requests or webhook events will crash (because we'll be at 100% of the waffle).
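As a rough way to reason about how soon a client would notice, the chance of seeing at least one Elastic response can be computed directly. This assumes each request is an independent draw at the current percentage; the function name is made up for this sketch.

```python
def chance_of_any_es_response(percent, num_requests):
    """Probability that at least one of num_requests independent
    calls is served by Elastic at the given flag percentage."""
    return 1 - (1 - percent / 100) ** num_requests
```

Under that assumption, a client making 50 requests while the flag sits at 10% has a better than 99% chance of receiving at least one new-style response, even though most individual requests still get the old one.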

Does that change your thinking? It doesn't need to be tied to users at all, and what we want to do is just slowly make the swap from Solr to Elastic.

albertisfu commented 2 days ago

Got it! Yeah, this feature makes more sense now.

It wouldn’t significantly affect either of the two proposed solutions (random request-based or user-based). The only difference is that, instead of displaying the error message, it'll use ES to query results and employ the ES serializers as well.

mlissner commented 2 days ago

Cool. So you'd still size this as M just to slowly make API results use Elastic increasingly commonly over the span of a week? Is that true even if I'm making the waffle percentage slightly larger each day? I was hoping to just be able to use the percentage part of the waffle configuration to slowly start using Elastic.

albertisfu commented 2 days ago

Yes, that’s correct! I’m actually considering still using the flag in the admin panel to define the usage percentage and its random function to determine whether the flag should be active for a request. If we decide to implement this on a user-based level, we’ll also need to use Redis to store the flag status while rolling out this feature.

So, I’d say either approach, request-based or user-based, is M size, with the difference being that the request-based implementation would take approximately one day, while the user-based approach would require around two days.

mlissner commented 1 day ago

OK, Alberto and I chatted about this, and it should be pretty easy. He's going to do one last check, but it looks like this should work. I'll create a new ticket for implementing the switch to Elastic next week.

albertisfu commented 1 day ago

Yes, that's correct. I got confused initially because I thought we wanted to display a custom error message for users. However, the goal is simply to start serving requests using Elasticsearch instead.

So, we're going to make this request-based. We’ll use the current flags, o-es-search-api-active and r-es-search-api-active, to set a percentage. Based on that percentage, requests will randomly be served using Elasticsearch instead of Solr. ~~No need of additional code.~~

albertisfu commented 1 day ago

Sorry, I just realized we need a small tweak in the code. Currently, the API flags o-es-search-api-active and r-es-search-api-active are checked twice within the code that renders API results. This could be problematic when we start activating them for a random percentage of requests. On the first check, the flag might return True, and on the second, False, which could lead to errors when rendering results. To fix this, we’ll need to refactor the code to ensure the flag is checked only once per request.
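A toy version of the problem and the fix, assuming the two checks behave as independent random draws. Where exactly the two checks live is not stated here, so the split into "backend" and "serializer" and all function names are illustrative guesses.

```python
import random


def flag_draw(percent, rng):
    """One random evaluation of a percentage flag (no cookie involved)."""
    return rng.uniform(0, 100) <= percent


def render_checking_twice(percent, rng):
    """Problem case: two separate evaluations can disagree."""
    backend = "es" if flag_draw(percent, rng) else "solr"
    serializer = "es" if flag_draw(percent, rng) else "solr"
    return backend, serializer


def render_checking_once(percent, rng):
    """Refactored case: evaluate once and reuse the result."""
    use_es = flag_draw(percent, rng)
    choice = "es" if use_es else "solr"
    return choice, choice
```

In the first function a request can end up querying one engine and serializing with the other; in the second the pair always matches because the decision is made a single time and passed along.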

mlissner commented 1 day ago

Darn and good point!

In that case, what size would you give this to do that refactor?

albertisfu commented 1 day ago

It'd be an easy XS, I think.

mlissner commented 1 day ago

Great. I opened https://github.com/freelawproject/courtlistener/issues/4714 to do that work.

freelawproject / courtlistener

Research "Brown out" strategy for Solr to ES switch #4650