Azure / Azure-Proactive-Resiliency-Library-v2

Azure Proactive Resiliency Library v2 (APRL) - Source for Azure WAF reliability guidance and associated ARG queries
https://azure.github.io/Azure-Proactive-Resiliency-Library-v2/
MIT License

💡 Feature Request - Detect resiliency anti-pattern for CosmosDB multi-write+Bounded staleness #259

Open davihern opened 3 days ago

davihern commented 3 days ago

Describe the solution you'd like

Cosmos DB has a documented anti-pattern: an account configured with multiple write regions (multi-write) and Bounded Staleness as its consistency level. https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels#bounded-staleness-consistency

When both settings are configured, the WARA tool should add a warning with the description: "Bounded Staleness in a multi-write account is an anti-pattern. This level would require a dependency on replication lag between regions, which shouldn't matter if data is read from the same region it was written to."
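As a rough illustration of the requested check, here is a minimal Python sketch that evaluates one account document. It assumes the account is represented as a dict shaped like the ARM resource for `Microsoft.DocumentDB/databaseAccounts`, with `properties.enableMultipleWriteLocations` and `properties.consistencyPolicy.defaultConsistencyLevel`; those property names should be verified against the current resource schema. The function name `check_multi_write_bounded_staleness` and the `WARNING` constant are hypothetical, and this is not how WARA or APRL implement their checks (those rely on ARG queries).

```python
from typing import Any, Mapping, Optional

# Warning text taken from the feature request above.
WARNING = (
    "Bounded Staleness in a multi-write account is an anti-pattern. "
    "This level would require a dependency on replication lag between regions, "
    "which shouldn't matter if data is read from the same region it was written to."
)


def check_multi_write_bounded_staleness(account: Mapping[str, Any]) -> Optional[str]:
    """Return the warning if the account combines multi-write with Bounded Staleness.

    `account` is assumed to be an ARM-style resource dict (hypothetical input shape).
    """
    props = account.get("properties") or {}
    # Assumed property: True when multiple write regions are enabled.
    multi_write = props.get("enableMultipleWriteLocations") is True
    # Assumed property: the account's default consistency level.
    policy = props.get("consistencyPolicy") or {}
    level = str(policy.get("defaultConsistencyLevel", ""))
    if multi_write and level.lower() == "boundedstaleness":
        return WARNING
    return None
```

Returning `None` for every other combination keeps the check narrow: neither multi-write on its own nor Bounded Staleness on its own is flagged, only the two together.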

FallenHoot commented 3 days ago

Relying on Bounded Staleness could indeed be considered an anti-pattern. The suitability of Bounded Staleness depends on the application's requirements for data consistency. If an application necessitates that reads always reflect the most recent writes across all regions, then Bounded Staleness, which allows for some lag, would not be appropriate.

The choice of consistency level should align with the application's needs.

This all goes back to the RTO/RPO. In essence, the decision on consistency levels should be tailored to the specific demands of your application.

The best way to explain it is with a video game analogy. Imagine you're playing a video game online with friends who are all in different parts of the world. The game is only fun if everyone sees the same game world at the same time, right? Bounded Staleness is like a setting that says it's okay if some friends see the game world with a few seconds of delay. It's fine for some games, but not for others where you need to see changes instantly.

So, if your game (or app) needs everyone to see the updates immediately, no matter where they are, Bounded Staleness isn't the best setting. You'd want Strong consistency for that. If each player only needs to see their own moves right away, Session consistency is enough: it guarantees that you always read your own writes within your session.

If your game can handle a little delay but must never show updates out of order, then Consistent Prefix is like a setting that makes sure everyone sees the moves in the order they happened, even if they see them a bit late.

And if it's okay for the game to update at different times for everyone, as long as it eventually gets updated, that's like Eventual consistency. It's the chill mode where the game doesn't stress about everyone being perfectly in sync.

So, it all depends on what kind of game you're playing: some need to be super in-sync, and some can be laid-back about updates. It really depends on how you want the application to perform.

davihern commented 3 days ago

I meant the combination of having multi-write regions AND bounded staleness. As the official documentation states: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels#bounded-staleness-consistency


There may be some corner cases where a customer wants multi-write in order to have data geo-replicated as soon as possible, while limiting reads and writes to a single region (with the secondary write region serving as a failover in a DR scenario). But it is worthwhile for WARA to raise an alert so the configuration gets reviewed: is that really what the customer needs, or are there better alternatives?