dapr / components-contrib

Community driven, reusable components for distributed apps
Apache License 2.0

Cosmos DB state store - does not support multi-master #701

Open sebader opened 3 years ago

sebader commented 3 years ago

When using Azure Cosmos DB with multi-master writes, does the state store component take this into account when I'm running Dapr in multiple geographical locations and redirect requests to the closest Cosmos DB region, instead of always going to the primary region?

sebader commented 3 years ago

ok, after digging into this myself a bit, unfortunately I have to say: No, it doesn't support it at all.

Let me explain using the following setup and its implications:

I have 3 regional deployments of my app. Let's say an AKS cluster each in EastUS, WestEurope and EastAsia. I have a global load balancer (Azure Front Door or Traffic Manager) in front that sends client requests to the clusters based on client region and can transparently fail over if need be. Cosmos DB is configured with multi-master write in all these regions. The Dapr state store is configured identically in all three regions.

This all works nicely when you test it. The data in the state store is available, no matter which region a request hits.

However: when you look at the Cosmos DB metrics, you will see that the primary Cosmos DB region receives all requests. The primary region in this case is the first region in the list of replication locations.

This has a couple of severe implications:

How to work around this? Actually, I don't know :( For other languages the Azure SDKs do offer ways to support multi-master (https://docs.microsoft.com/en-us/azure/cosmos-db/how-to-multi-master?tabs=api-async). The best solution is the one in the .NET SDK, where you tell your app which region it is running in and the SDK figures out which is the closest Cosmos DB endpoint (parameter ApplicationRegion). For other languages you can at least specify a list of preferred locations. However, the Azure SDK for Go does not support querying Cosmos DB in the first place, so the Dapr component uses this 3rd party SDK (which hasn't been updated in a very long time). That might make this a bigger issue to solve.
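To illustrate what a "preferred locations" list amounts to, here is a minimal sketch in plain Python. It is not code from any Azure SDK or from the Dapr component; the function names `pick_region` and `normalize` are hypothetical, and the fallback to the first available region is only an assumption that mirrors the "primary region" behaviour described above.

```python
from typing import Optional, Sequence


def normalize(region: str) -> str:
    """Compare region names loosely: 'West Europe' == 'westeurope'."""
    return "".join(region.split()).lower()


def pick_region(preferred: Sequence[str], available: Sequence[str]) -> Optional[str]:
    """Return the first preferred region that is currently available.

    If none of the preferred regions is available, fall back to the first
    available region (assumed here to be the account's primary region).
    Returns None when no region is available at all.
    """
    healthy = {normalize(r) for r in available}
    for region in preferred:
        if normalize(region) in healthy:
            return region
    return available[0] if available else None


# A sidecar running in West Europe would list its own region first.
replicas = ["East US", "West Europe", "East Asia"]
local_choice = pick_region(["West Europe", "East US"], replicas)

# If West Europe drops out, the same preference list fails over to East US.
failover_choice = pick_region(["West Europe", "East US"], ["East US", "East Asia"])
```

The point of the sketch is that each regional deployment would carry its own ordered list, so requests stay local while the region is healthy and move elsewhere when it is not.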

KaiWalter commented 3 years ago

@sebader - please check out our configuration https://dev.to/kaiwalter/using-azure-private-links-and-private-dns-zones-with-globally-distributed-resources-4ce3

Entries in the private DNS zones in each region point to the nearest (regional) Cosmos DB private endpoint, so that the cluster / Dapr sidecar always writes "locally".

Does this make sense to you?

sebader commented 3 years ago

thanks @KaiWalter ! This sure does look like an interesting workaround. But in the end it really is only that. You still don't get any real failover. If Cosmos has an issue in a region, you cannot fail over and basically you have to shut off your entire region (AKS etc.).

Plus, you could probably already achieve the same without using Private Link: in each region, instead of using the default Cosmos DB connection string in your Dapr binding, you modify it to include the region (mycosmos-westeurope.documents.azure.com...). But again, you don't get failover.
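As a rough sketch of that idea, the helper below builds a region-specific endpoint from an account name and a region. The host pattern is taken from the example in this comment; the `https` scheme, the `:443/` suffix, and the function name `regional_endpoint` are assumptions for illustration, so check the actual endpoint shown for your account before relying on it.

```python
def regional_endpoint(account: str, region: str) -> str:
    """Build a region-pinned Cosmos DB endpoint URL (assumed format).

    The region is lowercased and stripped of whitespace, so that
    'West Europe' becomes 'westeurope'.
    """
    region_slug = "".join(region.split()).lower()
    return f"https://{account}-{region_slug}.documents.azure.com:443/"


# Each regional deployment would configure its own value.
west_europe_url = regional_endpoint("mycosmos", "West Europe")
east_asia_url = regional_endpoint("mycosmos", "East Asia")
```

Pinning the URL this way keeps traffic in one region, but as noted it removes any chance of failing over to another region.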

KaiWalter commented 3 years ago

@sebader for us, using Private Link and private DNS zones is primarily about shielding our environment (attack surface reduction). So it is not intended as a workaround with regards to the Cosmos endpoint routing per se - just a side effect. For us, if one of the main resources like AKS, SQL, CosmosDb, ... goes down in a region, we would shift the whole workload to another region anyway. This is why we do not invest in failover on a resource level. But again, our use case of multi-master may be too special here in the context of this issue.

dapr-bot commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot commented 3 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

harvendra2022 commented 2 days ago

@sebader - Dapr does not support a multi-master read-write configuration/component with Cosmos DB, nor does it provide automatic failover capabilities similar to those available in the Cosmos DB SDK. It would be highly beneficial if Dapr could extend its functionality to include support for multi-master read-write operations in Cosmos DB. Therefore, it would be greatly appreciated if this issue could be revisited and re-opened for further consideration.

yaron2 commented 2 days ago

Done @harvendra2022

harvendra2022 commented 2 days ago

Thank you, @yaron2!