dapr / dotnet-sdk

Dapr SDK for .NET
Apache License 2.0
State Management: QueryStateAsync behaviour is inconsistent across state stores #1111

Closed by KrylixZA 1 year ago

KrylixZA commented 1 year ago

Context

Hey Dapr Community. This is partly a bug report and partly a topic for discussion, as I think it'll evolve into a bug once more is understood.

My team is building a new system that leverages Dapr's State Management building block. The issue we're facing is that the query endpoint behaves inconsistently when run against different state stores, specifically Redis and Cosmos DB. For local development, and for automated tests running in the CI portion of our pipeline, we are using Redis. However, when deployed, we are using Cosmos DB. We do not want developers pointing their machines to resources in the cloud, as we want to create fully contained development environments that can be run locally.

To that end, we have configured our docker compose environment to spin up an instance of the redis/redis-stack:latest image for our state store which is what is run locally for development and in our pipelines. This has largely achieved what we require for our development experience.

However, the data that is initially loaded into our state store is actually provided by another team which is not using Dapr. They have a CI/CD pipeline which pushes data into Cosmos DB. The data they push is a straightforward JSON object, and we've asked them to include the id, partitionKey and value properties. They put their objects into the value property and set id and partitionKey to the key that we use internally within the system. There are multiple entries in this state store, and all of them are segmented by customer. So if there are n customers and m objects per customer, we have n*m entries in the state store.
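To make the document shape described above concrete, here is a minimal sketch in Python. The function name, the example key and the payload fields are hypothetical; only the id, partitionKey and value properties come from the description.

```python
import json


def to_state_document(key: str, payload: dict) -> dict:
    """Wrap a payload in the id / partitionKey / value envelope described above."""
    return {
        "id": key,            # the key used internally within the system
        "partitionKey": key,  # set to the same key
        "value": payload,     # the other team's object, stored unchanged
    }


if __name__ == "__main__":
    # Hypothetical key and payload, for illustration only.
    document = to_state_document(
        "customer-a-widget-1",
        {"customerId": "customer-a", "name": "widget-1"},
    )
    print(json.dumps(document, indent=2))
```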

Now, digging into the product architecture a little. We are building a distributed micro-service design. Some services will be doing real-time event processing. Others will be used much less frequently, by customers who want to configure the system and see how things are running. For the real-time event processing, simply leveraging GetStateAsync has been sufficient, as internally we know the keys we need to fetch from the state store. However, now that we are introducing the customer aspect, we need to return only the data that is relevant to each customer. Given there is no native support to "Get All" and filter with LINQ, and we cannot ask the customer to know about the keys in the state store, the only option currently available to us is the Query state feature, accepting the risks associated with features being in Alpha.
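As a rough illustration of the kind of per-customer query meant here, the sketch below builds a query body and posts it to the sidecar's alpha query endpoint. It is written in Python against the Dapr HTTP API rather than the .NET SDK, to keep it language-neutral. The field name customerId, the store name and the page limit are assumptions, not taken from our system; port 3500 is the default sidecar HTTP port.

```python
import json
import urllib.request


def build_customer_query(customer_id: str, limit: int = 50) -> dict:
    """Build a query body that keeps only one customer's entries."""
    return {
        # "customerId" is an assumed field name inside the stored value.
        "filter": {"EQ": {"customerId": customer_id}},
        "page": {"limit": limit},
    }


def query_url(store_name: str, dapr_http_port: int = 3500) -> str:
    """URL of the sidecar's alpha query endpoint for the given state store."""
    return f"http://localhost:{dapr_http_port}/v1.0-alpha1/state/{store_name}/query"


def query_state(store_name: str, customer_id: str) -> dict:
    """POST the query to a running sidecar and return the decoded response."""
    body = json.dumps(build_customer_query(customer_id)).encode("utf-8")
    request = urllib.request.Request(
        query_url(store_name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # An empty body is treated as "no results".
        return json.loads(raw) if raw else {"results": []}
```

Calling query_state("statestore", "customer-a") requires a running sidecar with a state store component named statestore (a hypothetical name) that supports the query API.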

Expected Behavior

We were expecting QueryStateAsync to behave the same across Redis and Cosmos DB, as GetStateAsync does.

Actual Behavior

However, we're finding that this isn't the case. With Cosmos DB, the data that the other team pushes to the state store is perfectly queryable. However, that's not true of Redis.

When we load those same objects out of Cosmos DB into Redis for local development and testing, we have to pick the encoding mechanism. If we are working on a service that needs to leverage QueryStateAsync, we use an encoding that saves the objects as ReJSON-RL. However, if we want to run any code that uses GetStateAsync, we need to save it as a standard object (string). This means we have to be very selective. At any given time, some portions of the system will work locally and others will not.

Sure, we could update our GetStateAsync calls to pass contentType: application/json as part of the metadata, but then we end up writing Redis-specific code in a system that is meant to be agnostic of the underlying infrastructure. Ultimately, we don't want to step away from using Redis for local development, as we don't want to force developers to bind their machines to resources in the cloud. Redis is also nice and lightweight. These same benefits apply when running tests in the CI portion of our pipelines.
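For reference, this is roughly what the Redis-specific workaround amounts to at the sidecar's HTTP API: the contentType hint travels as a metadata.* query parameter on the get-state request. This is a sketch in Python rather than our .NET code, and the store name and key are hypothetical. Passing the metadata in from configuration, as assumed here, would keep the hint out of the calling code, but it does not remove the store-specific behaviour itself.

```python
from urllib.parse import quote, urlencode


def get_state_url(store_name, key, metadata=None, dapr_http_port=3500):
    """Build the sidecar get-state URL, adding metadata.* query parameters if given."""
    base = (
        f"http://localhost:{dapr_http_port}/v1.0/state/"
        f"{quote(store_name, safe='')}/{quote(key, safe='')}"
    )
    if not metadata:
        return base
    # Each metadata entry becomes a "metadata.<name>=<value>" query parameter.
    params = urlencode({f"metadata.{name}": value for name, value in metadata.items()})
    return f"{base}?{params}"


if __name__ == "__main__":
    # Redis holding JSON documents: pass the content type hint.
    print(get_state_url("statestore", "k1", {"contentType": "application/json"}))
    # Stores that need no hint: no metadata.
    print(get_state_url("statestore", "k1"))
```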

Based on a little research around other open issues, it does seem that state stores fronted by some kind of SQL API (Cosmos DB, Azure SQL DB, PostgreSQL) support the QueryStateAsync method perfectly fine, while those that use some kind of JSON search are inconsistent. In our particular case, we have a requirement to use NoSQL document stores and still be able to query. In theory, according to the State store component specs, Redis and Cosmos DB fit that niche use case, but it's not quite the reality on the ground (yet).

Any advice on what we could do differently? Maybe a different state store for local development? TIA 🥇

PS: We've ruled out the Cosmos emulator because it would need to run on Linux in the CI portion of our pipeline, and it is very heavy in terms of resource usage. We also couldn't get Dapr to connect to it, and it wasn't immediately obvious why.


Potentially related issues:

#783

#948

KrylixZA commented 1 year ago

Going to close this as it'll be resolved by https://github.com/dapr/dapr/issues/5146.