Elfocrash / Cosmonaut

🌐 A supercharged Azure CosmosDB .NET SDK with ORM support
https://cosmonaut.readthedocs.io
MIT License
341 stars 44 forks source link

CountAsync Performs In-Memory #62

Closed ijabit closed 5 years ago

ijabit commented 5 years ago

Awesome library!! However, when I use CountAsync I get a query that looks like this:

{"query":"SELECT VALUE root FROM root WHERE (root[\"AccountId\"] = XXXXX) "}

The regular Count() function will properly transform the query into SELECT VALUE COUNT(1) FROM... The performance implications of pulling all of the data in memory to perform the count are obviously very bad.

My query is nothing fancy, it looks like this:

await ArtistStore.Query().Where(x => x.AccountId == accountId).CountAsync();
Elfocrash commented 5 years ago

Hello @ijabit and thanks!

Did you profile those two calls in terms of speed and RUs?

The reason why CountAsync will create the query you see if because internally it is using the non-blocking pattern of paginated results so your application is not blocking the thread. If you do select value count(1) you cannot have the same pagination logic as you are aggregating a call across all partition key ranges.

Does this answer your question?

ijabit commented 5 years ago

Actually, AccountId is my partition key here so I don't need to query cross-partition. I tried adding the partition key to the query myself (do I need to do that since I put the PartitionKey decorator on that property??):

var results = await
    ArtistStore
        .Query(new FeedOptions { PartitionKey = new PartitionKey(accountId) })
        .Where(x => x.AccountId == accountId)
        .CountAsync()

I don't understand where paging comes in here, can you explain? Why would it need paging if it's getting back a single value from the query?

As for measuring the performance difference, I'm only returning a couple hundred rows so I can't really tell overall, but with queries returning thousands or hundreds of thousands of rows that could be a huge issue if I'm understanding what is going on here.

Elfocrash commented 5 years ago

If you have decorated your class then you shouldn't need to also add it in the FeedOptions. Cosmonaut will do that for you. Also, if a query has the partition key in the where clause, the Cosmos DB server is smart enough to only run it in a single partition anyway.

CountAsync is using the DocumentClient CountAsync method which is outside of our control. The reason why I say that there is pagination involved is that Cosmos DB, if there are many documents in there, doesn't return everything in a single resultset but rather multiple. That is true for both cross partition queries but also partition specific ones. It is simply how it works.

However if you run the aggregated count(1) function in the query, you are offloading this operation to the server side. This is bad because the server will take some time aggregating all the data internally and then returning the single value to you. During that time your applications hangs and waits for data as this is a synchronous operation.

However, CountAsync, will query a single value and aggregate on the client. This will perform better because you are not waiting for the server to give you a final value after some seconds. You are building this yourself and you application flows normally.

ijabit commented 5 years ago

Very good to know on the PartitionKey part. This is being called from an API that the web app calls, so blocking the UI isn't really an issue here. If you had a million documents to count, it seems crazy to pull a million 1k documents across the wire and perform a count in memory! That's a 1GB total download just to perform a count. The SDK must be doing something more clever under the covers.

But I didn't realize that the CountAsync stuff was part of the MS SDK, not this library. So I'll close this issue.

Thanks!

Elfocrash commented 5 years ago

Oh wait a second I just had a better look at the query and you are right, it looks like it retrieves the full document. I thought it was using just a series of 1s like the BulkExecutor library does.

I will do some more investigation on this and if it makes sense I will override the method with better handling for count where I only get a single digit per document. I will have a chat with the Cosmos DB team to see what their thoughts are on this. Thanks for raising it!