Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
731 stars 489 forks source link

Passing partial partition key in QueryRequestOptions hurts COUNT index utilization #3936

Open vit-svoboda opened 1 year ago

vit-svoboda commented 1 year ago

We are continuously addressing and improving the SDK, if possible, make sure the problem persist in the latest SDK version.

Describe the bug Including the same partial partition key (that is already in the aggregate count query WHERE clause) in the query options hurts the query performance. Was I not supposed to pass the partition key in there at all? I thought we were supposed to pass the partition key wherever possible.

To Reproduce Suppose I have a Cosmos DB "CountTest" with a collection "Subpartitioned" with a 2-part hierarchical partition key /TenantId, /id and I want to count all documents belonging to a specific tenant, say "1". A query SELECT VALUE COUNT(1) FROM c WHERE c["TenantId"] = "1" is quick and cheap, so far so good. To my surprise, if I execute this query using the latest SDK and pass new PartitionKeyBuilder().Add("1").Build() in the QueryRequestOptions the index utilization goes down significantly (down to 10% on the test data described here, down to 0% in my real life deployment). And on bigger data sets the query become prohibitively expensive.

Consider the following code.

using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.Linq;

var cosmosClient = new CosmosClient(Environment.GetEnvironmentVariable("CosmosConnectionString"));
var container = cosmosClient.GetDatabase("CountTest").GetContainer("Subpartitioned");

var expensiveResponse = await container.GetItemLinqQueryable<Document>(requestOptions: new QueryRequestOptions
    {
        PartitionKey = new PartitionKeyBuilder()
            .Add("1")
            .Build()
    })
    .Where(doc => doc.TenantId == "1")
    .CountAsync();

var cheapResponse = await container.GetItemLinqQueryable<Document>(requestOptions: new QueryRequestOptions
    {
        PartitionKey = null
    })
    .Where(doc => doc.TenantId == "1")
    .CountAsync();

Console.WriteLine($"I would expect {expensiveResponse.RequestCharge} to be the same as {cheapResponse.RequestCharge} but it's much higher.");

public class Document
{
    public string TenantId { get; set; }
    public string id { get; set; }
}

As a workaround, I can omit the PartitionKey from the QueryRequestOptions and the index utilization goes back to 100% and RU cost back down to single digits.

Expected behavior I would expect the count query index utilization to be the same if the partition key is inferred from the query as if it's explicitly passed in the request options. Passing the partition key in the options may seem a bit pointless, but I would not expect it to make performance worse.

Actual behavior Explicitly passing the (partial) partition key appears to hurt index utilization and thus query charge. Based on the diagnostics, the database has to retrieve each document it counts. Though on the output there's only the one document with a number.

Environment summary SDK Version: 3.35.0 OS Version: Windows 10

Additional context Obviously, this is only issue with subpartitions/hierarchical partition keys, as originally one could only pass the entire partition key and not just a part of it. Furthermore, I haven't observed any index utilization issues with normal document-fetching queries. Appears to be only a problem with the aggregated ones. It doesn't matter whether I use LINQ or the string based QueryDefinition.

ealsur commented 1 year ago

See https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3933 and https://github.com/Azure/azure-cosmos-dotnet-v3/pull/3934

There is currently an issue with Queries and Subpartitioning when the Partition Key is partial.

vit-svoboda commented 1 year ago

Does that mean my observation regarding request options was incorrect and the entire thing won't work reliably outside of the specific case I tested?

ealsur commented 1 year ago

@vit-svoboda What was discovered was that when using Partial Partition Keys with Queries, it fails to correctly route the query in some cases (when the partial Partition Key spans multiple physical partitions). With this bug still unresolved, it is hard to understand if the problem you are facing is really an issue with the query execution or really just linked to the routing problem.

vit-svoboda commented 8 months ago

I just retested the issue using the latest SDK (v3.37.1) where the possibly related issues mentioned above have been fixed. But my issue still appears to be there.