Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
743 stars 495 forks source link

High RU cost for case insensitive string functions #2175

Closed justinbhopper closed 2 years ago

justinbhopper commented 3 years ago

Describe the bug I'm not sure if this is the appropriate repository to report to. I have a container with a million documents, and it seems enabling the case insensitive flag in any string function causes a huge spike in request charge costs.

According to a blog released July, 2020, string functions now support case insensitivity and are supposed to be low cost: https://devblogs.microsoft.com/cosmosdb/new-string-function-performance-improvements-and-case-insensitive-search/

I don't exactly know how to communicate how to reproduce this other than basic examples.

Examples:

SELECT * FROM c 
where 
  c.partitionKey = 'Allergy' 
  and StringEquals(c.Patient.Name, 'John Smith', false)

RU Cost: 8.6 RUs

SELECT * FROM c 
where 
  c.partitionKey = 'Allergy' 
  and StringEquals(c.Patient.Name, 'John Smith', true)

RU Cost: 18,607 RUs

j82w commented 3 years ago

@sboshra or @timsander1 do you have any suggestions?

timsander1 commented 3 years ago

Is name a pretty high cardinality property? If it is, you can improve system function execution using this step in our troubleshooting guide.

Rewrite the query to: SELECT * FROM c where c.partitionKey = 'Allergy' and StringEquals(c.Patient.Name, 'John Smith', true) ORDER BY c.partitionKey, c.Patient.Name

And create a composite index on (c.partitionKey, c.Patient.Name)

justinbhopper commented 3 years ago

@timsander1 Yes it is a high cardinality. My example above is a simplified one. In the real world scenario, the data we are searching is in a nested array. It was my understanding that composite indexes do not support properties found within nested arrays. Is this correct?

So am I to understand that case insensitive searches do not utilize the index?

timsander1 commented 3 years ago

Case insensitive searches utilize the index but for StartsWith and StringEquals, do so less efficiently than case sensitive searches. The RU charge will increase as the cardinality of the property in the string system functions increases. However, 18,000 RUs seems abnormally high. Are you able to file a support ticket for this issue?

And yes, that's correct about using composite indexes with arrays. However, you could still try adding ORDER BY (even without a composite index). As long as the system function property is in the ORDER BY clause, it can make a big difference.

justinbhopper commented 3 years ago

@timsander1 Yes along with this github issue, I filed a support ticket and they are helping investigate too.

The biggest challenge with this is the fact that we chose to normalize our searchable data within a nested array. We got the idea from Microsoft's FHIR Server, which utilizes Cosmos. Before case insensitive searches were supported, we were able use this technique to store normalized text so that we could perform our own insensitive searches.

We stopped storing normalized text when we thought Cosmos began supporting case insensitivity natively. We're just now finding out that there are limitations like this and we may be better off continuing to do the normalization ourselves.

timsander1 commented 3 years ago

Here is some new documentation for improving system function index utilization : https://docs.microsoft.com/en-us/azure/cosmos-db/troubleshoot-query-performance#improve-string-system-function-execution

justinbhopper commented 3 years ago

@timsander1 Thank you for the documentation.

Something that isn't explained well is - what determines which properties should be placed in the ORDER BY and composite indexes in order to receive this performance improvement?

We often perform queries with expressions on a wide variety of properties, some of which are basic binary expressions and others are string functions. The example in the documentation is including both c.name, c.town, even though only c.town was using a string function. If there had been a 3rd property involved in the WHERE clause, would that property also need to go in the composite index?

timsander1 commented 3 years ago

You would want to include it if it were an equality filter (you can have up to 8 in a composite index) but you can only have a maximum of 1 range filter (which would be your string system function)

justinbhopper commented 3 years ago

@timesander1

Can cosmos internal indexing store a normalized version of the text along with the natural version? This way, case-insensitive string functions can use the normalized index.

e.g.

document:
{
  name: "Justin Hopper"
}
index: {
    "name": "Justin Hopper",
    "name_normalized": "JUSTIN HOPPER"
  }
}

We are already doing this manually ourselves in our Cosmos database, as we had to come up with case insensitive support on our own when we started using Cosmos 2 years ago. It worked well for us and we're considering reviving the technique and abandoning case-insensitive string functions altogether due to the reliance of customizing composite indexes.

An example of how we store our data: https://github.com/microsoft/fhir-server/blob/main/docs/SearchArchitecture.md#persistence

timsander1 commented 3 years ago

For now, you could do this manually. In the future (rough ETA is Q32021), we plan to add the ability to index system function so you could index UPPER() or LOWER(). You'd still have to explicitly index UPPER() or LOWER(), though

justinbhopper commented 3 years ago

@timsander1 Can you explain a bit further on the future enhancement of UPPER() and LOWER()? Are you saying that we will be able to use UPPER() for comparisons without a performance hit? Will this require explicitly defining an index on the property or will the auto-indexing support this?

timsander1 commented 3 years ago

That's correct - you'd be able index the UPPER() function's value so that you could do easy case insensitive comparisons and ORDER BY. We'll give you the ability to define a computed property and index it. This won't be by default, you'd have to specifically add it.