CosmosDB: Consider adding API for extracting the partition key(s) value(s) from documents for CRUD operations

mihaitodor commented 9 months ago

Feature Request

Currently, the azcosmos package requires passing in the value of the partition key when performing CRUD operations. While this works OK, there is no API that one can use to extract the partition key value given a document and a partition key, so, in certain cases, one has to parse this partition key (i.e. /foo/bar/baz) and then write some code to extract the value that corresponds to it from the input JSON document. While this is not a huge issue, it forces users to maintain a parser for partition keys, which might not have the same semantics as the CosmosDB server.

While this functionality won't be useful for every CRUD operation, I think it can help people who are developing systems that can bulk load data into CosmosDB. It can, for example, be used as a basic validation mechanism, where documents which do not have the field that the partition key points to can be rejected immediately client-side, without having to first send them to the server.

ealsur commented 9 months ago

@mihaitodor Thanks for the feedback.

where documents which do not have the field that the partition key points to can be rejected immediately client-side

Not having a Partition Key Value is a valid value for the system. Documents can have a value, an empty value, or no value.

which might not have the same semantics as the CosmosDB server.

The values can be strings, numbers, or booleans. There are no special semantics for Cosmos DB, this is the same for all client SDKs. The Partition Key is the value of the property defined in the Partition Key Definition. The extraction of the value is not something done by the SDK but rather something the service does, I understand that this can lead to cases where you might pass a value that generates a mismatch (HTTP 400 response) but adding this validation on the SDK side before sending the request is not something any of the other SDKs do. Please correct me if I'm wrong @jhendrixMSFT, but the goal is that client SDKs do not perform input validation if such validation is already present in the service.

jhendrixMSFT commented 9 months ago

That's correct. We defer to the service to perform validation.

mihaitodor commented 9 months ago

Thank you for the quick reply @ealsur and @jhendrixMSFT! I think I should've provided a clear example of what I'm asking for.

Given a container with partition key /foo/bar/baz and the following document that I am trying to upload to CosmosDB via the Create operation:

{
  "foo": {
    "bar": {
      "baz": 41
    }
  }
}

If I don't know the value that /foo/bar/baz points at in advance, I need to write code which a) parses /foo/bar/baz into individual elements, b) parses the document and c) traverses the document to find the partition key value.

While this isn't a big deal, I worry that, at least for a) and c), my code might have different semantics that the server-side CosmosDB code which and I could get unexpected results.

Documents can have a value, an empty value, or no value.

For the "no value" part, are you referring to azcosmos.NullPartitionKey?

ealsur commented 9 months ago

For the "no value" part

You can use azcosmos.PartitionKey{} to say that a document does not have the property at all.

Thanks for sharing more details on the request.

The issue is that on the SDK, we receive the byte[]. In order to extract the Partition Key Value, the SDK would need to:

Parse the raw content as a JSON.
Do a read on the Container, to know which is the Partition Key Definition (which is the path to the property)
Extract the value, if any.

This poses a performance penalty, and introduces the need for the client SDK to first validate that the content is valid JSON (what is considered valid JSON might differ from the backend, what if the backend service supports certain JSON content that Go for some reason does not consider valid?) in order to extract the value. We could improve it a bit (for example, having a cache of Partition Key Definitions to avoid doing it for every document) but still, the extraction process would not be 1:1 with the service (which is what you are asking about, the service does not use Go, so there is no way to match 1:1 exactly).

mihaitodor commented 9 months ago

Thank you for the extra details @ealsur!

You can use azcosmos.PartitionKey{} to say that a document does not have the property at all.

Good to know! I was going over this document and I guess it's worth adding support for this functionality in a generic ETL tool. I'm currently finalising the CosmosDB integration in Benthos.

This poses a performance penalty

This is why I don't think it should be done on the default path. In many cases, users should just have the partition key value up front.

However, it would be convenient if users would be offered an official Microsoft client-only utility API they can call to extract the partition key value (or values if they use hierarchical partition keys) from their document when they don't have the value already specified. It would just be a Microsoft-blessed utility which ensures this value is parsed and extracted using the same semantics that the CosmosDB server will use. Then they can simply pass the value along with the original document bytes to the current azcosmos APIs.

We could improve it a bit (for example, having a cache of Partition Key Definitions to avoid doing it for every document)

Not sure I follow, but the current implementation forces users to set a fixed partition key value per TransactionalBatch (curious what APIs you'll introduce to enable support for hierarchical partition keys in #21063), which is somewhat inconvenient since one needs to ensure that documents are grouped by partition key value when creating these batches. However, I think this is expected given that the server can then simply forward the entire batch to the appropriate shard.

but still, the extraction process would not be 1:1 with the service (which is what you are asking about, the service does not use Go, so there is no way to match 1:1 exactly).

Fair enough. I'm mostly worried about parsing the partition key path itself, since the specification looks rather vague. Not sure what's valid when arrays and numbers are tossed in the mix. For example, the UI errors out if I try to create a container with /foo/5/bar as partition key. What about non-ASCII characters?

My current approach is to avoid parsing partition key paths and offer a different format for specifying the query that should be run against documents to extract the partition key value, but it would be more user-friendly to allow people to just copy / paste the partition key definition.

ealsur commented 9 months ago

Let's try to take the points apart.

Utility

I understand the ask about having a Microsoft-blessed utility, we can evaluate this as a feature ask. What I understand is that you want a utility that navigates JSON. The point however is that such a utility needs to know 2 things:

The body of the document
The Partition Key Definition - this says which is the path where the Partition Key Value should be

Without any of these, it's impossible for a utility to do the work. At that point, it's just a JSON navigator, which is not Cosmos DB specific? Like, ExtractValue(byte[] content, string path)?

ensures this value is parsed and extracted using the same semantics that the CosmosDB server will use

This is the part that I want to again highlight, the only way to do this is if these were both the same languages right? You can have a logic that maps conceptually what one language does and what the other does, but if two components use different languages and each language can have differences (like, what is a valid JSON) then it's impossible to achieve exactly the same.

Partition Key requirement

The requirement for a Partition Key is part of the system design, not an SDK decision, and thus, cannot be changed. I understand the difficulty that knowing the Partition Key Value might pose, but this is part of what the application should be aware of. Partition Key has 3 valid values: A number, a string, or a boolean. In the Go SDK, these maps to NewPartitionKeyString, NewPartitionKeyBool, and NewPartitionKeyBool. The cases for nil/null values is also contemplated through NullPartitionKey for cases when the document (because JSON allows it) has a null value. Empty values is something that we need to support due to old Containers migrated from an era where the Partition Key was optional.

For example, the UI errors out if I try to create a container with /foo/5/bar as partition key. What about non-ASCII characters?

Whether or not non-ASCII (or other characters or paths) are allowed, is something the client SDKs won't validate, it's the service who decides if they are valid or not. The guideline that @jhendrixMSFT mentioned is that if the service performs the validation, then the client should not do it.

Hierarchical Partition Key support

A hierarchical partition key is just a Partition Key with multiple values. The internal structure supports this already, it's just the public API that does not.

https://github.com/Azure/azure-sdk-for-go/blob/6cdec79da53b0e16e3a47104da782a94ecf83244/sdk/data/azcosmos/partition_key.go#L13-L15

Similar to other languages, we would end up introducing an API that can create a PartitionKey out of a list of values instead of one. For reference, see C#: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/PartitionKeyBuilder.cs

From the SDK perspective, these values go into the request header, the SDK does not need to parse or validate them (that's the service job).

Azure / azure-sdk-for-go