Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB
MIT License

itemPatch API Does not work with / #479

Closed danzafar closed 1 year ago

danzafar commented 1 year ago

We are trying to deploy to production and have documents that are encrypted and will include /. We had no problem using the Spark-Cosmos connector to write (UPSERT) these documents, but when we use the itemPatch API with the following code:

val writeStrategy = "spark.cosmos.write.strategy" -> "ItemPatch"
val fillConfig = "spark.cosmos.write.patch.columnConfigs" -> "[col(FILLS).path(/FILLS).op(set)]"

myDF
  .write.format("cosmos.oltp")
  .options(cfg + writeStrategy + fillConfig)
  .mode("append")
  .save()

We get this error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 142.0 failed 4 times, most recent failure: Lost task 0.3 in stage 142.0 (TID 447) (10.123.64.202 executor 1): {"ClassName":"BulkOperationFailedException","userAgent":"azsdk-java-cosmos/4.36.0 Linux/5.4.0-1090-azure JRE/1.8.0_302","statusCode":404,"resourceAddress":null,"innerErrorMessage":"All retries exhausted for 'PATCH' bulk operation - statusCode=[404:0] itemId=[//-59474], partitionKeyValue=[[\"//-59474\"]]","causeInfo":null,"responseHeaders":"{x-ms-substatus=0}"}

Looking at your test cases, you explicitly don't use /: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/test/java/com/azure/cosmos/CosmosItemIdEncodingTest.java#L307

We are working with voltage-encrypted records, which will have these / characters sometimes. We did not see this limitation of the connector mentioned in any Java SDK or Spark connector docs. Can you please suggest a work-around or fix this issue?

FabianMeiswinkel commented 1 year ago

This is the github repository for the Spark 2 connector (which doesn't even support patch).

That said, "/" and "\" are disallowed characters for "id" - this has always been documented in our resource model documentation here: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.resource.id?view=azure-dotnet

Upsert only works here because it effectively creates new documents - any point operation on an existing document (upsert, replace, delete, read) whose id contains "/" or "\" will fail. Only create (or an upsert that results in a create) and query are possible.

I agree that it is confusing that we even allow these documents to be created. The reason we don't prevent creation is that doing so would be a breaking change (a new API version with breaking behavior across the backend and all SDKs), which is not worth doing.

Bottom line - the only fix is to not use "/" and "\" characters in the "id" column. I assume you are seeing "/" because you use some base64-encoded value there. If so, masking is needed - see this code for an example where we do the masking in our encryption library (which also uses base64-encrypted ids with masking): https://github.com/Azure/azure-cosmos-dotnet-v3/blob/a34bac772e11642b8855251467d3f5a36a6c1fb8/Microsoft.Azure.Cosmos.Encryption/src/EncryptionProcessor.cs#L475-L489
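For readers landing here: the masking idea above can be sketched in Scala. This is a minimal illustration, not the actual logic of the linked encryption library - it assumes the id values are standard base64 strings, whose alphabet (A-Z, a-z, 0-9, '+', '/') never contains '_' or '-', so swapping '/' for '_' (and '+' for '-', as URL-safe base64 does) is reversible. The object and method names here are hypothetical.

```scala
object IdMasking {
  // Replace characters Cosmos DB disallows in "id" with characters
  // that cannot occur in standard base64 output, so the mapping is
  // reversible. Assumption: input ids are standard base64 strings.
  def maskId(id: String): String =
    id.replace('/', '_').replace('+', '-')

  // Invert the substitution to recover the original base64 id.
  def unmaskId(masked: String): String =
    masked.replace('_', '/').replace('-', '+')
}
```

Applied before the write (e.g. with a UDF or Spark's `translate` function on the id column), this keeps point operations like patch, read, and delete working, at the cost of having to unmask the id when it is read back.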

FabianMeiswinkel commented 1 year ago

The limitation is also documented here: https://learn.microsoft.com/en-us/azure/cosmos-db/concepts-limits#per-item-limits

If you can tell me in which part of the documentation this limitation would have been clearer to you, please let me know and I will work in the background with the right people to improve the documentation.

danzafar commented 1 year ago

@FabianMeiswinkel - I think we can both agree this is really in the weeds. It should be documented in the Java SDK and listed as a limitation in the Spark Cosmos connector. The fact that we have been using the connector since August without hitting this should be an indication. If it were properly documented, we would have seen it.