PlagueHO / CosmosDB

PowerShell Module for working with Azure Cosmos DB databases, collections, documents, attachments, offers, users, permissions, triggers, stored procedures and user defined functions.
http://dscottraynsford.com
MIT License

Bulk Create, Bulk Update and Bulk Delete Documents #445

Closed Naveen-Rangarajan closed 2 years ago

Naveen-Rangarajan commented 2 years ago

Issue

Before submitting your issue for the CosmosDB project, please take a moment to provide the following details:

I have created a partition key named "SM_GUID".

I am trying to do bulk operations to create, update and delete JSON documents in Azure Cosmos DB. I have gone through issue #425, but I am not able to work out the proper syntax for it.

Also, how is the -Body handled in $document?

Thank you for this wonderful Module. Please let me know if anything needs to be shared from my end.

PlagueHO commented 2 years ago

Hi @Naveen-Rangarajan - technically the REST APIs don't support bulk upload: https://docs.microsoft.com/en-us/rest/api/cosmos-db/create-a-document - so the Cosmos DB module doesn't implement this. Given that every document post requires a partition key to be provided in the header of the request, it wouldn't be possible to post multiple documents in a single REST API call anyway.

Note: The Document DB SDK does support bulk upload, but that uses a different interface and is a different story. In hindsight it would have been a good idea to have this module use the Document DB SDK - but again, that's another discussion.

However, I think what you're probably wanting is to be able to post multiple documents in a very short amount of time, in which case you'd just use ForEach-Object in PS 5.1/6 or ForEach-Object -Parallel in PS 7. Here is an example of the former: https://github.com/PlagueHO/CosmosDB/wiki#working-with-documents

To get very high throughput you'd use ForEach-Object -Parallel - https://devblogs.microsoft.com/powershell/powershell-foreach-object-parallel-feature/.
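
Just to illustrate the shape of that, here's a minimal sketch (not taken from the wiki) that posts a set of documents one at a time with ForEach-Object -Parallel. The account, key, database and collection names are placeholders, and it assumes the collection is partitioned on /SM_GUID:

```powershell
# Rough sketch only - account, key, database and collection names are placeholders.
$key = ConvertTo-SecureString -String 'your-primary-key' -AsPlainText -Force
$context = New-CosmosDbContext -Account 'myaccount' -Database 'MyDatabase' -Key $key

# Build some example document bodies - the id and partition key value live in the body itself.
$documents = 1..100 | ForEach-Object {
    $guid = [Guid]::NewGuid().ToString()
    @{ id = $guid; SM_GUID = $guid; Title = "SM_$_" }
}

# PowerShell 7+: post in parallel. For PS 5.1/6 drop -Parallel/-ThrottleLimit and use a plain script block.
$documents | ForEach-Object -ThrottleLimit 10 -Parallel {
    New-CosmosDbDocument -Context $using:context -CollectionId 'MyCollection' `
        -DocumentBody ($_ | ConvertTo-Json) -PartitionKey $_.SM_GUID
}
```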

Does that make sense? And thank you - I'm really glad the module is useful :grin:

Naveen-Rangarajan commented 2 years ago

Thank you for the update Dan and thanks again for clarifying this.

In my case, the partition key, along with other data, is retrieved from a CSV file. The partition key is named SM_GUID and has values like c0XxxXax-x53X-4XX6-aXXX-aXXxbeXxXdbx.

So, can we have a ForEach-Object loop run for each SM_GUID once the data is retrieved from the CSV into, say, a hashtable?

Also, is there any timeout after which the connection gets terminated? The amount of data is 14K-16K items. What would be the ideal way to handle this? Can you please suggest some options?

PlagueHO commented 2 years ago

Yep - that's easy enough to do with a ForEach-Object loop. Is the partition key stored in the CSV? We'd just use Import-Csv to pull all the values into an array of objects, then loop through it, assembling the documents and posting them one by one with New-CosmosDbDocument ... -DocumentId $guid -PartitionKey $guid -DocumentBody ....
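
As a rough sketch of that CSV-driven loop (the file path, the column names other than SM_GUID, and the collection name are placeholders, and $context is assumed to come from New-CosmosDbContext as usual):

```powershell
# Pull the rows from the CSV - each row becomes one document.
$rows = Import-Csv -Path 'C:\data\items.csv'

foreach ($row in $rows) {
    $body = @{
        id      = $row.SM_GUID
        SM_GUID = $row.SM_GUID
        Title   = $row.Title
        Owner   = $row.Owner
    } | ConvertTo-Json

    # Post the document, using SM_GUID as both the document id and the partition key value.
    New-CosmosDbDocument -Context $context -CollectionId 'MyCollection' `
        -DocumentBody $body -PartitionKey $row.SM_GUID
}
```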

If you used ForEach-Object -Parallel then it would go much faster - although be aware that the speed will depend on the RU/s assigned and your bandwidth. You also might want to configure a retry policy because if you do hit your RU/s limit you'll get 429s returned. You'll want the module to automatically back-off and retry rather than fail: https://github.com/PlagueHO/CosmosDB/wiki#how-to-handle-exceeding-provisioned-throughput
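
For completeness, the back-off policy is attached to the context when it's created; a minimal sketch along the lines of that wiki page (the retry count and delay values here are arbitrary):

```powershell
# Back off and retry up to 10 times when 429s are returned,
# adding 2000 ms to the delay Cosmos DB asks for (Additive method).
$backoffPolicy = New-CosmosDbBackoffPolicy -MaxRetries 10 -Method Additive -Delay 2000
$context = New-CosmosDbContext -Account 'myaccount' -Database 'MyDatabase' -Key $key -BackoffPolicy $backoffPolicy
```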

Naveen-Rangarajan commented 2 years ago

The SM_GUID partition key values for ~16K items will be available in the CSV.

There will be updates to the data. So the same concept (either ForEach-Object or ForEach-Object -Parallel) applies to Set-CosmosDbDocument as well, right?

The command you provided above, New-CosmosDbDocument, has -DocumentId $guid, which is for attachments. Am I correct? So I might not need this, right? I just insert the data as JSON into the DB.

Also, I was working on a POC to update the data. I was trying a partial update where I just need to update one specific value based on the key, but Set-CosmosDbDocument replaces the entire JSON.

Example below -

Existing Data as JSON in the DB is

{ "id": "1", "SM_GUID": "XXX-XXXXX-XXXX", "Title" : "SM_1", "Owner" : "Test.owner@gmail.com", "Member" : " " ---> This is empty initially }

Now I am trying to update the member by having below JSON

{ "id": "1", "SM_GUID": "XXX-XXXXX-XXXX", " Member" : "member1@gmail.com ; member2@gmail.com" }

Resulting JSON in the DB is

{ "id": "1", "SM_GUID": "XXX-XXXXX-XXXX", " Member" : "member1@gmail.com ; member2@gmail.com" }

The other key-value pairs are getting deleted. Is this the expected behavior? Or do I need to pass the entire JSON object to Set-CosmosDbDocument? If so, will it be time-consuming to process thousands of records?

Please help me.

Naveen-Rangarajan commented 2 years ago

@PlagueHO, can you please provide an update on the above query? Your reply is much appreciated.

PlagueHO commented 2 years ago

Hi @Naveen-Rangarajan - sorry about the delay - this totally slipped my mind.

Yes, this is the behavior of the CosmosDB REST API - you can't ship a partial document to Cosmos DB and have it be "merged" - at least not via the REST API. If you want to do a merge/modify then you'll need to do a read and set approach.

Regarding performance though - this is a much more interesting question. What I'd do (and have done for data sets with 15 million records) is the following:

  1. Use massively parallel processing to push 100 documents at a time. Each document get will take ~1 RU and each set will take about 2 RU or more (at a guess). So that is roughly 3 RU per document - and the read is less than 33% of the overall RU cost. So if you have 10,000 RU/s assigned then you can do about 3,333 documents a second - assuming your client side can pump that volume.
  2. Temporarily scale up the RU/s of the Cosmos DB to reduce the number of 429s (and therefore the performance impact) - then scale the RU/s back down. Be aware of how Cosmos DB behaves when scaling RU/s up and down and how it affects the physical partitions: https://docs.microsoft.com/en-us/azure/cosmos-db/scaling-provisioned-throughput-best-practices#background-on-scaling-rus and https://docs.microsoft.com/en-us/azure/cosmos-db/scaling-provisioned-throughput-best-practices#how-to-scale-up-rus-without-changing-partition-layout

IMHO the read-and-set won't actually have a big impact here if you scale up and parallelize.

So the logic would go something like this (not real code - just pseudocode):

# Scale up RUs
$data = ConvertFrom-Csv $mydata
$data | ForEach-Object -ThrottleLimit 100 -Parallel {
    $doc = Get-CosmosDbDocument -DocumentId $_.smid -PartitionKey $_.smid
    if ($doc) {
        # document exists, so merge the data and issue `Set-CosmosDbDocument -DocumentId $_.smid -PartitionKey $_.smid`
    } else {
        # document does not exist, so create it with `New-CosmosDbDocument -DocumentId $_.smid -PartitionKey $_.smid`
    }
} # loop
# Scale down RUs
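
As a rough sketch of what the "Scale up RUs" / "Scale down RUs" placeholders could look like using the module's offer cmdlets (the throughput figures here are just examples, and if the database has more than one collection you'd filter the offer first):

```powershell
# Scale the provisioned throughput (offer) up before the bulk run...
Get-CosmosDbOffer -Context $context |
    Set-CosmosDbOffer -Context $context -OfferThroughput 10000

# ...and back down again afterwards.
Get-CosmosDbOffer -Context $context |
    Set-CosmosDbOffer -Context $context -OfferThroughput 1000
```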

In theory we could implement a document merge helper function that could combine two document bodies together - but it would be of limited use because you'd really need to understand the logic of how to merge two documents. And it would still use a read and set approach.
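
For what it's worth, a very naive merge could be done client-side by overlaying the incoming properties onto the existing document before calling Set-CosmosDbDocument; this is just an illustration, not part of the module:

```powershell
# Naive shallow merge: properties in $Update overwrite (or are added to) $Existing.
function Merge-DocumentBody {
    param (
        [Parameter(Mandatory)] $Existing,    # e.g. the object returned by Get-CosmosDbDocument
        [Parameter(Mandatory)] [hashtable] $Update
    )

    foreach ($key in $Update.Keys) {
        # -Force overwrites the property if it already exists on the object.
        $Existing | Add-Member -NotePropertyName $key -NotePropertyValue $Update[$key] -Force
    }

    return $Existing | ConvertTo-Json -Depth 10
}
```

The resulting JSON could then be passed to Set-CosmosDbDocument as the -DocumentBody.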

Does that make sense?

Naveen-Rangarajan commented 2 years ago

@PlagueHO,

Thank you for the detailed explanation. I will further explore the options provided and will check with you for any clarifications.

Cheers, Naveen