Azure / azure-cosmosdb-bulkexecutor-dotnet-getting-started

Bulk Executor Utility for Azure Cosmos DB .NET SQL API
MIT License
66 stars 41 forks source link

  Azure Cosmos DB BulkExecutor library for .NET

The Azure Cosmos DB BulkExecutor library for .NET acts as an extension library to the Cosmos DB .NET SDK and provides developers out-of-the-box functionality to perform bulk operations in Azure Cosmos DB.

Table of Contents - [Consuming the Microsoft Azure Cosmos DB BulkExecutor .NET library](#nuget) - [Bulk Import API](#bulk-import-api) - [Configurable parameters](#bulk-import-configurations) - [Bulk import response object definition](#bulk-import-response) - [Getting started with bulk import](#bulk-import-getting-started) - [Performance of bulk import sample](bulk-import-performance) - [API implementation details](bulk-import-client-side) - [Bulk Update API](#bulk-update-api) - [List of supported field update operations](#field-update-operations) - [Configurable parameters](#bulk-update-configurations) - [Bulk update response object definition](#bulk-update-response) - [Getting started with bulk update](#bulk-update-getting-started) - [Performance of bulk update sample](bulk-update-performance) - [API implementation details](bulk-update-client-side) - [Performance tips](#additional-pointers) - [Contributing & Feedback](#contributing--feedback) - [Other relevant projects](#relevant-projects)

Consuming the Microsoft Azure Cosmos DB BulkExecutor .NET library

This project includes samples, documentation and performance tips for consuming the BulkExecutor library. You can download the official public NuGet package from here.

Bulk Import API

We provide two overloads of the bulk import API - one which accepts a list of JSON-serialized documents and the other a list of deserialized POCO documents.

Configurable parameters

Bulk import response object definition

The result of the bulk import API call contains the following attributes:

Getting started with bulk import

IBulkExecutor bulkExecutor = new BulkExecutor(client, dataCollection); await bulkExecutor.InitializeAsync();

// Set retries to 0 to pass complete control to bulk executor. client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0; client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;


* Call BulkImportAsync API
```csharp
BulkImportResponse bulkImportResponse = await bulkExecutor.BulkImportAsync(
    documents: documentsToImportInBatch,
    enableUpsert: true,
    disableAutomaticIdGeneration: true,
    maxConcurrencyPerPartitionKeyRange: null,
    maxInMemorySortingBatchSize: null,
    cancellationToken: token);

You can find the complete sample application program consuming the bulk import API here - which generates random documents to be then bulk imported into an Azure Cosmos DB collection. You can configure the application settings in appSettings here.

You can download the Microsoft.Azure.CosmosDB.BulkExecutor nuget package from here.

Performance of bulk import sample

Let us compare the performace of the bulk import sample application against a multi-threaded application which utilizes point writes (CreateDocumentAsync API in DocumentClient)

Both the applications are run on a standard DS16 v3 Azure VM in East US against a Cosmos DB collection in East US with 1 million RU/s allocated throughput.

The bulk import sample is executed with NumberOfDocumentsToImport set to 25 million and NumberOfBatches set to 25 (in App.config) and default parameters for the bulk import API. The multi-threaded point write application is set up with a DegreeOfParallelism set to 2000 (spawns 2000 concurrent tasks) which maxes out the VM's CPU.

We observe the following performance for ingestion of 25 million (~1KB) documents into a 1 million RU/s Cosmos DB collection:

Time taken (sec) Writes/second RU/s consumed
Bulk import API 262 95528 494186
Multi-threaded point write 2431 10280 72481

As seen, we observe >9x improvement in the write throughput using the bulk import API while providing out-of-the-box efficient handling of throttling, timeouts and transient exceptions - allowing easier scale-out by adding additional BulkExecutor client instances on individual VMs to achieve even greater write throughputs.

API implementation details

When a bulk import API is triggered with a batch of documents, on the client-side, they are first shuffled into buckets corresponding to their target Cosmos DB partition key range. Within each partiton key range bucket, they are broken down into mini-batches and each mini-batch of documents acts as a payload that is committed transactionally.

We have built in optimizations for the concurrent execution of these mini-batches both within and across partition key ranges to maximally utilize the allocated collection throughput. We have designed an AIMD-style congestion control mechanism for each Cosmos DB partition key range to efficiently handle throttling and timeouts.

These client-side optimizations augment server-side features specific to the BulkExecutor library which together make maximal consumption of available throughput possible.


Bulk Update API

The bulk update (a.k.a patch) API accepts a list of update items - each update item specifies the list of field update operations to be performed on a document identified by an id and parititon key value.

    Task<BulkUpdateResponse> BulkUpdateAsync(
        IEnumerable<UpdateItem> updateItems,
        int? maxConcurrencyPerPartitionKeyRange = null,
        int? maxInMemorySortingBatchSize = null,
        CancellationToken cancellationToken = default(CancellationToken));

List of supported field update operations

Supports incrementing any numeric document field by a specific value

class IncUpdateOperation<TValue>
{
    IncUpdateOperation(string field, TValue value)
}

Supports setting any document field to a specific value

class SetUpdateOperation<TValue>
{
    SetUpdateOperation(string field, TValue value)
}

Supports removing a specific document field along with all children fields

class UnsetUpdateOperation
{
    SetUpdateOperation(string field)
}

Supports appending an array of values to a document field which contains an array

class PushUpdateOperation
{
    PushUpdateOperation(string field, object[] value)
}

Supports removing a specific value (if present) from a document field which contains an array

class RemoveUpdateOperation<TValue>
{
    RemoveUpdateOperation(string field, TValue value)
}

Note: For nested fields, use '.' as the nesting separtor. For example, if you wish to set the '/address/city' field to 'Seattle', express as shown:

    SetUpdateOperation<string> nestedPropertySetUpdate = new SetUpdateOperation<string>("address.city", "Seattle");

Configurable parameters

Bulk update response object definition

The result of the bulk update API call contains the following attributes:

Getting started with bulk update

IBulkExecutor bulkExecutor = new BulkExecutor(client, dataCollection); await bulkExecutor.InitializeAsync();

// Set retries to 0 to pass complete control to bulk executor. client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0; client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;


* Define the update items along with corresponding field update operations
```csharp
SetUpdateOperation<string> nameUpdate = new SetUpdateOperation<string>("Name", "UpdatedDoc");
UnsetUpdateOperation descriptionUpdate = new UnsetUpdateOperation("description");

List<UpdateOperation> updateOperations = new List<UpdateOperation>();
updateOperations.Add(nameUpdate);
updateOperations.Add(descriptionUpdate);

List<UpdateItem> updateItems = new List<UpdateItem>();
for (int i = 0; i < 10; i++)
{
    updateItems.Add(new UpdateItem(i.ToString(), i.ToString(), updateOperations));
}

You can find the complete sample application program consuming the bulk update API here. You can configure the application settings in appSettings here.

In the sample application, we first bulk import documents and then bulk update all the imported documents to set the Name field to a new value and unset the description field in each document.

You can download the Microsoft.Azure.CosmosDB.BulkExecutor nuget package from here.

Performance of bulk update sample

When the given sample application is run on a standard DS16 v3 Azure VM in East US against a Cosmos DB collection in East US with 1 million RU/s allocated throughput - with NumberOfDocumentsToUpdate set to 25 million and NumberOfBatches set to 25 (in App.config) and default parameters for the bulk update API (as well as bulk import API), we observe the following performance for bulk update:

Updated 25000000 docs @ 53796 update/s, 491681 RU/s in 464.7 sec

API implementation details

The bulk update API is designed similar to bulk import - look at the implementation details of bulk import API for details.


Bulk Delete API

The bulk delete API accepts a list of <partitionKey, documentId> tuples to delete in bulk.

    Task<BulkDeleteResponse> BulkDeleteAsync(
        List<Tuple<string, string>> pkIdTuplesToDelete,
        int? deleteBatchSize = null,
        CancellationToken cancellationToken = default(CancellationToken));

Configurable parameters

Bulk delete response object definition

The result of the bulk delete API call contains the following attributes:

Getting started with bulk delete

BulkExecutor bulkExecutor = new BulkExecutor(client, dataCollection); await bulkExecutor.InitializeAsync();

// Set retries to 0 to pass complete control to bulk executor. client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0; client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;


* Define the list of <PartitionKey, DocumentId> tuples to delete
```csharp
List<Tuple<string, string>> pkIdTuplesToDelete = new List<Tuple<string, string>>();
for(int i=0; i < NumberOfDocumentsToDelete; i++)
{
    pkIdTuplesToDelete.Add(new Tuple<string, string>(i.ToString(), i.ToString()));
}

You can find the complete sample application program consuming the bulk delete API here. You can configure the application settings in appSettings here.

In the sample application, we first bulk import documents and then bulk delete a portion of the imported documents.

You can download the Microsoft.Azure.CosmosDB.BulkExecutor nuget package from here.


Performance tips


Contributing & feedback

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

See CONTRIBUTING.md for contribution guidelines.

To give feedback and/or report an issue, open a GitHub Issue.


Other relevant projects