Azure / azure-cosmos-dotnet-v2

Contains samples and utilities relating to the Azure Cosmos DB .NET SDK
MIT License

Performance issues with 2.0.0 #587

Closed Elfocrash closed 6 years ago

Elfocrash commented 6 years ago

So I was in the process of updating Cosmonaut to use the 2.0.0 version of DocumentDB.Core that was just released. Once I did that, I ran the sample code and it was noticeably slower.

image

These results are pretty consistent as well. Note here that the package version is the ONLY thing I changed.

The sample code can be found here

EDIT

I also dug a bit deeper to see if there is something in the versions that makes this performance issue clear.

As you can see in the image below, some change in 2.0.0-preview seems to have made the performance really bad. 2.0.0-preview2 managed to fix most of that, but 2.0.0 performs pretty much like 2.0.0-preview2, which is 2-3 times worse than 1.9.1.

image

Here are more numbers from the same code running against a real Azure CosmosDB instance.

image

ausfeldt commented 6 years ago

@Elfocrash Thanks for this. Looking into it.

ausfeldt commented 6 years ago

There are 2 things affecting performance between the versions 1.9.1 and 2.0.0.

1.) Library projects targeting .NET Standard or .NET Core will not have their dependencies copied to the bin. The dependencies will generally be loaded from the packages folder via the deps.json file. To get performance optimization on Windows x64, native assemblies are used (Microsoft.Azure.Documents.ServiceInterop.dll and DocumentDB.Spatial.Sql.dll) and will be copied to your bin. These native assemblies need to be alongside Microsoft.Azure.DocumentDB.Core.dll. Because Microsoft.Azure.DocumentDB.Core.dll is loaded from the packages folder instead of the bin, the native assemblies are not loaded. Adding <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies> will make sure that Microsoft.Azure.DocumentDB.Core.dll is copied to your bin, and you will see a performance increase. Alternatively, you can add Environment.SetEnvironmentVariable("DisableSkipInterop", "1"); to your code, which will also fix this issue.

2.) The networking stack was updated to reduce the number of connections established to the backend. This was achieved through multiplexing: all requests are made over a single connection instead of multiple connections for every request. The emulator has just a single replica, so all the requests go through a single connection, whereas a production endpoint has more than one replica and there will be a connection for each replica. In the next SDK release there will be multiplexing performance improvements on a single replica.
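As a minimal sketch of the project-file workaround from point 1, the property goes inside a PropertyGroup of the application's project file. The surrounding project (SDK style, output type, target framework) is assumed here purely for illustration; only the CopyLocalLockFileAssemblies element and the 2.0.0 package come from this thread.

```xml
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <!-- Assumed target framework; keep whatever the application already targets. -->
    <TargetFramework>netcoreapp2.0</TargetFramework>
    <!-- Copy NuGet package assemblies into the build output folder, so that
         Microsoft.Azure.DocumentDB.Core.dll sits next to the native assemblies. -->
    <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.Azure.DocumentDB.Core" Version="2.0.0" />
  </ItemGroup>

</Project>
```

After a rebuild, Microsoft.Azure.DocumentDB.Core.dll should appear in the bin folder, which is an easy way to confirm the property took effect.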

Elfocrash commented 6 years ago

Thanks for looking into this. However, I later updated my issue with an image showing results for an actual CosmosDB service running in the cloud, and I expect there is more than one replica in the cloud.

What about this one?

Elfocrash commented 6 years ago

Also on top of that, I tested the same thing in the following scenario just to make sure my observation is valid.

CosmosDB database account replicated in EUN and EUW. Two collections, one partitioned and one not partitioned.

I ran the sample code against those collections for 1 minute. The workload was a mix of every type of CRUD operation.

Here are the stats I collected:

1.9.1 processed 11700 documents. 2.0.0 processed 6800 documents. (Processed means created, queried, read, updated and removed.)

This is almost half the performance.

The single replica scenario theory might be a valid one but it isn't the only one.

ausfeldt commented 6 years ago

@Elfocrash Have you addressed the first issue first? It accounts for the bulk of the performance hit. Try adding <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies> to your application's project file, or add Environment.SetEnvironmentVariable("DisableSkipInterop", "1"); to your code.

Elfocrash commented 6 years ago

I will give that a go and get back to you but if it is missing shouldn’t that make it equally bad or equally good between the different versions as they target the same frameworks and they run on the same code? Am I missing something?

ausfeldt commented 6 years ago

In 2.0.0 there is logic to gracefully fall back if the native assemblies are not found next to Microsoft.Azure.DocumentDB.Core.dll. The trade-off is worse performance instead of not working at all (throwing a dll not found error). Because of deps.json, the assembly is loaded from the packages folder and the native assemblies are not alongside it, but deps.json is aware that they are in the runtimes folder of the package. The fallback does not handle this scenario. To work around it, you can either add <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies> to your application's project file, which will copy Microsoft.Azure.DocumentDB.Core.dll to your bin folder and satisfy the fallback logic, or add Environment.SetEnvironmentVariable("DisableSkipInterop", "1"); to your code, which will disable the fallback logic.
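To check which situation an application is in, something like the following diagnostic could help. It is only a sketch that mirrors the description above (are the two native assemblies in the same folder as a given managed assembly?) and is not the SDK's actual fallback code; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Reflection;

public static class NativeInteropCheck
{
    // Native assembly names as given earlier in this thread.
    private static readonly string[] NativeAssemblies =
    {
        "Microsoft.Azure.Documents.ServiceInterop.dll",
        "DocumentDB.Spatial.Sql.dll",
    };

    // Returns the native assemblies that are NOT present in the given directory.
    public static string[] MissingFrom(string directory)
    {
        return Array.FindAll(
            NativeAssemblies,
            name => !File.Exists(Path.Combine(directory, name)));
    }

    // Convenience overload: inspects the directory a managed assembly was loaded from.
    public static string[] MissingNextTo(Assembly assembly)
    {
        return MissingFrom(Path.GetDirectoryName(assembly.Location));
    }
}
```

Assuming DocumentClient is defined in Microsoft.Azure.DocumentDB.Core.dll, calling NativeInteropCheck.MissingNextTo(typeof(Microsoft.Azure.Documents.Client.DocumentClient).Assembly) at startup would list the native files missing from wherever that assembly was actually loaded. A non-empty result would suggest the fallback path is in play.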

Elfocrash commented 6 years ago

Ah that makes more sense.

Ok, I tried enabling the flag and indeed Microsoft.Azure.DocumentDB.Core.dll was copied into the bin folder.

I ran the same sample code with both partitioned and non-partitioned collections on both the Emulator (rate limiting disabled) and the real CosmosDB instance, which is replicated in two regions and provisioned with 1000 RU/s for each collection.

Here are the results

Emulator data

1.9.1 Processed 46900 documents in one minute with CopyLocalLockFileAssemblies set to true

1.9.1 Processed 46900 documents in one minute with CopyLocalLockFileAssemblies set to false

2.0.0 Processed 19200 documents in one minute with CopyLocalLockFileAssemblies set to true

2.0.0 Processed 18200 documents in one minute with CopyLocalLockFileAssemblies set to false

Real CosmosDB Instance data

1.9.1 Processed 14900 documents in one minute with CopyLocalLockFileAssemblies set to true

1.9.1 Processed 14900 documents in one minute with CopyLocalLockFileAssemblies set to false

2.0.0 Processed 7900 documents in one minute with CopyLocalLockFileAssemblies set to true

2.0.0 Processed 7300 documents in one minute with CopyLocalLockFileAssemblies set to false

(The tests were run repeatedly and these values are averages)

Even though the flag gives a 7.5% performance boost on the real service, 2.0.0 is still lagging 47% behind 1.9.1.
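For anyone who wants to redo the arithmetic, here is a small sketch using the real-instance numbers above. The helper name is made up. Note that the gain from the flag depends on the baseline: relative to the flag-off figure (7300) it is a little over 8%, and relative to the flag-on figure (7900) it is about 7.6%.

```csharp
using System;

public static class ThroughputComparison
{
    // Percentage by which 'value' differs from 'baseline' (negative means lower).
    public static double PercentChange(double baseline, double value)
    {
        return (value - baseline) / baseline * 100.0;
    }

    public static void Main()
    {
        // Documents per minute against the real CosmosDB instance, from the results above.
        const double v191 = 14900;
        const double v200FlagOff = 7300;
        const double v200FlagOn = 7900;

        Console.WriteLine($"2.0.0, flag on vs. off: {PercentChange(v200FlagOff, v200FlagOn):F1}%");
        Console.WriteLine($"2.0.0 (flag on) vs. 1.9.1: {PercentChange(v191, v200FlagOn):F1}%");
    }
}
```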

ausfeldt commented 6 years ago

Can you also try adding Environment.SetEnvironmentVariable("DisableSkipInterop", "1");? I want to ensure the fallback logic is not ignoring the native assemblies.

Elfocrash commented 6 years ago

Added Environment.SetEnvironmentVariable("DisableSkipInterop", "1"); at the top of my Main() method of the program and the results average to the same numbers.
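For reference, the placement looks roughly like the sketch below. The variable name and value are the ones suggested in this thread; the helper method, and the assumption that the call has to run before the SDK is first used, are illustrative only.

```csharp
using System;

public static class Program
{
    // Sets the process-level variable suggested in this thread.
    // Assumption: it should be set before any DocumentClient is created.
    public static void DisableSkipInterop()
    {
        Environment.SetEnvironmentVariable("DisableSkipInterop", "1");
    }

    public static void Main(string[] args)
    {
        DisableSkipInterop();

        // ... create the DocumentClient and run the sample workload here (omitted) ...

        // Sanity check that the variable is visible to this process.
        Console.WriteLine(Environment.GetEnvironmentVariable("DisableSkipInterop"));
    }
}
```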

Elfocrash commented 6 years ago

@ausfeldt Sorry, when I said that the results average to the same numbers, I meant the previous bad numbers. The change doesn't improve the performance.

Real CosmosDB Instance data

1.9.1 Processed 14900 documents in one minute with CopyLocalLockFileAssemblies set to true

1.9.1 Processed 14900 documents in one minute with CopyLocalLockFileAssemblies set to false

2.0.0 Processed 7900 documents in one minute with CopyLocalLockFileAssemblies set to true

2.0.0 Processed 7300 documents in one minute with CopyLocalLockFileAssemblies set to false

ausfeldt commented 6 years ago

@Elfocrash Using your sample code, I was initially not able to repro the issue. But when provisioning the collection with 1000 RUs instead of 5000 RUs or greater, I was able to reproduce it. In the next SDK release there will be multiplexing performance improvements. I'll update this thread if those improvements fix this issue; otherwise more investigation is needed. Thanks.