Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

ReadNextAsync goes crazy and load all documents from collection, against executing query. #2313

Closed darthkurak closed 2 years ago

darthkurak commented 3 years ago

Hi everyone. I have a nasty case for you, which we observe in our production code. It's random, it's crazy, and we are out of ideas as to why it is happening. We need your help, as the source seems to be inside the SDK itself. Take a seat and enjoy.

  1. Executed query: SELECT TOP 10 VALUE root FROM root WHERE ((true OR (root["User"]["IsDeleted"] = false)) AND StringEquals(root["User"]["AllegroData"]["Email"], "somestandard_nothingspecial@email.here", true)) ORDER BY root["_ts"] DESC

  2. Executed in portal: image

  3. Executed on our production: image

FeedIterator.ToListAsync: PageCount: "1", RuUsage: "1971.7099999999998", PageNumber: 0
FeedIterator.ToListAsync: PageCount: 0, RuUsage: 46456.38, PageNumber: 1
FeedIterator make more than one roundtrip! RuUsage: 48428.09, Query: "query from point 1"

So a single ReadNextAsync call produces most of these dependencies, resulting in ~46k RU usage.

  1. In this wrong execution, the sum of the dependencies' Metric-retrievedDocumentCount equals the document count of the whole collection - it means all documents from the collection are loaded on the Cosmos server. On the other hand, Metric-outputDocumentCount always has one or zero results (as expected from the query) image

  2. This happens randomly, a couple of times a day, always for the same query but with different parameters. image

  3. It takes time (see the picture from point 2)

  4. SDK versions: Microsoft.Azure.Cosmos (3.16), Microsoft.Azure.DocumentDB.Core (2.13.1)

  5. ToListAsync code:

public static async Task<List<T>> ToListAsync<T>(this FeedIterator<T> query, ILogger logger = null, string queryString = null)
{
    var results = new List<T>();
    double ruUsage = 0;
    var count = 0;
    string firstPageCount = "";
    string firstRu = "";

    while (query.HasMoreResults)
    {
        var page = await query.ReadNextAsync();
        results.AddRange(page);
        ruUsage += page.RequestCharge;

        // Remember the first page's stats; they are only logged if a second page shows up.
        if (count == 0)
        {
            firstPageCount = page.Count.ToString();
            firstRu = page.RequestCharge.ToString(CultureInfo.InvariantCulture);
        }

        // From the second page onward, log every page (including the deferred first page).
        if (count > 0)
        {
            if (count == 1)
            {
                logger?.Information(
                    "FeedIterator.ToListAsync: PageCount: {pageCount}, RuUsage: {ruUsage}, PageNumber: {count}",
                    firstPageCount, firstRu, 0);
            }

            logger?.Information(
                "FeedIterator.ToListAsync: PageCount: {pageCount}, RuUsage: {ruUsage}, PageNumber: {count}",
                page.Count, page.RequestCharge, count);
        }

        count++;
    }

    if (count > 1)
    {
        logger?.Warning("FeedIterator make more than one roundtrip! RuUsage: {ruUsage}, Query: {queryString}", ruUsage, queryString);
    }

    return results;
}
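
For reference, a minimal sketch of how this extension is typically invoked (the container instance, document type, and parameter value below are illustrative assumptions, not taken from the issue):

// Illustrative only: "container" is an existing Microsoft.Azure.Cosmos.Container,
// UserDocument is a placeholder POCO for the stored document shape.
QueryDefinition query = new QueryDefinition(
        "SELECT TOP 10 VALUE root FROM root " +
        "WHERE StringEquals(root[\"User\"][\"AllegroData\"][\"Email\"], @email, true) " +
        "ORDER BY root[\"_ts\"] DESC")
    .WithParameter("@email", "somestandard_nothingspecial@email.here");

FeedIterator<UserDocument> iterator = container.GetItemQueryIterator<UserDocument>(query);
List<UserDocument> users = await iterator.ToListAsync(logger, query.QueryText);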

Do you have any ideas why this is happening? Is it possible that something is wrong with the StringEquals function? Some edge case where it doesn't work properly?

j82w commented 3 years ago

Hi @darthkurak, can you try upgrading to 3.17? There is the following fix which I believe will address your problem.

If you still see the issue can you please capture the page.Diagnostics.ToString(). It has all the necessary information to root cause different latency and query issues.
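
For example, one hedged way to wire that into the ToListAsync loop shown earlier (the RU threshold and logging calls are assumptions, mirroring the logger already used there):

// Inside the while (query.HasMoreResults) loop, right after the existing ReadNextAsync call:
// page.Diagnostics is a CosmosDiagnostics instance; ToString() serializes the full trace.
// Serializing it on every page is costly, so only capture it for suspicious pages.
if (page.RequestCharge > 1000)
{
    logger?.Warning(
        "Expensive page: RuUsage: {ruUsage}, Diagnostics: {diagnostics}",
        page.RequestCharge,
        page.Diagnostics.ToString());
}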

Fyi: @timsander1

darthkurak commented 3 years ago

We did page.Diagnostics.ToString() and it blew up our production πŸ—‘οΈ

"Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct." - probably something for a separate thread ;)

Will let you know about SDK update.

j82w commented 3 years ago

This PR seems to fix the concurrency bug: https://github.com/Azure/azure-cosmos-dotnet-v3/pull/2314. Is it still broken in 3.17?

darthkurak commented 3 years ago

Looks like it. We made the change at the same time. We bumped up the SDK and added diagnostics to the logs.

darthkurak commented 3 years ago

Even on this PR it looks like it wasn't completed.

ealsur commented 3 years ago

@bchong95 Could you comment on the concurrency issue on the query diagnostics? Based on https://github.com/Azure/azure-cosmos-dotnet-v3/pull/2097#discussion_r571185998 it looks like this is not behaving as expected?

ealsur commented 3 years ago

@darthkurak If you have the full stack trace of the concurrency error, it would greatly help to debug or try to reproduce it.

darthkurak commented 3 years ago

@ealsur Full stack trace of concurrency error:

{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported","level":0,"line":0}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Collections.Generic.Dictionary`2.FindEntry","level":1,"line":0}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Collections.Generic.Dictionary`2.TryGetValue","level":2,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Tracing.TraceWriter.GetFileNameFromPath","level":3,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Tracing.TraceWriter+TraceJsonWriter.WriteTrace","level":4,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Tracing.TraceWriter+TraceJsonWriter.WriteTrace","level":5,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Tracing.TraceWriter.WriteTrace","level":6,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Diagnostics.CosmosTraceDiagnostics.WriteTraceToJsonWriter","level":7,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Diagnostics.CosmosTraceDiagnostics.ToJsonString","level":8,"line":0}
{"assembly":"Microsoft.Azure.Cosmos.Client, Version=3.17.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.Cosmos.Diagnostics.CosmosTraceDiagnostics.ToString","level":9,"line":0}
{"assembly":"FinAi.CustomerService, Version=0.47.1.0, Culture=neutral, PublicKeyToken=null","method":"FinAi.CustomerService.Extensions.CosmosExtensions+<ToListAsync>d__0`1.MoveNext","level":10,"line":28,"fileName":"/workspace/19/s/src/FinAi.CustomerService/Extensions/CosmosExtensions.cs"}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":11,"line":0}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess","level":12,"line":0}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":13,"line":0}
{"assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","method":"System.Runtime.CompilerServices.TaskAwaiter`1.GetResult","level":14,"line":0}
{"assembly":"FinAi.CustomerService, Version=0.47.1.0, Culture=neutral, PublicKeyToken=null","method":"FinAi.CustomerService.Services.UserSecretsStorage+<GetUserByPredicate>d__39.MoveNext","level":15,"line":630,"fileName":"/workspace/19/s/src/FinAi.CustomerService/Services/UserSecretsStorage.cs"}

@j82w We did update to 3.17, but we still observe the same issue. Over the last 7 days we have 36 occurrences. 1129 requests were made in total that could lead to this problematic query (we can't be sure exactly how many executions of this particular query there were, because we log only the problematic queries).
Of course, we have confirmation that exactly the same query, executed later by the user or later by us, produces correct results and does not drain the whole database.

j82w commented 3 years ago

@darthkurak can you please contact support so a support ticket is created? That will ensure the issue is properly prioritized.

darthkurak commented 3 years ago

I've created a ticket. How can I pass you the request ID?

bartelink commented 3 years ago

Anecdotal: I don't have analytics at that level using 3.17.0, but I am observing perf degradation over time. More significantly, calls start to freeze after about 2.5h of running within an app that has 2 V3 clients (pointed at different containers) and a V2 CFP client.

I've deployed 3.18.0-preview and it doesn't fix it (~can anyone tell me whether the 3.17.1 fixes are included in that please, given 3.17.1 does not seem to be on nuget as yet?~ I assume the 3.17.1 fixes are in the preview per the release notes).

I've good confidence this hang and perf degradation was not happening when using V2. πŸ€” the problem is I'll need to backport a lib to v2 to prove that, and that's not happening this week or next.

j82w commented 3 years ago

@darthkurak I was able to find the ticket. Can you try using 3.18.0-preview and capture the CosmosDiagnostics? It fixes the dictionary concurrency issue. There will be a 3.17.1 release later today which will also contain the fix.

@bartelink you are correct about the fixes. Here is the PR to update the changelog. Is there any way you can share a repro of the issue? Is it a memory leak?

bartelink commented 3 years ago

@j82w Thanks. Just to be clear: my 'report' should be considered wild speculation for now.

I'm not currently running the ingestion process so can't even chart memory growth atm.

The other thing that's impeding my progress toward doing proper issue reporting is that I'm having to mix V2 CFP usage with usage of the V3 client as we wait on https://github.com/Azure/azure-cosmos-dotnet-v3/issues/1765

darthkurak commented 3 years ago

Hi. We have used 3.17.1 and gathered logs.

FeedIterator.ToListAsync more than one page: PageCount: "1", RuUsage: "15191.44", PageNumber: 0
--
FeedIterator.ToListAsync more than one page: PageDiagnostics 0: "{\"name\":\"Typed FeedIterator ReadNextAsync\",\"id\":\"e5ba9312-74cc-4f4e-8beb-30c529bdf5ff\",\"component\":\"Transport\",\"caller info\":{\"member\":\"OperationHelperWithRootTraceAsync\",\"file\":\"ClientContextCore.cs\",\"line\":219},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":72098.5328,\"data\":{},\"children\":[{\"name\":\"Create Query Pipeline\",\"id\":\"e7cff711-6be9-4b47-a98a-0a515714156d\",\"component\":\"Query\",\"caller info\":{\"member\":\"TryCreateCoreContextAsync\",\"file\":\"CosmosQueryExecutionContextFactory.cs\",\"line\":85},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":3.3478,\"data\":{},\"children\":[{\"name\":\"Get Container Properties\",\"id\":\"e576a4ff-88d6-4b7a-99e2-27920fd310f1\",\"component\":\"Transport\",\"caller info\":{\"member\":\"GetCachedContainerPropertiesAsync\",\"file\":\"ClientContextCore.cs\",\"line\":363},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":0.015,\"data\":{},\"children\":[{\"name\":\"Get Collection Cache\",\"id\":\"35fae363-d249-472e-a6dc-c7dbd28bf3a9\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetCollectionCacheAsync\",\"file\":\"DocumentClient.cs\",\"line\":542},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":0.0023,\"data\":{},\"children\":[]}]},{\"name\":\"Gateway QueryPlan\",\"id\":\"b791abef-6e90-4912-848d-2af904e1a772\",\"component\":\"Query\",\"caller info\":{\"member\":\"GetQueryPlanThroughGatewayAsync\",\"file\":\"QueryPlanRetriever.cs\",\"line\":112},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":0.4665,\"data\":{},\"children\":[]},{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"2be847b6-c96e-463f-8feb-40c58e1a83e9\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":148},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":3.1101,\"data\":{},\"children\":[{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"f19bff7d-63e4-4843-8247-0089d058d65b\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":51},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":3.0981,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"c858da03-2dc4-40f9-914a-fd128ae20202\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":3.0919,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"f0bb0654-47f9-44a2-b657-985dcb5b2ab5\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":2.9771,\"data\":{\"CPU Load History\":{\"CPU History\":\"(2021-03-24T18:05:24.6185734Z 36.207), (2021-03-24T18:05:34.6198710Z 76.364), (2021-03-24T18:05:44.6153026Z 34.104), (2021-03-24T18:05:44.6190542Z 100.000), (2021-03-24T18:05:54.6182511Z 57.714), (2021-03-24T18:06:04.6197546Z 50.581)\"}},\"children\":[{\"name\":\"Send Async\",\"id\":\"fc3c9cb4-013e-4c0f-811e-27dbad7e269a\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":2.9727,\"data\":{},\"children\":[{\"name\":\"Send 
Async\",\"id\":\"5f324284-e13e-4b5b-b3e5-74135feadbd5\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"AbstractRetryHandler.cs\",\"line\":23},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":2.9703,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"00a8b457-1021-44de-9841-a9d67b75b83d\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":2.9592,\"data\":{},\"children\":[{\"name\":\"Microsoft.Azure.Cosmos.GatewayStoreModel Transport Request\",\"id\":\"d2d0f0aa-53a9-4331-95ff-131cf63dbe72\",\"component\":\"Transport\",\"caller info\":{\"member\":\"ProcessMessageAsync\",\"file\":\"TransportHandler.cs\",\"line\":87},\"start time\":\"06:06:09:093\",\"duration in milliseconds\":2.9001,\"data\":{\"User Agent\":\"cosmos-netstandard-sdk/3.17.1|3.17.1|03|X64|Linux 4.15.0-1106-azure 118 1|.NET Core 3.1.12|\",\"Client Side Request Stats\":{\"Id\":\"AggregatedClientSideRequestStatistics\",\"ContactedReplicas\":[],\"RegionsContacted\":[],\"FailedReplicas\":[],\"AddressResolutionStatistics\":[],\"StoreResponseStatistics\":[]},\"PointOperationStatisticsTraceDatum\":{\"Id\":\"PointOperationStatistics\",\"ActivityId\":null,\"ResponseTimeUtc\":\"2021-03-24T18:06:09.0963334Z\",\"StatusCode\":200,\"SubStatusCode\":0,\"RequestCharge\":0,\"RequestUri\":\"dbs/Database/colls/UserSecrets\",\"ErrorMessage\":null,\"RequestSessionToken\":null,\"ResponseSessionToken\":null}},\"children\":[]}]}]}]}]}]}]}]},{\"name\":\"Get Partition Key Ranges\",\"id\":\"9c06f651-effc-439b-8925-cd25fc0e2f53\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetTargetPartitionKeyRangesAsync\",\"file\":\"CosmosQueryClientCore.cs\",\"line\":240},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.0545,\"data\":{},\"children\":[{\"name\":\"Try Get Overlapping Ranges\",\"id\":\"d806fae9-b6d1-4ec5-a4de-6c7a976d93a9\",\"component\":\"Routing\",\"caller info\":{\"member\":\"TryGetOverlappingRangesAsync\",\"file\":\"PartitionKeyRangeCache.cs\",\"line\":49},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.04,\"data\":{},\"children\":[]}]}]},{\"name\":\"Prefetching\",\"id\":\"4dfd12b1-2139-4df7-8e83-273c499656c0\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"PrefetchInParallelAsync\",\"file\":\"ParallelPrefetch.cs\",\"line\":31},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.0013,\"data\":{},\"children\":[]},{\"name\":\"[,05C1D0) move next\",\"id\":\"b575ab4f-6d90-4ff1-a986-574349ddf014\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.9171,\"data\":{},\"children\":[{\"name\":\"[,05C1D0) move next\",\"id\":\"30475c77-675f-4851-88e7-be994edb484e\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.9128,\"data\":{},\"children\":[{\"name\":\"Prefetch\",\"id\":\"45bb506a-f43c-4157-94bc-f70d958b15df\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"PrefetchAsync\",\"file\":\"BufferedPartitionRangePageAsyncEnumerator.cs\",\"line\":54},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.9068,\"data\":{},\"children\":[{\"name\":\"[,05C1D0) move 
next\",\"id\":\"8d9faa4a-1d4b-4c0c-bfb6-9a12b6e37e91\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.7497,\"data\":{\"Query Metrics\":\"Retrieved Document Count                 :               0             \nRetrieved Document Size                  :               0 bytes       \nOutput Document Count                    :               0             \nOutput Document Size                     :              49 bytes       \nIndex Utilization                        :            0.00 %           \nTotal Query Execution Time               :            0.77 milliseconds\n  Query Preparation Time                 :            0.48 milliseconds\n  Index Lookup Time                      :            0.07 milliseconds\n  Document Load Time                     :            0.00 milliseconds\n  Runtime Execution Times                :            0.00 milliseconds\n  Document Write Time                    :            0.00 milliseconds\n\nIndex Utilization Information\n  Utilized Single Indexes\n  Potential Single Indexes\n  Utilized Composite Indexes\n  Potential Composite Indexes\n\"},\"children\":[{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"76fe72fc-d1ea-47f3-a0cd-9a37bb00f215\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":148},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.5092,\"data\":{},\"children\":[{\"name\":\"Get Partition Key Range Cache\",\"id\":\"5102a7c9-f1d0-434b-bbde-f4d112e00023\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetPartitionKeyRangeCacheAsync\",\"file\":\"DocumentClient.cs\",\"line\":551},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.001,\"data\":{},\"children\":[]},{\"name\":\"Get Collection Cache\",\"id\":\"a25e4529-59fb-4f75-86c6-b1fc34f55e78\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetCollectionCacheAsync\",\"file\":\"DocumentClient.cs\",\"line\":542},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.0003,\"data\":{},\"children\":[]},{\"name\":\"Try Get Overlapping Ranges\",\"id\":\"4bb61bdf-8448-4b26-bc77-9587258621e4\",\"component\":\"Routing\",\"caller info\":{\"member\":\"TryGetOverlappingRangesAsync\",\"file\":\"PartitionKeyRangeCache.cs\",\"line\":49},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":0.007,\"data\":{},\"children\":[]},{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"0a81297b-b543-4d27-992c-05abd1b4ce52\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":51},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.4734,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"33579de3-2be0-4e6a-9df1-b89681d30303\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.4664,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"e5226973-28c9-49a5-a9c8-a4ce2f41608e\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.3657,\"data\":{\"CPU Load History\":{\"CPU History\":\"(2021-03-24T18:05:24.6185734Z 36.207), (2021-03-24T18:05:34.6198710Z 
76.364), (2021-03-24T18:05:44.6153026Z 34.104), (2021-03-24T18:05:44.6190542Z 100.000), (2021-03-24T18:05:54.6182511Z 57.714), (2021-03-24T18:06:04.6197546Z 50.581)\"}},\"children\":[{\"name\":\"Send Async\",\"id\":\"04438472-ca49-4375-ad8f-d249c929f838\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.3624,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"4706de6b-a7cd-4909-978e-07b758d1ce05\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"AbstractRetryHandler.cs\",\"line\":23},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.3603,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"daa6a88d-3019-484d-823e-5a151c3017a3\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.355,\"data\":{},\"children\":[{\"name\":\"Microsoft.Azure.Documents.ServerStoreModel Transport Request\",\"id\":\"2ac97370-fcf8-4142-ad9a-8365abf6623c\",\"component\":\"Transport\",\"caller info\":{\"member\":\"ProcessMessageAsync\",\"file\":\"TransportHandler.cs\",\"line\":87},\"start time\":\"06:06:09:096\",\"duration in milliseconds\":3.321,\"data\":{\"User Agent\":\"cosmos-netstandard-sdk/3.17.1|3.17.1|03|X64|Linux 4.15.0-1106-azure 118 1|.NET Core 3.1.12|\",\"Client Side Request Stats\":{\"Id\":\"AggregatedClientSideRequestStatistics\",\"ContactedReplicas\":[],\"RegionsContacted\":[\"https://custsrv-db-vabank-xyz-westeurope.documents.azure.com/\"],\"FailedReplicas\":[],\"AddressResolutionStatistics\":[],\"StoreResponseStatistics\":[{\"ResponseTimeUTC\":\"2021-03-24T18:06:09.1002340Z\",\"ResourceType\":\"Document\",\"OperationType\":\"Query\",\"LocationEndpoint\":\"https://custsrv-db-vabank-xyz-westeurope.documents.azure.com/\",\"ActivityId\":\"5856c533-aa86-4f20-b4ec-ed6d23a3b058\",\"StoreResult\":\"StorePhysicalAddress: rntbd://custsrv-db-vabank-xyz-westeurope.documents.azure.com:27632/apps/110d63ae-611a-40ca-a4dc-faca491405a5/services/67675117-7134-435a-9df6-3a9f09093742/partitions/fe66ad53-8686-4e0c-b00e-9a8c344b87b3/replicas/132606337350411727p/, LSN: 35443085, GlobalCommittedLsn: 35443085, PartitionKeyRangeId: 3, IsValid: True, StatusCode: 200, SubStatusCode: 0, RequestCharge: 2.79, ItemLSN: -1, SessionToken: -1#35443085, UsingLocalLSN: True, TransportException: null\"}]}},\"children\":[]}]}]}]}]}]}]}]},{\"name\":\"Get Cosmos Element Response\",\"id\":\"fb397488-019b-4f45-8b1c-35452debde7f\",\"component\":\"Json\",\"caller info\":{\"member\":\"GetCosmosElementResponse\",\"file\":\"CosmosQueryClientCore.cs\",\"line\":282},\"start time\":\"06:06:09:100\",\"duration in milliseconds\":0.1223,\"data\":{},\"children\":[]}]}]}]}]},{\"name\":\"POCO Materialization\",\"id\":\"58888cda-249f-4b9f-a84c-ce43b2efd396\",\"component\":\"Poco\",\"caller info\":{\"member\":\"ReadNextAsync\",\"file\":\"FeedIteratorCore.cs\",\"line\":247},\"start time\":\"06:07:21:191\",\"duration in milliseconds\":0.1079,\"data\":{},\"children\":[]}]}"
--
FeedIterator.ToListAsync more than one page: PageCount: 0, RuUsage: 39391.058999999994, PageNumber: 1
--
FeedIterator.ToListAsync more than one page: PageDiagnostics 1: "{\"name\":\"Typed FeedIterator ReadNextAsync\",\"id\":\"31adb209-b0e5-43a5-958a-25a96963fe3e\",\"component\":\"Transport\",\"caller info\":{\"member\":\"OperationHelperWithRootTraceAsync\",\"file\":\"ClientContextCore.cs\",\"line\":219},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":59751.2997,\"data\":{},\"children\":[{\"name\":\"Prefetching\",\"id\":\"2d5f6be2-9e72-4a38-9426-fb32775ef75a\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"PrefetchInParallelAsync\",\"file\":\"ParallelPrefetch.cs\",\"line\":31},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":0.0016,\"data\":{},\"children\":[]},{\"name\":\"[05C1D0,05C1E0) move next\",\"id\":\"3084fa0e-bf83-470c-8486-73dbaf116076\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4501.1199,\"data\":{},\"children\":[{\"name\":\"[05C1D0,05C1E0) move next\",\"id\":\"ccbfb27e-ba2a-4561-a01b-3993499b5ecf\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4501.1151,\"data\":{},\"children\":[{\"name\":\"Prefetch\",\"id\":\"ace18873-f69a-4cae-999c-61c531200449\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"PrefetchAsync\",\"file\":\"BufferedPartitionRangePageAsyncEnumerator.cs\",\"line\":54},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4501.1094,\"data\":{},\"children\":[{\"name\":\"[05C1D0,05C1E0) move next\",\"id\":\"bad72e5e-ca70-49c2-aa1f-e2f433c9c96d\",\"component\":\"Pagination\",\"caller info\":{\"member\":\"MoveNextAsync\",\"file\":\"PartitionRangePageAsyncEnumerator.cs\",\"line\":49},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4501.1046,\"data\":{\"Query Metrics\":\"Retrieved Document Count                 :          46,027             \nRetrieved Document Size                  :      47,178,918 bytes       \nOutput Document Count                    :               0             \nOutput Document Size                     :              49 bytes       \nIndex Utilization                        :            0.00 %           \nTotal Query Execution Time               :        4,496.98 milliseconds\n  Query Preparation Time                 :            0.73 milliseconds\n  Index Lookup Time                      :          283.03 milliseconds\n  Document Load Time                     :        4,077.54 milliseconds\n  Runtime Execution Times                :           20.07 milliseconds\n  Document Write Time                    :            0.00 milliseconds\n\nIndex Utilization Information\n  Utilized Single Indexes\n  Potential Single Indexes\n  Utilized Composite Indexes\n  Potential Composite Indexes\n\"},\"children\":[{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"ca8d57ab-6e0a-4738-8bce-02faae8e2d2c\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":148},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.8872,\"data\":{},\"children\":[{\"name\":\"Get Partition Key Range Cache\",\"id\":\"d5aad975-01d5-4d40-8c53-e4ed91067185\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetPartitionKeyRangeCacheAsync\",\"file\":\"DocumentClient.cs\",\"line\":551},\"start 
time\":\"06:07:21:194\",\"duration in milliseconds\":0.0008,\"data\":{},\"children\":[]},{\"name\":\"Get Collection Cache\",\"id\":\"88b6c9be-b1c5-49e2-b1cf-e5500cb20b47\",\"component\":\"Routing\",\"caller info\":{\"member\":\"GetCollectionCacheAsync\",\"file\":\"DocumentClient.cs\",\"line\":542},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":0.0005,\"data\":{},\"children\":[]},{\"name\":\"Try Get Overlapping Ranges\",\"id\":\"e66eb5bc-fb8a-4f2f-9a70-c31682398b61\",\"component\":\"Routing\",\"caller info\":{\"member\":\"TryGetOverlappingRangesAsync\",\"file\":\"PartitionKeyRangeCache.cs\",\"line\":49},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":0.0117,\"data\":{},\"children\":[]},{\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"id\":\"ed281da7-47bc-4497-92ad-0a628d2430fa\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestInvokerHandler.cs\",\"line\":51},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.8402,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"f3ab0f73-6cae-47af-ae81-e3d7b1b43a0e\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.8332,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"104e68af-2e8e-4b73-849a-38eb03cd4866\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.6289,\"data\":{\"CPU Load History\":{\"CPU History\":\"(2021-03-24T18:06:24.6182303Z 86.982), (2021-03-24T18:06:34.6185512Z 62.857), (2021-03-24T18:06:44.6219761Z 89.535), (2021-03-24T18:06:54.6251765Z 38.462), (2021-03-24T18:07:04.6179766Z 87.209), (2021-03-24T18:07:14.6182500Z 16.766)\"}},\"children\":[{\"name\":\"Send Async\",\"id\":\"5cf83da4-b6ad-4b96-a83d-9c039cd551c9\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.6255,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"68662b53-bf9b-478c-9f16-7c6fd4db719c\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"AbstractRetryHandler.cs\",\"line\":23},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.6232,\"data\":{},\"children\":[{\"name\":\"Send Async\",\"id\":\"2133f6a4-cc62-4587-9c77-6ecbc07e329b\",\"component\":\"RequestHandler\",\"caller info\":{\"member\":\"SendAsync\",\"file\":\"RequestHandler.cs\",\"line\":50},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.6151,\"data\":{},\"children\":[{\"name\":\"Microsoft.Azure.Documents.ServerStoreModel Transport Request\",\"id\":\"5bb38962-1c9a-4717-9546-fbdbcbbc6676\",\"component\":\"Transport\",\"caller info\":{\"member\":\"ProcessMessageAsync\",\"file\":\"TransportHandler.cs\",\"line\":87},\"start time\":\"06:07:21:194\",\"duration in milliseconds\":4500.5884,\"data\":{\"User Agent\":\"cosmos-netstandard-sdk/3.17.1|3.17.1|03|X64|Linux 4.15.0-1106-azure 118 1|.NET Core 3.1.12|\",\"Client Side Request 
Stats\":{\"Id\":\"AggregatedClientSideRequestStatistics\",\"ContactedReplicas\":[],\"RegionsContacted\":[\"https://custsrv-db-vabank-xyz-westeurope.documents.azure.com/\"],\"FailedReplicas\":[],\"AddressResolutionStatistics\":[],\"StoreResponseStatistics\":[{\"ResponseTimeUTC\":\"2021-03-24T18:07:25.6948624Z\",\"ResourceType\":\"Document\",\"OperationType\":\"Query\",\"LocationEndpoint\":\"https://custsrv-db-vabank-xyz-westeurope.documents.azure.com/\",\"ActivityId\":\"7c7af65b-9938-4f91-8c89-ac587b071378\",\"StoreResult\":\"StorePhysicalAddress: rntbd://custsrv-db-vabank-xyz-westeurope.documents.azure.com:27414/apps/110d63ae-611a-40ca-a4dc-faca491405a5/services/d90085ea-f2e1-45b7-be42-2abd76e0c190/partitions/f2449273-b9e3-4e82-a20b-df342442c7b0/replicas/132606076152988533s/, LSN: 36277801, GlobalCommittedLsn: 36277800, PartitionKeyRangeId: 4, IsValid: True, StatusCode: 200, SubStatusCode: 0, RequestCharge: 1650.51, ItemLSN: -1, SessionToken: -1#36277801, UsingLocalLSN: True, TransportException: null\"}]}},\"children\":[]}]}]}]}]}]}]}]},{\"name\":\"Get Cosmos Element Response\",\"id\":\"579de056-8b85-49dd-90f8-43411dc42983\",\"component\":\"Json\",\"caller info\":{\"member\":\"GetCosmosElementResponse\",\"file\":\"CosmosQueryClientCore.cs\",\"line\":282},\"start time\":\"06:07:25:695\",\"duration in milliseconds\":0.1054,\"data\":{},\"children\":[]}]}]}]}]},{\"name\":\"POCO Materialization\",\"id\":\"7f191c63-203c-4fc0-bfc2-2fbe35a42f9d\",\"component\":\"Poco\",\"caller info\":{\"member\":\"ReadNextAsync\",\"file\":\"FeedIteratorCore.cs\",\"line\":247},\"start time\":\"06:08:20:945\",\"duration in milliseconds\":0.0322,\"data\":{},\"children\":[]}]}"
--
FeedIterator make more than one roundtrip! RuUsage: 54582.498999999996, Query: "{\"query\":\"SELECT TOP 10 VALUE root FROM root WHERE ((true OR (root[\\"User\\"][\\"IsDeleted\\"] = false)) AND StringEquals(root[\\"User\\"][\\"AllegroData\\"][\\"Email\\"], \\"some@mail\\", true)) ORDER BY root[\\"_ts\\"] DESC\"}"

From App Insights it looked like this: image. Calling ReadNextAsync the first time, 16 dependencies were registered in App Insights, and taking the next page (the next ReadNextAsync), 15 dependencies were registered. So overall, Cosmos DB was called 31 times for a single SELECT TOP 10 query.

j82w commented 3 years ago

Query: "SELECT TOP 10 VALUE root FROM root WHERE ((true OR (root[\"User\"][\"IsDeleted\"] = false)) AND StringEquals(root[\"User\"][\"AllegroData\"][\"Email\"], \"some@mail\", true)) ORDER BY root[\"_ts\"] DESC\"}

Here are the diagnostics, formatted. I only see the SDK making 2 page requests in direct mode.

root|root
└── Typed FeedIterator ReadNextAsync|121.3821
    β”œβ”€β”€ MoveNextAsync|66.4455
    β”‚   └── [,05C1CFFFFFFFF8) move next|66.3757
    β”‚       └── Prefetch|66.3639
    β”‚           └── [,05C1CFFFFFFFF8) move next|66.3568 | Retrieved Document Count                 :             250             
Retrieved Document Size                  :       2,584,597 bytes       
Output Document Count                    :             250             
Output Document Size                     :         920,129 bytes       
Index Utilization                        :          100.00 %           
Total Query Execution Time               :           12.31 milliseconds
  Query Preparation Time                 :            0.57 milliseconds
  Index Lookup Time                      :            0.01 milliseconds
  Document Load Time                     :            8.18 milliseconds
  Runtime Execution Times                :            0.00 milliseconds
  Document Write Time                    :            1.43 milliseconds

Index Utilization Information
  Utilized Single Indexes
  Potential Single Indexes
  Utilized Composite Indexes
  Potential Composite Indexes

    β”‚               β”œβ”€β”€ Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler|28.8377
    β”‚               β”‚   β”œβ”€β”€ Get Collection Cache|0.0004
    β”‚               β”‚   β”œβ”€β”€ Try Get Overlapping Ranges|0.0226
    β”‚               β”‚   └── Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler|28.7178
    β”‚               β”‚       └── Send Async|28.6794
    β”‚               β”‚           └── Send Async|28.6736
    β”‚               β”‚               └── Send Async|28.672
    β”‚               β”‚                   └── Send Async|28.6448
    β”‚               β”‚                       └── Microsoft.Azure.Documents.ServerStoreModel Transport Request|28.5822
    β”‚               └── Get Cosmos Element Response|37.3652
    └── POCO Materialization|53.9377
root|root
└── Typed FeedIterator ReadNextAsync|59751.2997
    β”œβ”€β”€ Prefetching|0.0016
    β”œβ”€β”€ [05C1D0,05C1E0) move next|4501.1199
    β”‚   └── [05C1D0,05C1E0) move next|4501.1151
    β”‚       └── Prefetch|4501.1094
    β”‚           └── [05C1D0,05C1E0) move next|4501.1046 | Retrieved Document Count                 :          46,027             
Retrieved Document Size                  :      47,178,918 bytes       
Output Document Count                    :               0             
Output Document Size                     :              49 bytes       
Index Utilization                        :            0.00 %           
Total Query Execution Time               :        4,496.98 milliseconds
  Query Preparation Time                 :            0.73 milliseconds
  Index Lookup Time                      :          283.03 milliseconds
  Document Load Time                     :        4,077.54 milliseconds
  Runtime Execution Times                :           20.07 milliseconds
  Document Write Time                    :            0.00 milliseconds

Index Utilization Information
  Utilized Single Indexes
  Potential Single Indexes
  Utilized Composite Indexes
  Potential Composite Indexes

    β”‚               β”œβ”€β”€ Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler|4500.8872
    β”‚               β”‚   β”œβ”€β”€ Get Partition Key Range Cache|0.0008
    β”‚               β”‚   β”œβ”€β”€ Get Collection Cache|0.0005
    β”‚               β”‚   β”œβ”€β”€ Try Get Overlapping Ranges|0.0117
    β”‚               β”‚   └── Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler|4500.8402
    β”‚               β”‚       └── Send Async|4500.8332
    β”‚               β”‚           └── Send Async|4500.6289
    β”‚               β”‚               └── Send Async|4500.6255
    β”‚               β”‚                   └── Send Async|4500.6232
    β”‚               β”‚                       └── Send Async|4500.6151
    β”‚               β”‚                           └── Microsoft.Azure.Documents.ServerStoreModel Transport Request|4500.5884
    β”‚               └── Get Cosmos Element Response|0.1054
    └── POCO Materialization|0.0322
YohanSciubukgian commented 3 years ago

Is it related to #1486?

neildsh commented 3 years ago

This should already be fixed by https://github.com/Azure/azure-cosmos-dotnet-v3/pull/2168

ealsur commented 3 years ago

@neildsh The user is on 3.17.1, which already contains that PR. Could this then be a different case?

neildsh commented 3 years ago

> @neildsh The user is on 3.17.1, which already contains that PR. Could this then be a different case?

Hi @ealsur, the symptom of loading all documents was seen when the user was on 3.16. Then the user upgraded to 3.17.1, for which we see only two requests being sent out from the SDK. However, they ran into the concurrent modification of diagnostics issue, which led @j82w to recommend that they use the 3.18.0-preview.

darthkurak commented 3 years ago

Sorry, but I'm a bit confused by your conversation. What is the conclusion? To be clear: we still observe the behavior of loading all documents on 3.17.1, against the logic of the executed query and the HasMoreResults code. Diagnostic logs for the executed query were provided above. Our logging, based on the Cosmos SDK request handler, indicates 31 calls in total for this particular query execution. It doesn't always happen. We have a few occurrences per day, mainly in the morning. Exactly the same query executed manually from the portal, or repeated later by the user, works correctly. 3.17.1 fixed the concurrency problem with the dictionary in the diagnostics.

neildsh commented 3 years ago

Hi @darthkurak, I think a few things are happening here:

The search performed is very expensive since it is case insensitive. Also, on the portal, it looks like we are loading only one document and not actually draining the continuation fully. That would explain why it is cheaper... it could be that the portal is getting "lucky" when it hits the partitions in sequence and the very first partition has a matching result, and we don't drain the continuation after that. OTOH, the SDK hits all partitions in parallel, so you have to pay for all those calls.

I would recommend trying an exact match on the email instead of the case insensitive string comparison. I think that we should see a significant improvement here.
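
For illustration, a sketch of the exact-match variant (this assumes the stored email casing is known or normalized; the parameter name is made up):

// Case-sensitive equality can be answered from the index,
// unlike StringEquals(..., true), which forces document loads.
QueryDefinition exactMatch = new QueryDefinition(
        "SELECT TOP 10 VALUE root FROM root " +
        "WHERE root[\"User\"][\"AllegroData\"][\"Email\"] = @email " +
        "ORDER BY root[\"_ts\"] DESC")
    .WithParameter("@email", "somestandard_nothingspecial@email.here");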

darthkurak commented 3 years ago

I know that the search is very expensive. Normally it costs us around 800-900 RU. But there are a few searches (exactly the same) which cost 50K RUs. And from the client side the SDK generates 31 calls to Cosmos. So it doesn't look like normal behavior.

j82w commented 3 years ago

@darthkurak the diagnostics you provided only show 2 network calls in direct mode (TCP).

  1. Can you please change your application to use a singleton client? From the diagnostics there are 3 client instances so far. This will cause a lot of issues in the future, including high latency, because the SDK needs to warm up its caches for each new instance (see the sketch after this list).
  2. Based on the code posted in this issue and the output you shared, the diagnostics of the first page are missing. Can you please capture the diagnostics for all the pages?
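
A minimal sketch of the singleton recommendation from point 1, assuming an ASP.NET Core host (the configuration key and application name are hypothetical):

// Program.cs / Startup.cs: register one CosmosClient per account for the app's lifetime.
services.AddSingleton<CosmosClient>(serviceProvider =>
{
    string connectionString = configuration["CosmosDb:ConnectionString"]; // assumed config key
    return new CosmosClient(connectionString, new CosmosClientOptions
    {
        ApplicationName = "FinAi.CustomerService" // optional; appears in the SDK user agent
    });
});
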
neildsh commented 3 years ago

> I know that the search is very expensive. Normally it costs us around 800-900 RU. But there are a few searches (exactly the same) which cost 50K RUs. And from the client side the SDK generates 31 calls to Cosmos. So it doesn't look like normal behavior.

Hi @darthkurak, as @j82w suggested, can you please gather the full diagnostics for these requests that you mentioned? When you say the searches are "exactly the same", do you mean that exactly the same query (including parameters) costs 800-900 RU sometimes and 50k RU at other times?

Did you consider my recommendation above of using an exact match instead of a case-insensitive match? That will help a lot to reduce the latency and cost.

darthkurak commented 3 years ago

@j82w We can't change to a singleton right now. We have 3 instances on purpose, as each of them manages a separate resource. Maybe one of them is redundant, so we will fix this soon. Good to know, we will optimize it. But this is another matter; I see no connection with the reported problem.

I pasted the diagnostics logs for both pages. You even formatted both of them :) https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2313#issuecomment-806407369 Also, as I wrote - from the SDK point of view we have two pages, but these two pages made 31 calls to Cosmos. Please take a look at the provided App Insights screenshot. In the captured occurrence, one page generated 16 calls, the second 15. For each call I have Query Metrics if needed.

@neildsh I gathered full diagnostics for both pages; you have them already. "When you say the searches are 'exactly the same', do you mean that exactly the same query (including parameters) costs 800-900 RU sometimes and 50k RU at others?" Yes, exactly. This is why I opened this thread in the first place. Cost, behavior and the captured diagnostics indicate that, for an unknown reason, this particular query sometimes goes crazy and loads ALL documents (retrieved documents), AGAINST its LOGIC, with multiple calls behind the scenes, in the end returning 0 or 1 results to the client.

We can't use an exact match right now, as the case-insensitive comparison is needed for our business logic. We are aware of its cost and suboptimal performance (it costs around 800 RU) and we accept that for now, but these 50K+ executions just break our production SLI, causing performance degradation on Cosmos for no reason. They have no logical justification in our code or in any documentation provided for the Cosmos DB service.

Please take a careful look at the data and description I provided. We want to resolve the 50k+ issue in the first place, not optimize other things (which we will certainly take a closer look at later - thanks for pointing them out). I would like to find a common understanding of what exact problem we are addressing here. It's not a matter of optimizing this query. It's a matter of finding the root cause of the random 50k+ RU executions of the SAME query, which most of the time executes and behaves as expected (costing 700-800 RU and taking 300-400 ms). A broken execution costs 50K+ RU, takes around 2 minutes, sometimes even more, and the gathered QUERY METRICS indicate that it touches all documents in the collection (retrieved count), making around 30 calls to Cosmos DB and, in the end, returning one or zero documents to the client.

darthkurak commented 3 years ago

You can see that ReadNextAsync for the first page takes 72s, but the actual prefetch takes 4.5s and there is nothing more in the diagnostics logs that would show where the remaining ~68s is spent. I think your diagnostics logs somehow do not capture this edge case, because our independent logging, based on https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.requesthandler?view=azure-dotnet (the App Insights screenshot included above comes from it), reports many more requests, each also taking about 4.5 seconds and summing up to those 72s. Also, the diagnostics logs you formatted for the first page do not match the ones I pasted (wrong copy-paste?). For the second page the case is exactly the same: we have 60s, but the diagnostics show only one 4.5-second call, and the rest of the time is unaccounted for:

root|root
└── Typed FeedIterator ReadNextAsync|59751.2997
    β”œβ”€β”€ Prefetching|0.0016
    β”œβ”€β”€ [05C1D0,05C1E0) move next|4501.1199
    β”‚   └── [05C1D0,05C1E0) move next|4501.1151
    β”‚       └── Prefetch|4501.1094
    β”‚           └── [05C1D0,05C1E0) move next|4501.1046 | Retrieved Document Count
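
For context, a minimal sketch of the kind of RequestHandler-based logging referenced above (the handler name and registration are illustrative assumptions, not the reporter's actual production code):

// A custom handler sees every request the SDK sends, so it can count the
// backend calls made on behalf of a single logical operation such as ReadNextAsync.
public class CountingRequestHandler : Microsoft.Azure.Cosmos.RequestHandler
{
    private int requestCount;

    public int RequestCount => requestCount;

    public override async Task<ResponseMessage> SendAsync(
        RequestMessage request, CancellationToken cancellationToken)
    {
        System.Threading.Interlocked.Increment(ref requestCount);
        return await base.SendAsync(request, cancellationToken);
    }
}

// Registration (illustrative):
// var options = new CosmosClientOptions();
// options.CustomHandlers.Add(new CountingRequestHandler());
// var client = new CosmosClient(connectionString, options);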

j82w commented 3 years ago

If they each manage unique resources then 3 clients are not an issue. The scenario that is a problem is the SDK client getting re-created for the same resource. We have seen users create a new client for each Cosmos operation, which causes issues since the caches have to be warmed for each new instance.

Just to clarify, the times in the diagnostics are in milliseconds, not seconds.

darthkurak commented 3 years ago

I know, but 59751.2997 ms gives us 60 seconds, right? :) That is not the case. We do not create clients each time. We create them upfront at startup and they are reused :)

darthkurak commented 3 years ago

Any update? :)

neildsh commented 3 years ago

> You can see that ReadNextAsync for the first page takes 72s, but the actual prefetch takes 4.5s and there is nothing more in the diagnostics logs that would show where the remaining ~68s is spent. I think your diagnostics logs somehow do not capture this edge case, because our independent logging, based on https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.requesthandler?view=azure-dotnet (the App Insights screenshot included above comes from it), reports many more requests, each also taking about 4.5 seconds and summing up to those 72s. Also, the diagnostics logs you formatted for the first page do not match the ones I pasted (wrong copy-paste?). For the second page the case is exactly the same: we have 60s, but the diagnostics show only one 4.5-second call, and the rest of the time is unaccounted for:

> root|root
> └── Typed FeedIterator ReadNextAsync|59751.2997
>     β”œβ”€β”€ Prefetching|0.0016
>     β”œβ”€β”€ [05C1D0,05C1E0) move next|4501.1199
>     β”‚   └── [05C1D0,05C1E0) move next|4501.1151
>     β”‚       └── Prefetch|4501.1094
>     β”‚           └── [05C1D0,05C1E0) move next|4501.1046 | Retrieved Document Count

Hi @darthkurak, thank you for following up. The backend logs will always show us all the requests. However, we do need the activityId to look them up. For these two activityIds that you shared initially:

7c7af65b-9938-4f91-8c89-ac587b071378
5856c533-aa86-4f20-b4ec-ed6d23a3b058

We did see a bunch of continuations on the backend because the search is case insensitive, so we were not able to push it to the index as efficiently; this results in more documents being loaded on the backend and hence the high RU cost. The backend is combing through all the documents until it finds the one match. This does not seem like an SDK issue. Since you already have an ICM open on the backend, let's follow up there to make the query more efficient. OTOH, if the claim is that this only happens some of the time for exactly one query, we can investigate that as well. However, I am going to close this issue here, since it does not look like an SDK issue at this point.

YohanSciubukgian commented 3 years ago

@neildsh Where could we (the community) track the fix/improvement done on the backend side, if it's not in this PR?

darthkurak commented 3 years ago

@neildsh yes, exactly the same query (with the same parameters) results most of the time in 700-800 RUs and one call to the backend. But sometimes it takes 50k RUs and multiple calls to the backend. It seems like there is still a misunderstanding of the problem here. I have evident logs which tell exactly what calls were made. If this is on the backend side - why do I see in the App Insights logs 31 calls to the Cosmos instance produced by 2 calls of ReadNextAsync?

timsander1 commented 3 years ago

Hi @YohanSciubukgian, backend improvements aren't tracked here, but we'll blog about significant ones here: https://devblogs.microsoft.com/cosmosdb/

@darthkurak, if you're still seeing the 31 calls to Cosmos DB for that query, could you share example diagnostics that show the 31 calls? Based on the diagnostics shared above, I only see 2 calls. For improving StringEquals performance, have you tried the tip here? https://docs.microsoft.com/azure/cosmos-db/troubleshoot-query-performance#improve-string-system-function-execution
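
One possible shape of that tip, sketched as an assumption here: store a precomputed lowercased copy of the email (a hypothetical EmailLower property) when writing the document, then query it with a plain equality that the index can serve:

// Hypothetical write-side normalization:
// user.AllegroData.EmailLower = user.AllegroData.Email?.ToLowerInvariant();

// Read side: exact match against the normalized property, lowering the parameter client-side.
QueryDefinition normalized = new QueryDefinition(
        "SELECT TOP 10 VALUE root FROM root " +
        "WHERE root[\"User\"][\"AllegroData\"][\"EmailLower\"] = @email " +
        "ORDER BY root[\"_ts\"] DESC")
    .WithParameter("@email", "Somestandard_NothingSpecial@Email.Here".ToLowerInvariant());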