OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License
7.37k stars 2.38k forks source link

[GraphQL] A concurrent update was performed on this collection and corrupted its state. #15731

Open lampersky opened 5 months ago

lampersky commented 5 months ago

When using GraphiQL editor, sometimes I'm getting this exception, when querying content items with nested lists. It's not consistently repeatable for me (approx once every five queries).

image

Stacktrace:

GraphQL.Execution.UnhandledError: Error trying to resolve field 'contentItems'.
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at YesSql.Session.Get[T](IList`1 documents, String collection)
   at YesSql.Services.DefaultQuery.Query`1.ListImpl()
   at YesSql.Services.DefaultQuery.Query`1.ListImpl()
   at OrchardCore.ContentManagement.GraphQL.DataLoaderExtensions.LoadPublishedContentItemsAsync(IEnumerable`1 contentItemIds, ISession session)
   at GraphQL.DataLoader.CollectionBatchDataLoader`2.FetchAsync(IEnumerable`1 list, CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderPair`2.GetResultAsync(CancellationToken cancellationToken) in /_/src/GraphQL.DataLoader/DataLoaderPair.cs:line 81
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderPair`2.GetResultAsync(CancellationToken cancellationToken) in /_/src/GraphQL.DataLoader/DataLoaderPair.cs:line 81
   at GraphQL.DataLoader.DataLoaderPair`2.GetResultAsync(CancellationToken cancellationToken) in /_/src/GraphQL.DataLoader/DataLoaderPair.cs:line 81
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderExtensions.<>c__DisplayClass5_0`2.<<Then>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   at GraphQL.DataLoader.SimpleDataLoader`1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync(CancellationToken cancellationToken)
   at GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 515
   --- End of inner exception stack trace ---

@sebastienros I've just found that you fixed similar issue 3 years ago.

aaronamm commented 5 months ago

Following.

sebastienros commented 5 months ago

Probably not the same fix but same root cause. If there is a query that returns multiple objects (in our case content items), or on object has multiple children to load (content items again) then if GraphQL.NET uses async operations that are not awaited, or if it uses threads (Parallel.ForEach for instance) at that point it's possible that two concurrent accesses to the same session (the one from the scoped DI) will conflict.

The solution is to find from the query how GraphQL.NET is handling these. With the previous version we had changed the default behavior which was to do thinks asynchronously to synchronously, I even filed a bug in their repos to tell them it's shouldn't be a default, too risky. Now it could just be our code that is not awaiting a task (like the fix you linked).

I will try to follow the breadcrumbs from the trace you shared: GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync GraphQL.DataLoader.SimpleDataLoader1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync OrchardCore.ContentManagement.GraphQL.DataLoaderExtensions.LoadPublishedContentItemsAsync GraphQL.DataLoader.CollectionBatchDataLoader.FetchAsync

sebastienros commented 5 months ago

@lampersky Is your query using content items containing or contained in a ContentPickerField?

lampersky commented 5 months ago

@sebastienros yes

lampersky commented 5 months ago

I wasn't able to repro on sqlite. But will try tomorrow. The reported problem is on SQL Server.

Query looks sth like this, it is not the original one. So, there are nested lists, there are content item pickers and also bag parts.

query MyQuery {
  event(where: { contentItemId : "40ke33ne4k931544be1hrk57cb" }) {
    displayText
    contentItemId
    organizers {
      contentItems(first: 30) {
        ... on Speaker {
                displayText
                modifiedUtc
                contentType
                createdUtc
                latest
                owner
                published
                publishedUtc
        }
      }
    }
    list {
      contentItems(first: 300) {
        ... on Track {
          displayText
          contentItemId
          moderators {
            contentItems(first: 30) {
              ... on Speaker {
                displayText
                modifiedUtc
                contentType
                createdUtc
                latest
                owner
                published
                publishedUtc
              }
            }
          }
          list {
            contentItems(first: 300) {
              ... on Session {
                displayText
                modifiedUtc
                contentType
                createdUtc
                latest
                owner
                published
                publishedUtc
                test {
                  contentItems(first: 30) {
                    ... on Speaker {
                      displayText
                      contentItemId
                    }
                  }
                }
                speakers {
                  contentItems(first: 20) {
                    ... on Speaker {
                      displayText
                      modifiedUtc
                      contentType
                      createdUtc
                      latest
                      owner
                      published
                      publishedUtc
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
lampersky commented 5 months ago

@sebastienros this is the smallest possible query, all unnecessary things were removed (ContentPickers, BagParts etc.), so we only have nested list parts

image

GraphQL.Execution.UnhandledError: Error trying to resolve field 'contentItems'.
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at YesSql.Session.Get[T](IList`1 documents, String collection)
   at YesSql.Services.DefaultQuery.Query`1.ListImpl()
   at YesSql.Services.DefaultQuery.Query`1.ListImpl()
   at OrchardCore.Lists.GraphQL.ListQueryObjectType.LoadPublishedContentItemsForListAsync(IEnumerable`1 contentItemIds, ISession session)
   at GraphQL.DataLoader.CollectionBatchDataLoader`2.FetchAsync(IEnumerable`1 list, CancellationToken cancellationToken)
   at GraphQL.DataLoader.DataLoaderPair`2.GetResultAsync(CancellationToken cancellationToken) in /_/src/GraphQL.DataLoader/DataLoaderPair.cs:line 81
   at OrchardCore.Lists.GraphQL.ListQueryObjectType.<>c.<<-ctor>b__0_2>d.MoveNext()
--- End of stack trace from previous location ---
   at GraphQL.Resolvers.FuncFieldResolver`2.<>c__DisplayClass3_0.<<-ctor>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at GraphQL.Execution.ExecutionStrategy.ExecuteNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 472
   --- End of inner exception stack trace ---
lampersky commented 5 months ago

I've simplified resolve part to the minimum, to exclude loader problems etc, and I'm still facing the issue. It looks like ResolveLockedAsync is not enough to lock querying YesSql.

image

when I changed to ResolveAsync and added semaphore, I can't repro the issue anymore.

image

Other than that, using loader for list part is questionable and could be improved. It looks like all items are fetched and skip/first is applied on materialized collection.

Anyway, it would be good to know your thoughts @sebastienros.

sebastienros commented 5 months ago

I understand how it's fixing all the issues, but that's also wrong, because it's a static semaphore, so this is actually blocking every single call, even from different db sessions. The lock should be per session, or request.

Looking at the current ResolveLockedAsync it seems better, as it's on the context, and the code is identical I believe. Maybe the graphQlContext is actually cloned, that could be checked with some logging.

hyzx86 commented 5 months ago

I wasn't able to repro on sqlite. But will try tomorrow. The reported problem is on SQL Server.

@lampersky ,I seem to be running into this problem, but I can't be sure because when I run into concurrency issues, I do this by adding MultipleActiveResultSets=True; To my connection string, and it was fixed

I was surprised to see your description, I was beginning to think it was just on Sqlite

guanwu commented 4 months ago

I noticed that the ResolveLockedAsync method is not effective here. If you set a breakpoint in the LoadPublishedContentItemsForListAsync method, you can see that they are executed in parallel. I solved this problem by setting a synchronization lock in this method, but I am not sure if this is the best solution.

image

YesSQL (5.0.0) has added an EnableThreadSafetyCheck setting. After setting it to true, it is clear that the LoadPublishedContentitemsForListAsync method is not thread safe.

lampersky commented 4 months ago

@lampersky ,I seem to be running into this problem, but I can't be sure because when I run into concurrency issues, I do this by adding MultipleActiveResultSets=True; To my connection string, and it was fixed

@hyzx86 sorry for late reply, I've tried to reproduce this issue on sqlite using a smaller sample of data, and that is why I wrote that I'm not able to reproduce on sqlite. I thought I will be able prepare some smaller recipe, but unfortunately it is not as easy, my recipe file has almost 20MB. So, it looks like the problem can be only observed when you have lots of items and you are querying all of them.

hyzx86 commented 4 months ago

I probably understand your situation, since your paging size is 300+, it's possible that the next query was requested while the current request was still pending, hence the concurrency issue.

tropcicstefan commented 4 months ago

I think it's finally time to introduce SerialDocumentExecuter. As was discussed few years ago, new versions of graphql introduced serial execution for dataloader meaning we can discard LockedAsyncFieldResolver and solve all problems by executing in serial, as it was in beginning

sebastienros commented 4 months ago

@tropcicstefan right, there is this: https://github.com/graphql-dotnet/graphql-dotnet/blob/master/src/GraphQL/Execution/DefaultExecutionStrategySelector.cs#L43

That the DocumentExecuter uses, and the default is Parallel for queries. So there must be a way top configure it to use Serial for queries, or create a custom ExecutionStrategySelector ans use it.

tropcicstefan commented 4 months ago

a custom ExecutionStrategySelector ans use it.

here

(and there is always scoped scope as an option in resolver)

sebastienros commented 4 months ago

I have the feeling that I already did this some years ago, changing the default execution policy to use serial for queries... So maybe this is not the problem, the Data loader may have a different one.

tropcicstefan commented 4 months ago

I actually think with the changes in gql dotnet we can remove the locks .. and use the serial data execution strat we used too but removed because it didn’t support dataloaders

carlwoodhouse's comment

maybe won't solve the issue but code would be cleaner, without locked customizations