Open lampersky opened 5 months ago
Following.
Probably not the same fix but same root cause. If there is a query that returns multiple objects (in our case content items), or on object has multiple children to load (content items again) then if GraphQL.NET uses async operations that are not awaited, or if it uses threads (Parallel.ForEach for instance) at that point it's possible that two concurrent accesses to the same session (the one from the scoped DI) will conflict.
The solution is to find from the query how GraphQL.NET is handling these. With the previous version we had changed the default behavior which was to do thinks asynchronously to synchronously, I even filed a bug in their repos to tell them it's shouldn't be a default, too risky. Now it could just be our code that is not awaiting a task (like the fix you linked).
I will try to follow the breadcrumbs from the trace you shared:
GraphQL.Execution.ExecutionStrategy.CompleteDataLoaderNodeAsync
GraphQL.DataLoader.SimpleDataLoader
1.GraphQL.DataLoader.IDataLoaderResult.GetResultAsync
OrchardCore.ContentManagement.GraphQL.DataLoaderExtensions.LoadPublishedContentItemsAsync
GraphQL.DataLoader.CollectionBatchDataLoader.FetchAsync
@lampersky Is your query using content items containing or contained in a ContentPickerField
?
@sebastienros yes
I wasn't able to repro on sqlite. But will try tomorrow. The reported problem is on SQL Server.
Query looks sth like this, it is not the original one. So, there are nested lists, there are content item pickers and also bag parts.
query MyQuery {
event(where: { contentItemId : "40ke33ne4k931544be1hrk57cb" }) {
displayText
contentItemId
organizers {
contentItems(first: 30) {
... on Speaker {
displayText
modifiedUtc
contentType
createdUtc
latest
owner
published
publishedUtc
}
}
}
list {
contentItems(first: 300) {
... on Track {
displayText
contentItemId
moderators {
contentItems(first: 30) {
... on Speaker {
displayText
modifiedUtc
contentType
createdUtc
latest
owner
published
publishedUtc
}
}
}
list {
contentItems(first: 300) {
... on Session {
displayText
modifiedUtc
contentType
createdUtc
latest
owner
published
publishedUtc
test {
contentItems(first: 30) {
... on Speaker {
displayText
contentItemId
}
}
}
speakers {
contentItems(first: 20) {
... on Speaker {
displayText
modifiedUtc
contentType
createdUtc
latest
owner
published
publishedUtc
}
}
}
}
}
}
}
}
}
}
}
@sebastienros this is the smallest possible query, all unnecessary things were removed (ContentPickers, BagParts etc.), so we only have nested list parts
GraphQL.Execution.UnhandledError: Error trying to resolve field 'contentItems'.
---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
at YesSql.Session.Get[T](IList`1 documents, String collection)
at YesSql.Services.DefaultQuery.Query`1.ListImpl()
at YesSql.Services.DefaultQuery.Query`1.ListImpl()
at OrchardCore.Lists.GraphQL.ListQueryObjectType.LoadPublishedContentItemsForListAsync(IEnumerable`1 contentItemIds, ISession session)
at GraphQL.DataLoader.CollectionBatchDataLoader`2.FetchAsync(IEnumerable`1 list, CancellationToken cancellationToken)
at GraphQL.DataLoader.DataLoaderPair`2.GetResultAsync(CancellationToken cancellationToken) in /_/src/GraphQL.DataLoader/DataLoaderPair.cs:line 81
at OrchardCore.Lists.GraphQL.ListQueryObjectType.<>c.<<-ctor>b__0_2>d.MoveNext()
--- End of stack trace from previous location ---
at GraphQL.Resolvers.FuncFieldResolver`2.<>c__DisplayClass3_0.<<-ctor>b__1>d.MoveNext()
--- End of stack trace from previous location ---
at GraphQL.Execution.ExecutionStrategy.ExecuteNodeAsync(ExecutionContext context, ExecutionNode node) in /_/src/GraphQL/Execution/ExecutionStrategy.cs:line 472
--- End of inner exception stack trace ---
I've simplified resolve part to the minimum, to exclude loader problems etc, and I'm still facing the issue. It looks like ResolveLockedAsync is not enough to lock querying YesSql.
when I changed to ResolveAsync and added semaphore, I can't repro the issue anymore.
Other than that, using loader for list part is questionable and could be improved. It looks like all items are fetched and skip/first is applied on materialized collection.
Anyway, it would be good to know your thoughts @sebastienros.
I understand how it's fixing all the issues, but that's also wrong, because it's a static semaphore, so this is actually blocking every single call, even from different db sessions. The lock should be per session, or request.
Looking at the current ResolveLockedAsync it seems better, as it's on the context, and the code is identical I believe. Maybe the graphQlContext is actually cloned, that could be checked with some logging.
I wasn't able to repro on sqlite. But will try tomorrow. The reported problem is on SQL Server.
@lampersky ,I seem to be running into this problem, but I can't be sure because when I run into concurrency issues, I do this by adding MultipleActiveResultSets=True; To my connection string, and it was fixed
I was surprised to see your description, I was beginning to think it was just on Sqlite
I noticed that the ResolveLockedAsync method is not effective here. If you set a breakpoint in the LoadPublishedContentItemsForListAsync method, you can see that they are executed in parallel. I solved this problem by setting a synchronization lock in this method, but I am not sure if this is the best solution.
YesSQL (5.0.0) has added an EnableThreadSafetyCheck setting. After setting it to true, it is clear that the LoadPublishedContentitemsForListAsync method is not thread safe.
@lampersky ,I seem to be running into this problem, but I can't be sure because when I run into concurrency issues, I do this by adding MultipleActiveResultSets=True; To my connection string, and it was fixed
@hyzx86 sorry for late reply, I've tried to reproduce this issue on sqlite using a smaller sample of data, and that is why I wrote that I'm not able to reproduce on sqlite. I thought I will be able prepare some smaller recipe, but unfortunately it is not as easy, my recipe file has almost 20MB. So, it looks like the problem can be only observed when you have lots of items and you are querying all of them.
I probably understand your situation, since your paging size is 300+, it's possible that the next query was requested while the current request was still pending, hence the concurrency issue.
I think it's finally time to introduce SerialDocumentExecuter. As was discussed few years ago, new versions of graphql introduced serial execution for dataloader meaning we can discard LockedAsyncFieldResolver and solve all problems by executing in serial, as it was in beginning
@tropcicstefan right, there is this: https://github.com/graphql-dotnet/graphql-dotnet/blob/master/src/GraphQL/Execution/DefaultExecutionStrategySelector.cs#L43
That the DocumentExecuter uses, and the default is Parallel for queries. So there must be a way top configure it to use Serial for queries, or create a custom ExecutionStrategySelector ans use it.
a custom ExecutionStrategySelector ans use it.
(and there is always scoped scope as an option in resolver)
I have the feeling that I already did this some years ago, changing the default execution policy to use serial for queries... So maybe this is not the problem, the Data loader may have a different one.
I actually think with the changes in gql dotnet we can remove the locks .. and use the serial data execution strat we used too but removed because it didn’t support dataloaders
maybe won't solve the issue but code would be cleaner, without locked customizations
When using GraphiQL editor, sometimes I'm getting this exception, when querying content items with nested lists. It's not consistently repeatable for me (approx once every five queries).
Stacktrace:
@sebastienros I've just found that you fixed similar issue 3 years ago.