Closed antonioaversa closed 1 year ago
Thanks!
@KirillOsenkov Thanks for reviewing and merging this: that was fast! Would you be able to release a new version of the MSBuildStructuredLog: we are occasionally reproducing this issue in our CI pipeline and would like have it fixed. Thanks!
yes will try to get to it today
published 2.1.814, thanks again!
When you were seeing the race conditions initially, did the build you were using contain the lock statements I added in https://github.com/KirillOsenkov/MSBuildStructuredLog/commit/d1c4eff09277acf4f5afb0aae47180bf077ec442?
If possible, if you could debug that race condition under the VS debugger, and when the exception happens, print the call stacks of threads accessing the dictionary.
The Instances property is only used during search, which shouldn't happen during binlog loading time.
I found what was wrong here: https://github.com/KirillOsenkov/MSBuildStructuredLog/issues/673#issuecomment-1518466318
It's subtle and not easy to see, so not blaming you at all. We relied on the fact that Dictionary<K,V>.Keys never changes, but for ConcurrentDictionary<K,V>.Keys returns a different read-only collection every time you call it. Makes sense in hindsight.
Let's confirm that using a simple Dictionary with locks fixes your race conditions. I made that commit on March 26: https://github.com/KirillOsenkov/MSBuildStructuredLog/commit/d1c4eff09277acf4f5afb0aae47180bf077ec442
whereas the previous release was on March 13: https://github.com/KirillOsenkov/MSBuildStructuredLog/releases/tag/v2.1.790
So I think what happened is that you saw the race conditions without the lock that I added on March 26.
I will be reluctant to accept another change to ConcurrentDictionary, because now that I think about it, there might be performance implications for both binlog loading time as well as search. We'd need to carefully measure the binlog loading perf for large binlogs for both approaches, as well as search for strings in a large binlog, and see which approach is faster.
When you were seeing the race conditions initially, did the build you were using contain the lock statements I added in d1c4eff?
The stack trace reported here (full version of the stack trace at the end of this message) was generated by an execution in our CI infrastructure, using version 2.1.630, which indeed didn't include d1c4eff.
We have also reproduced the issue with 2.1.790 (latest released version before 2.1.814), on a binlog file for a different project (nodatime). That version didn't include the lock either.
However, I have reproduce a similar race condition locally, on top of 0ab53ca (which does include d1c4eff, with the following test:
[Fact]
public void Intern()
{
var cache = new StringCache() { DisableDeduplication = false };
var internerThread = new System.Threading.Thread(() =>
{
for (var i = 0L; i < long.MaxValue; i++)
{
cache.Intern(i.ToString());
//System.Threading.Thread.Sleep(10);
}
});
var instanceAccessThread = new System.Threading.Thread(() =>
{
for (var i = 0L; i < long.MaxValue; i++)
{
System.Threading.Tasks.Parallel.ForEach(cache.Instances, s => _ = s.IndexOf("0"));
//System.Threading.Thread.Sleep(10);
}
});
internerThread.Start();
instanceAccessThread.Start();
internerThread.Join();
instanceAccessThread.Join();
}
Running this test in debug from Visual Studio produces InvalidOperationException: Collection was modified; enumeration operation may not execute.
, which is seems to be fixed with a ConcurrentDictionary
(running correctly for many iterations of i
).
This test code, however, is much simpler and different from the Parallel.ForEach
executed in Microsoft.Build.Logging.StructuredLogger.Construction
, that is generating the exception detected in our CI.
So I am going to update our CI to use 2.1.815 and report a new issue in case we can still see a race condition happening.
The line in our code, generating the InvalidOperationException
is:
var root = BinaryLog.ReadBuild(filePath);
Here is the full stack trace:
Unhandled exception. System.AggregateException: One or more errors occurred. (One or more errors occurred. (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)) (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)))
---> System.AggregateException: One or more errors occurred. (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)) (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.))
---> System.AggregateException: One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)
---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
at Microsoft.Build.Logging.StructuredLogger.StringCache.Intern(String text) in C:\MSBuildStructuredLog\src\StructuredLogger\Serialization\StringCache.cs:line 69
at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(TreeNode parent, IEnumerable`1 properties, IProjectOrEvaluation project) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 1099
at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(Folder propertiesFolder, TreeNode project, IEnumerable properties) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 984
at Microsoft.Build.Logging.StructuredLogger.Construction.<>c__DisplayClass51_0.<UpdateProject>b__0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 890
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at Microsoft.Build.Logging.StructuredLogger.Construction.<>c.<.ctor>b__17_1(Task task) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 79
at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location ---
at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica.Execute()
--- End of inner exception stack trace ---
at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
at System.Threading.Tasks.Parallel.PartitionerForEachWorker[TSource,TLocal](Partitioner`1 source, ParallelOptions parallelOptions, Action`1 simpleBody, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location ---
at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
at System.Threading.Tasks.Parallel.PartitionerForEachWorker[TSource,TLocal](Partitioner`1 source, ParallelOptions parallelOptions, Action`1 simpleBody, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
at Microsoft.Build.Logging.StructuredLogger.Construction.<.ctor>b__17_0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 76
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
---> (Inner Exception #1) System.AggregateException: One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)
---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
at Microsoft.Build.Logging.StructuredLogger.StringCache.Intern(String text) in C:\MSBuildStructuredLog\src\StructuredLogger\Serialization\StringCache.cs:line 69
at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(TreeNode parent, IEnumerable`1 properties, IProjectOrEvaluation project) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 1099
at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(Folder propertiesFolder, TreeNode project, IEnumerable properties) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 984
at Microsoft.Build.Logging.StructuredLogger.Construction.<>c__DisplayClass51_0.<UpdateProject>b__0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 890
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at Microsoft.Build.Logging.StructuredLogger.Construction.<>c.<.ctor>b__17_1(Task task) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 79
at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location ---
at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
at System.Threading.Tasks.TaskReplicator.Replica.Execute()<---
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at Microsoft.Build.Logging.StructuredLogger.Construction.Shutdown() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 88
at Microsoft.Build.Logging.StructuredLogger.StructuredLogger.Shutdown() in C:\MSBuildStructuredLog\src\StructuredLogger\StructuredLogger.cs:line 102
at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(Stream stream, Progress progress, Byte[] projectImportsArchive) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 123
at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(String filePath, Progress progress) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 41
at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(String filePath) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 28
at ParseBuildOutput.LogParsing.LogParser..ctor(String filePath, String projectName) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\LogParsing\LogParser.cs:line 47
at ParseBuildOutput.Program.Process(Options options) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\Program.cs:line 28
at CommandLine.ParserResultExtensions.WithParsed[T](ParserResult`1 result, Action`1 action)
at ParseBuildOutput.Program.Main(String[] args) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\Program.cs:line 24
I will be reluctant to accept another change to ConcurrentDictionary, because now that I think about it, there might be performance implications for both binlog loading time as well as search. We'd need to carefully measure the binlog loading perf for large binlogs for both approaches, as well as search for strings in a large binlog, and see which approach is faster.
Indeed. I would say that even a simple lock of the deduplicationMap
, as done in d1c4eff, may have already had significant performance implications.
Since our update to 2.1.790 in our CI infrastructure, done on April 17, we have noticed a fair amount of timeout when reading binlog files. Timeouts come after 50min-1h, on binlog files smaller than a MB and which are read successfully in seconds, when read locally.
That may be an indication of more multithreading issues: we will collect more data and report if we keep observing the behavior and we gain insight on the possible cause of the issue.
Yes, the change to make binlog loading parallel is recent, and I didn't review it very thoroughly. It does seem like it needs to be ironed out.
Agreed that your unit-test reproduces the issue. However in real life the Instances collection is only accessed from two places: from the search function and the find in files function. Both happen after the binlog has loaded. We are not accessing the Instances collection (which gets the Keys) during the binlog loading.
I'd like to know more about the timeouts you're seeing. It may be that it's a separate bug, I'd be happy to help you investigate those.
Context
Microsoft.Build.Logging.StructuredLogger.StringCache.Intern
uses a non-concurrentDictionary
instance,deduplicationMap
, to store strings. Accesses todeduplicationMap
is currently done inlock
statements, in theIntern
andContains
methods. However,StringCache.Instances
stores and publicly exposes theKeys
property of the dictionary. That occasionally results in race conditions, such as the following one:Fix
This PR changes
deduplicationMap
into aConcurrentDictionary
, usingGetOrAdd
instead of locks on the object. A similar fix was done for a different dictionary in https://github.com/KirillOsenkov/MSBuildStructuredLog/pull/658.