KirillOsenkov / MSBuildStructuredLog

A logger for MSBuild that records a structured representation of executed targets, tasks, property and item values.
MIT License
1.41k stars 188 forks source link

Fix race condition in StringCache #672

Closed antonioaversa closed 1 year ago

antonioaversa commented 1 year ago

Context

Microsoft.Build.Logging.StructuredLogger.StringCache.Intern uses a non-concurrent Dictionary instance, deduplicationMap, to store strings. Accesses to deduplicationMap is currently done in lock statements, in the Intern and Contains methods. However, StringCache.Instances stores and publicly exposes the Keys property of the dictionary. That occasionally results in race conditions, such as the following one:

System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
   at Microsoft.Build.Logging.StructuredLogger.StringCache.Intern(String text) in C:\MSBuildStructuredLog\src\StructuredLogger\Serialization\StringCache.cs:line 69
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(TreeNode parent, IEnumerable`1 properties, IProjectOrEvaluation project) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 1099
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(Folder propertiesFolder, TreeNode project, IEnumerable properties) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 984
   at Microsoft.Build.Logging.StructuredLogger.Construction.<>c__DisplayClass51_0.<UpdateProject>b__0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 890
...

Fix

This PR changes deduplicationMap into a ConcurrentDictionary, using GetOrAdd instead of locks on the object. A similar fix was done for a different dictionary in https://github.com/KirillOsenkov/MSBuildStructuredLog/pull/658.

KirillOsenkov commented 1 year ago

Thanks!

antonioaversa commented 1 year ago

@KirillOsenkov Thanks for reviewing and merging this: that was fast! Would you be able to release a new version of the MSBuildStructuredLog: we are occasionally reproducing this issue in our CI pipeline and would like have it fixed. Thanks!

KirillOsenkov commented 1 year ago

yes will try to get to it today

KirillOsenkov commented 1 year ago

published 2.1.814, thanks again!

KirillOsenkov commented 1 year ago

When you were seeing the race conditions initially, did the build you were using contain the lock statements I added in https://github.com/KirillOsenkov/MSBuildStructuredLog/commit/d1c4eff09277acf4f5afb0aae47180bf077ec442?

KirillOsenkov commented 1 year ago

If possible, if you could debug that race condition under the VS debugger, and when the exception happens, print the call stacks of threads accessing the dictionary.

The Instances property is only used during search, which shouldn't happen during binlog loading time.

KirillOsenkov commented 1 year ago

I found what was wrong here: https://github.com/KirillOsenkov/MSBuildStructuredLog/issues/673#issuecomment-1518466318

It's subtle and not easy to see, so not blaming you at all. We relied on the fact that Dictionary<K,V>.Keys never changes, but for ConcurrentDictionary<K,V>.Keys returns a different read-only collection every time you call it. Makes sense in hindsight.

Let's confirm that using a simple Dictionary with locks fixes your race conditions. I made that commit on March 26: https://github.com/KirillOsenkov/MSBuildStructuredLog/commit/d1c4eff09277acf4f5afb0aae47180bf077ec442

whereas the previous release was on March 13: https://github.com/KirillOsenkov/MSBuildStructuredLog/releases/tag/v2.1.790

So I think what happened is that you saw the race conditions without the lock that I added on March 26.

I will be reluctant to accept another change to ConcurrentDictionary, because now that I think about it, there might be performance implications for both binlog loading time as well as search. We'd need to carefully measure the binlog loading perf for large binlogs for both approaches, as well as search for strings in a large binlog, and see which approach is faster.

antonioaversa commented 1 year ago

When you were seeing the race conditions initially, did the build you were using contain the lock statements I added in d1c4eff?

The stack trace reported here (full version of the stack trace at the end of this message) was generated by an execution in our CI infrastructure, using version 2.1.630, which indeed didn't include d1c4eff.

We have also reproduced the issue with 2.1.790 (latest released version before 2.1.814), on a binlog file for a different project (nodatime). That version didn't include the lock either.

However, I have reproduce a similar race condition locally, on top of 0ab53ca (which does include d1c4eff, with the following test:

[Fact]
public void Intern()
{
    var cache = new StringCache() { DisableDeduplication = false };
    var internerThread = new System.Threading.Thread(() =>
    {
        for (var i = 0L; i < long.MaxValue; i++)
        {
            cache.Intern(i.ToString());
            //System.Threading.Thread.Sleep(10);
        }
    });
    var instanceAccessThread = new System.Threading.Thread(() =>
    {
        for (var i = 0L; i < long.MaxValue; i++)
        {
            System.Threading.Tasks.Parallel.ForEach(cache.Instances, s => _ = s.IndexOf("0"));
            //System.Threading.Thread.Sleep(10);
        }
    });

    internerThread.Start();
    instanceAccessThread.Start();
    internerThread.Join();
    instanceAccessThread.Join();
}

Running this test in debug from Visual Studio produces InvalidOperationException: Collection was modified; enumeration operation may not execute., which is seems to be fixed with a ConcurrentDictionary (running correctly for many iterations of i).

This test code, however, is much simpler and different from the Parallel.ForEach executed in Microsoft.Build.Logging.StructuredLogger.Construction, that is generating the exception detected in our CI.

So I am going to update our CI to use 2.1.815 and report a new issue in case we can still see a race condition happening.


The line in our code, generating the InvalidOperationException is:

var root = BinaryLog.ReadBuild(filePath);

Here is the full stack trace:

Unhandled exception. System.AggregateException: One or more errors occurred. (One or more errors occurred. (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)) (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)))
 ---> System.AggregateException: One or more errors occurred. (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)) (One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.))
 ---> System.AggregateException: One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
   at Microsoft.Build.Logging.StructuredLogger.StringCache.Intern(String text) in C:\MSBuildStructuredLog\src\StructuredLogger\Serialization\StringCache.cs:line 69
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(TreeNode parent, IEnumerable`1 properties, IProjectOrEvaluation project) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 1099
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(Folder propertiesFolder, TreeNode project, IEnumerable properties) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 984
   at Microsoft.Build.Logging.StructuredLogger.Construction.<>c__DisplayClass51_0.<UpdateProject>b__0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 890
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Microsoft.Build.Logging.StructuredLogger.Construction.<>c.<.ctor>b__17_1(Task task) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 79
   at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
   at System.Threading.Tasks.Parallel.PartitionerForEachWorker[TSource,TLocal](Partitioner`1 source, ParallelOptions parallelOptions, Action`1 simpleBody, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
   at System.Threading.Tasks.Parallel.PartitionerForEachWorker[TSource,TLocal](Partitioner`1 source, ParallelOptions parallelOptions, Action`1 simpleBody, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
   at Microsoft.Build.Logging.StructuredLogger.Construction.<.ctor>b__17_0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 76
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
 ---> (Inner Exception #1) System.AggregateException: One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)
 ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindValue(TKey key)
   at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
   at Microsoft.Build.Logging.StructuredLogger.StringCache.Intern(String text) in C:\MSBuildStructuredLog\src\StructuredLogger\Serialization\StringCache.cs:line 69
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(TreeNode parent, IEnumerable`1 properties, IProjectOrEvaluation project) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 1099
   at Microsoft.Build.Logging.StructuredLogger.Construction.AddProperties(Folder propertiesFolder, TreeNode project, IEnumerable properties) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 984
   at Microsoft.Build.Logging.StructuredLogger.Construction.<>c__DisplayClass51_0.<UpdateProject>b__0() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 890
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Microsoft.Build.Logging.StructuredLogger.Construction.<>c.<.ctor>b__17_1(Task task) in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 79
   at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Parallel.<>c__DisplayClass44_0`2.<PartitionerForEachWorker>b__1(IEnumerator& partitionState, Int32 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica`1.ExecuteAction(Boolean& yieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()<---
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Microsoft.Build.Logging.StructuredLogger.Construction.Shutdown() in C:\MSBuildStructuredLog\src\StructuredLogger\Construction\Construction.cs:line 88
   at Microsoft.Build.Logging.StructuredLogger.StructuredLogger.Shutdown() in C:\MSBuildStructuredLog\src\StructuredLogger\StructuredLogger.cs:line 102
   at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(Stream stream, Progress progress, Byte[] projectImportsArchive) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 123
   at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(String filePath, Progress progress) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 41
   at Microsoft.Build.Logging.StructuredLogger.BinaryLog.ReadBuild(String filePath) in C:\MSBuildStructuredLog\src\StructuredLogger\BinaryLog.cs:line 28
   at ParseBuildOutput.LogParsing.LogParser..ctor(String filePath, String projectName) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\LogParsing\LogParser.cs:line 47
   at ParseBuildOutput.Program.Process(Options options) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\Program.cs:line 28
   at CommandLine.ParserResultExtensions.WithParsed[T](ParserResult`1 result, Action`1 action)
   at ParseBuildOutput.Program.Main(String[] args) in C:\Windows\TEMP\cirrus-ci-build\ParseBuildOutput\ParseBuildOutput\Program.cs:line 24
antonioaversa commented 1 year ago

I will be reluctant to accept another change to ConcurrentDictionary, because now that I think about it, there might be performance implications for both binlog loading time as well as search. We'd need to carefully measure the binlog loading perf for large binlogs for both approaches, as well as search for strings in a large binlog, and see which approach is faster.

Indeed. I would say that even a simple lock of the deduplicationMap, as done in d1c4eff, may have already had significant performance implications. Since our update to 2.1.790 in our CI infrastructure, done on April 17, we have noticed a fair amount of timeout when reading binlog files. Timeouts come after 50min-1h, on binlog files smaller than a MB and which are read successfully in seconds, when read locally.

That may be an indication of more multithreading issues: we will collect more data and report if we keep observing the behavior and we gain insight on the possible cause of the issue.

KirillOsenkov commented 1 year ago

Yes, the change to make binlog loading parallel is recent, and I didn't review it very thoroughly. It does seem like it needs to be ironed out.

Agreed that your unit-test reproduces the issue. However in real life the Instances collection is only accessed from two places: from the search function and the find in files function. Both happen after the binlog has loaded. We are not accessing the Instances collection (which gets the Keys) during the binlog loading.

I'd like to know more about the timeouts you're seeing. It may be that it's a separate bug, I'd be happy to help you investigate those.