dotnet / msbuild

The Microsoft Build Engine (MSBuild) is the build platform for .NET and Visual Studio.
https://docs.microsoft.com/visualstudio/msbuild/msbuild
MIT License
5.21k stars 1.35k forks source link

SdkResolverService throws "System.InvalidOperationException: Collection was modified" occasionally #7927

Open dfederm opened 2 years ago

dfederm commented 2 years ago

I'm occasionally seeing the following exception:

System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.GetResolvers(IList`1 resolversManifests, LoggingContext loggingContext, ElementLocation sdkReferenceLocation)
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdkUsingResolversWithPatternsFirst(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio)
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio)
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.<>n__0(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio)
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.<>c__DisplayClass3_0.<ResolveSdk>b__2()
   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
--- End of stack trace from previous location ---
   at System.Lazy`1.CreateValue()
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio)
   at Microsoft.Build.Evaluation.Evaluator`4.ExpandAndLoadImportsFromUnescapedImportExpressionConditioned(String directoryOfImportingFile, ProjectImportElement importElement, List`1& projects, SdkResult& sdkResult, Boolean throwOnFileNotExistsError)
   at Microsoft.Build.Evaluation.Evaluator`4.ExpandAndLoadImports(String directoryOfImportingFile, ProjectImportElement importElement, SdkResult& sdkResult)
   at Microsoft.Build.Evaluation.Evaluator`4.EvaluateImportElement(String directoryOfImportingFile, ProjectImportElement importElement)
   at Microsoft.Build.Evaluation.Evaluator`4.PerformDepthFirstPass(ProjectRootElement currentProjectOrImport)
   at Microsoft.Build.Evaluation.Evaluator`4.EvaluateImportElement(String directoryOfImportingFile, ProjectImportElement importElement)
   at Microsoft.Build.Evaluation.Evaluator`4.PerformDepthFirstPass(ProjectRootElement currentProjectOrImport)
   at Microsoft.Build.Evaluation.Evaluator`4.Evaluate()
   at Microsoft.Build.Evaluation.Project.ProjectImpl.Reevaluate(ILoggingService loggingServiceForEvaluation, ProjectLoadSettings loadSettings, EvaluationContext evaluationContext)
   at Microsoft.Build.Evaluation.Project.ProjectImpl.ReevaluateIfNecessary(ILoggingService loggingServiceForEvaluation, ProjectLoadSettings loadSettings, EvaluationContext evaluationContext)
   at Microsoft.Build.Evaluation.Project.ProjectImpl.ReevaluateIfNecessary(ILoggingService loggingServiceForEvaluation, EvaluationContext evaluationContext)
   at Microsoft.Build.Evaluation.Project.ProjectImpl.ReevaluateIfNecessary(EvaluationContext evaluationContext)
   at Microsoft.Build.Evaluation.Project.ProjectImpl.Initialize(IDictionary`2 globalProperties, String toolsVersion, String subToolsetVersion, ProjectLoadSettings loadSettings, EvaluationContext evaluationContext)
   at Microsoft.Build.Evaluation.Project..ctor(ProjectRootElement xml, IDictionary`2 globalProperties, String toolsVersion, String subToolsetVersion, ProjectCollection projectCollection, ProjectLoadSettings loadSettings, EvaluationContext evaluationContext, IDirectoryCacheFactory directoryCacheFactory)
   at Microsoft.Build.Evaluation.Project.FromProjectRootElement(ProjectRootElement rootElement, ProjectOptions options)

Seems like there is a race condition in SdkResolverService.

MSBuild version: 17.4.0-preview-22416-02 (pretty close to head of main as of this writing)

dfederm commented 2 years ago

FYI @dsplaisted

dfederm commented 2 years ago

My guess based on inspection the code is that this is the race:

  1. Multiple threads get to ResolveSdkUsingResolversWithPatternsFirst at the same time.
  2. Both thread 1 and thread 2 see _generalResolversManifestsRegistry is null and enter RegisterResolversManifests
  3. Thread 1 populates _generalResolversManifestsRegistry inside RegisterResolversManifests. Note that this is under a lock.
  4. Thread 1 returns from RegisterResolversManifests
  5. Thread 2 obtains the lock and creates _generalResolversManifestsRegistry again and starts to add to it.
  6. Thread 1 enteres the 2nd GetResolvers call which passes _generalResolversManifestsRegistry. Note! This is not under a lock!
  7. Thread 1 in GetResolvers starts to iterate the list passed in (_generalResolversManifestsRegistry)
  8. Thread 2 adds to _generalResolversManifestsRegistry
  9. Thread 1 continues to iterate the original _generalResolversManifestsRegistry and throws.

So ultimately, the bug is that _generalResolversManifestsRegistry is only locked when writing, but not when reading.

dfederm commented 2 years ago

Code with the bug was introduced in #7597 by @AR-May

AR-May commented 2 years ago

@dfederm thank you for reporting the bug and for analysis. The scenario above should not be possible. The _generalResolversManifestsRegistry is double-checked on creation: the RegisterResolversManifests has a check after a lock to prevent the second creation. So, thread 2 should just return from the function without rewriting the collection. After creation this collection was not supposed to be modified, so I thought it should be ok not to lock it on reading, since it happens only after a creation. I would dig into this.

japj commented 6 months ago

I saw this on MSBuild version = "17.8.3+195e7f5a3" today

error MSB4014: The build stopped unexpectedly because of an internal failure.
System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdkUsingResolversWithPatternsFirst(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.<>c__DisplayClass3_0.<ResolveSdk>b__2()
   at System.Lazy`1.CreateValue()
   at System.Lazy`1.LazyInitValue()
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.MainNodeSdkResolverService.PacketReceived(Int32 node, INodePacket packet)

and

error MSB4014: The build stopped unexpectedly because of an internal failure.
System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdkUsingResolversWithPatternsFirst(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.SdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.<>c__DisplayClass3_0.<ResolveSdk>b__2()
   at System.Lazy`1.CreateValue()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Lazy`1.get_Value()
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.MainNodeSdkResolverService.PacketReceived(Int32 node, INodePacket packet)
   at System.Lazy`1.get_Value()
   at Microsoft.Build.BackEnd.SdkResolution.CachingSdkResolverService.ResolveSdk(Int32 submissionId, SdkReference sdk, LoggingContext loggingContext, ElementLocation sdkReferenceLocation, String solutionPath, String projectPath, Boolean interactive, Boolean isRunningInVisualStudio, Boolean failOnUnresolvedSdk)
   at Microsoft.Build.BackEnd.SdkResolution.MainNodeSdkResolverService.PacketReceived(Int32 node, INodePacket packet)
KalleOlaviNiemitalo commented 6 months ago

From the source code, it looks like this could happen:

  1. on thread A, ResolveSdkUsingResolversWithPatternsFirst checks_generalResolversManifestsRegistry == null, gets true, and calls RegisterResolversManifests
  2. on thread A, RegisterResolversManifests locks _lockObject
  3. on thread A, RegisterResolversManifests checks _generalResolversManifestsRegistry != null, gets false, and proceeds
  4. on thread A, RegisterResolversManifests does other stuff and then assigns _generalResolversManifestsRegistry = new List();
  5. on thread B, ResolveSdkUsingResolversWithPatternsFirst checks_generalResolversManifestsRegistry == null, and does not call RegisterResolversManifests, and does not lock _lockObject
  6. on thread B, ResolveSdkUsingResolversWithPatternsFirst does other stuff and then calls GetResolvers(_generalResolversManifestsRegistry, loggingContext, sdkReferenceLocation)
  7. on thread B, GetResolvers enters the foreach (var resolverManifest in resolversManifests) loop and gets an enumerator
  8. on thread A, RegisterResolversManifests calls _generalResolversManifestsRegistry.Add(manifest), which invalidates the enumerator
  9. on thread B, GetResolvers attempts to use the enumerator, which is now invalid and throws

In this scenario, the problem is that, although RegisterResolversManifests locks _lockObject to prevent other threads from initializing the manifest registries in parallel, it assigns the list to _generalResolversManifestsRegistry before it has finished adding the manifests to it. Other threads can then read the list reference from _generalResolversManifestsRegistry, assume that the list will no longer be modified, and attempt to enumerate it. To fix this, RegisterResolversManifests should store the lists to local variables first, populate them there, and assign to fields only just before unlocking _lockObject. I'm not sure whether these assignments would need to be volatile according to the memory model.

japj commented 5 months ago

just a fyi, I ran into this again on a build today

MaxClaessen commented 1 month ago

we have seen this issue a few times as well in our builds