dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
19.02k stars 4.03k forks source link

Crash on arm/arm64 with mono due to missing memory barriers #24932

Closed BrzVlad closed 2 years ago

BrzVlad commented 6 years ago

Version Used:

2.6.0. Doesn't matter

Steps to Reproduce:

I've seen random roslyn crashes / compilation errors with mono ever since switching to it for building the BCL (on arm/arm64 linuxes) For reproduction you just need to build mono, but the bug is extremely evasive. I have reliably reproduced it on a scaleway.com arm64 vm and also on a compulab utilite2. I can provide access to a machine where it should be easy to reproduce.

Expected Behavior:

Don't crash

Issue:

As far as I've investigated the crashes and my small understanding of object pools, it looks like missing memory barriers here https://github.com/dotnet/roslyn/blob/master/src/Dependencies/PooledObjects/ObjectPool%601.cs#L194

The thread that owns the object and puts it back for reuse doesn't use any synchronisation primitives, so the thread that will reuse this freed object is not guaranteed to see the correct state of the freed object.

marek-safar commented 6 years ago

@jaredpar who owns this code?

jaredpar commented 6 years ago

@marek-safar this is owned by the compiler. Will let @gafter, @tmat, @VSadov comment on this.

akoeplinger commented 6 years ago

Making _firstItem volatile fixes the crash (confirmed on the Scaleway Cavium ThunderX box):

-        private T _firstItem;
+        private volatile T _firstItem;
VSadov commented 6 years ago

The reusing thread will have to do Interlocked.CompareExchange to check out the instance. That is a full fence. If the pool did not syncronize, it would be broken on any platform.

Volatile may effect timing. It could get pretty expensive on ARM.

My guess, it is a race somewhere else and making things slower “fixes” them.

VSadov commented 6 years ago

Reference write should be a “release” anyways. Interlocked is used here because of object uniqueness.

If reference writes to heap are not releases, a lot of things would be broken.

Roslyn assumes CLR2.0 memory model. - at least tries to. Since we test on x86/64 only, we cant be sure it there are no accidental strong ordering dependencies.

It could be one of those.

Or just a race which we do not see on other platforms due to timing.

BrzVlad commented 6 years ago

@vsadov It doesn't matter if the reusing thread does a CAS and adds a memory barrier, since the memory barrier orders only on the calling thread. The stores on the thread that released the object can still be committed in any order.

VSadov commented 6 years ago

One of the best explanation of CLR2.0 memory ordering that I know is http://joeduffyblog.com/2007/11/10/clr-20-memory-model/

It does not mention threads. In that model a fence is a fence and effects are global.

It does, however, mention that the description is a practical rationalization of the actual implementation and it is unclear whether Mono provides the same guarantees. So there is a chance that we are observing platform/JIT differences.

VSadov commented 6 years ago

It still seems more likely that we just have an ordinary race somewhere...

xoofx commented 5 years ago

FYI, just to bump this, related to a compilation problem I have when running mono msbuild and Roslyn on an Raspberry ARM64, mono+Roslyn is giving the following exception:

     System.NullReferenceException: Object reference not set to an instance of an object
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ComputeStrongNameKeys () <0xffff85aae7f8 + 0x000e8> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.get_StrongNameKeys () <0xffff85aae994 + 0x00087> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ValidateAttributeSemantics (Microsoft.CodeAnalysis.DiagnosticBag diagnostics) <0xffff85aaec40 + 0x00023> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0
          at Microsoft.CodeAnalysis.CSharp.Symbols.SourceAssemblySymbol.ForceComplete (Microsoft.CodeAnalysis.SourceLocation locationOpt, System.Threading.CancellationToken cancellationToken) <0xffff85aafcf0 + 0x000e7> in <b7d368a2667f4dca818a47370ae17a21#b0cfdf5656c3c230b3b1cf9038b91924>:0

Which seems to be related to this issue https://bugzilla.xamarin.com/show_bug.cgi?id=56546#c2

I don't see how to pass the /parallel- switch through msbuild Csc task, if anyone has a clue.

For what is worth, dotnet 3.0 is running fine with csc, but the fact is that it is running super slow on ARM64, so the concurrency bug might not be visible with it

VSadov commented 5 years ago

FWIW - the only reasonably testable memory model at the time was CLR2.0, so Roslyn targets that. It was an explicit decision. (discussed in meetings and such). In particular reference stores are assumed to have release semantics (as in std::memory_order_release).

Roslyn has a lot of shared state - basically embraces it. You will see variations of the following pattern all over the code base:

// naked store!
//  `stlr` is assumed  (or `dmb ish; str`, on v7)
maybe_read_in_other_threads = new Something();

If the store-release assumption above is weakened, a good deal of codebase may need to be reviewed and carefully volatile-ized. It should be doable, but could be tedious.

CoffeeFlux commented 5 years ago

Pinging this thread again as this is still a pain point for Mono CI. Most of the random CI failures I get are from CodeAnalysis on ARM, and so I end up re-running lanes a whole bunch as a result. I believe we're working on a hack on our end to minimize the issue, but wanted to see if there's any chance of this being addressed upstream.

jaredpar commented 5 years ago

The Roslyn compiler, and the CoreFX libraries we depend on, are designed for the CLR memory model. Moving it to support the Mono / ECMA memory model is a non-trivial task. It's on the order of one to two months to get it done. This is not to say we should or should not do it, just that the work is much more significant than this thread makes it out to be. It's the equivalent of a medium sized language feature.

Stepping back though I'm not sure changing Roslyn is the right approach here given our .NET 5 plans. The idea is to make it so customers can use Mono as a runtime for their .NET applications. Are we suggesting that all customers need to make the same memory model adjustments as Roslyn is being asked to do? Or should we be looking to get Mono in line with the CLR memory model here?

selalerer commented 4 years ago

I am not sure it is related to this issue but we try to use CSharpCompilation.Emit() on ARM64 and sometimes our process crash. Once at least it outputted this error:

error CS8113: Invalid hash algorithm name: 'System.Byte[]'

The same code works consistently on Windows.

BrzVlad commented 2 years ago

Doesn't look like this issue should be open. We ended up adding barriers on the runtime side on mono.