dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

[ARM] Potential codegen issues on Graviton3 #100705

Closed redknightlois closed 3 months ago

redknightlois commented 5 months ago

Description

We have included Graviton3 chipsets on our testing matrix and found many different tests that error consistently. However, those errors are not triggering when run in the following platforms:

Moreover, when run continuously errors only trigger after 2 conditions:

At first we suspected that we were doing something wrong on our end [its possible we are corrupting the heap somehow] but the deterministic nature of the failure and repeatability after it triggers for the first time makes us suspect it is something more permanent than a heap corruption.

I include several different ways to reproduce the errors.

Reproduction Steps

Download any of the following reproductions:

https://github.com/redknightlois/ravendb/tree/numerical-difference
https://github.com/redknightlois/ravendb/tree/failed-datetime-comparison
https://github.com/redknightlois/ravendb/tree/failed-patching

Run them in order to ensure that optimizations will kick in:

cd test/Tryouts
dotnet run -c Release

Observe the errors after a set amount of loops.

Expected behavior

They should execute the 1000 loops without printing any exception.

Actual behavior

Trigger exception repeatedly after a set amount of loops.

Regression?

No response

Known Workarounds

None

Configuration

.NET SDK:
 Version:           8.0.103
 Commit:            6a90b4b4bc
 Workload version:  8.0.100-manifests.e99a2be4

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  22.04
 OS Platform: Linux
 RID:         ubuntu.22.04-arm64
 Base Path:   /usr/lib/dotnet/sdk/8.0.103/

.NET workloads installed:
 Workload version: 8.0.100-manifests.e99a2be4
There are no installed workloads to display.

Host:
  Version:      8.0.3
  Architecture: arm64
  Commit:       9f4b1f5d66

.NET SDKs installed:
  8.0.103 [/usr/lib/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 7.0.17 [/usr/lib/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.3 [/usr/lib/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 7.0.17 [/usr/lib/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.3 [/usr/lib/dotnet/shared/Microsoft.NETCore.App]

Other information

No response

EgorBo commented 5 months ago

Would be nice to have more details - does it fail on other ARM64 machines, which tests and what is the expected/actual inputs/outputs? Is it floating point related? E.g. we do not guarantee that all floating-point related operations e.g. casts to integers, edge cases in Math.* calls are always 100% the same for the same input on all architectures/platforms. Would be nice to confirm that the issue is indeed on dotnet/runtime side.

redknightlois commented 5 months ago

Some are binding errors at the runtime level which points into heap corruptions. One is showing 2 floating point numbers which is clearly not a floating point issue, more like getting the data from a different memory location.

I only have Graviton3 from AWS on ARM64, it is consistently failing on all of the machines provisioned. All the tests execute properly on Linux and Windows on 32-bits and 64-bit Intel.

That's why I provided 3 different ways to trigger the behavior with completely different error behaviors. But since, they all trigger after a number of passes happen (and the number is consistent across runs) and they all run successfully run under Debug build, it leads into suspect that after a certain optimization is performed, the errors trigger.

ayende commented 5 months ago

Okay, I have a much smaller scope for reproducing this. Here is the relevant code (running as Tryout/Program.cs):

using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http.Json;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using Amazon.S3.Model;
using Tests.Infrastructure;
using Raven.Server.Utils;
using SlowTests.Corax;
using SlowTests.Sharding.Cluster;
using Xunit;
using FastTests.Voron.Util;
using Jint;
using Jint.Native;
using Jint.Native.Function;
using Jint.Native.Object;
using Raven.Server.Config;
using Raven.Server.Documents;
using Raven.Server.Documents.Patch;
using Raven.Server.Documents.Queries.LuceneIntegration;
using Sparrow.Json;

namespace Tryouts;

public static class Program
{
    static Program()
    {
        XunitLogging.RedirectStreams = false;
    }

    public class MyKey : ScriptRunnerCache.Key
    {
        public override void GenerateScript(ScriptRunner runner)
        {
            runner.AddScript("function project(p) { return { Long: p.Long, Double: p.Double }; } ");
        }

        public override bool Equals(object obj)
        {
            return ReferenceEquals(this, obj);
        }

        public override int GetHashCode()
        {
            return RuntimeHelpers.GetHashCode(this);
        }
    }

    public static void Main(string[] args)
    {

        using var ctx = JsonOperationContext.ShortTermSingleUse();

        var ms = new MemoryStream("{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}"u8.ToArray());
        var doc = new Document { Data = ctx.ReadForMemoryAsync(ms, "test").Result };

        var func = Jint.Engine.PrepareScript("function project(p) { return { Long: p.Long, Double: p.Double }; } ", strict: false);
        var engine = new Jint.Engine();
        var jsonStr = (Function)engine.GetValue("JSON").AsObject().Get("stringify");
        engine.Execute(func);

        var method = engine.GetValue("project");
        const string expected = "{\"Long\":9007199254740990,\"Double\":1.7976931348623157E+308}";

        for (int i = 0; i < 100_000; i++)
        {
            var v = method.Call(new BlittableObjectInstance(engine, null, doc.Data, doc));

            var j = JsBlittableBridge.Translate(ctx, engine, (ObjectInstance)v).ToString();
            if (j != expected)
            {
                Console.WriteLine(i + " " + j);
                return;
            }
        }

        Console.WriteLine("Done");
    }
}

To make things easier, what this does is execute this funciton:

function project(p) { return { Long: p.Long, Double: p.Double }; }

Over this input:

{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}

Will sometimes return:

ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
15862 {"Long":9007199254740990.0,"Double":9223372036854775807}
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
18067 {"Long":9007199254740990.0,"Double":9223372036854775807}
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
20129 {"Long":9007199254740990.0,"Double":9223372036854775807}

Note that the number of iterations is different.

I am assuming that this is tier JIT running and a bug happening there somewhere.

I wasn't able to simplify this further, I'm afraid.

When running this with debug mode, it completes successfully each time.

ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c debug
Done
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c debug
Done

Once this happens, it is entirely reproducible.

redknightlois commented 5 months ago

We dont see the error immediately because of tiered compilation, but it fails immediately on the first iteration when disabled.

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

This with a simplifier reproduction that barely needs anything confirms the suspicion that the error is related to code generation on ARM64 platform.

When we disable quick jit even if tiered compilation is on, we also fail on first iteration.

  <PropertyGroup>
    <TieredCompilation>true</TieredCompilation>
     <TieredCompilationQuickJit>false</TieredCompilationQuickJit>
  </PropertyGroup>

However, when we disable only quick jit for loops we revert to fail after a set amount of iterations again.

  <PropertyGroup>
    <TieredCompilation>true</TieredCompilation>
     <TieredCompilationQuickJit>true</TieredCompilationQuickJit>
     <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>
AndyAyersMS commented 4 months ago

@EgorBo can you see if the repro above works for us? If not I can look later this week.

ayende commented 4 months ago

Some more information on this.

A user is reporting that running RavenDB or Mac M2 will crash with:

2024-04-16 13:28:58 assertion failed [block != nullptr]: BasicBlock requested for unrecognized address
2024-04-16 13:28:58 (BuilderBase.h:550 block_for_offset)

When running docker run ravendb/ravendb:5.4.116-ubuntu.22.04-arm64v8

When setting DOTNET_TC_QuickJitForLoops=0, it goes further and fails with:

ravendb exited with code 139
ayende commented 4 months ago

We tried with:

DOTNET_TieredCompilation=0
DOTNET_TC_QuickJit=0
DOTNET_TC_QuickJitForLoops=0

And the user is getting:

ravendb      | Running non-interactive.

ravendb      | assertion failed [block != nullptr]: BasicBlock requested for unrecognized address

ravendb      | (BuilderBase.h:550 block_for_offset)

ravendb      |

ravendb exited with code 133

Is there any way for us to extract additional information here?

EgorBo commented 4 months ago

@EgorBo can you see if the repro above works for us? If not I can look later this week.

I managed to repro it using dotnet run from SDK, but it never reproduces on locally build runtime (either net9 or net8, either Release or Checked) so I am a bit puzzled, the project is quite huge - does it spawn processes internally, does it try to build some managed code dynamically on run?

PS: it doesn't even repro when I copy my locally built net8.0 bits (from release branch) to the .net 8.0 sdk's folder

jkotas commented 4 months ago

assertion failed [block != nullptr]: BasicBlock requested for unrecognized address (BuilderBase.h:550 block_for_offset)

What is the component that contains this assert check? (There is no assert check like this in .NET runtime.)

filipnavara commented 4 months ago

What is the component that contains this assert check?

It's from the Rosetta emulation layer on macOS.

EgorBo commented 4 months ago

PS: My comment about dotnet run was for the problem with the repro printing:

{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}

instead of:

Done

In Release, not for the Rosetta issue which I guess is a separate issue - I was running the repro on Linux-arm64

ayende commented 4 months ago

When running dotnet run for the Tryout project, there shouldn't be any code being generated. This is running a single threaded code and most of the code there is never invoked.

Note that I don't know if both of those issues are related. They are both with ARM64 and the env vars make it looks like it has the same impact, but may be two different items.

AndyAyersMS commented 4 months ago

I take it this doesn't (yet) work on windows arm64?

Starting to run 0
Default HTTP Compression Algorithm: Gzip
Default HTTP Pooled Connection Idle Timeout:
Default HTTP Version Policy:
Max number of concurrent tests is: 4
System.TypeInitializationException: The type initializer for 'Raven.Server.Documents.Operations.OperationsStorage' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'Voron.StorageEnvironment' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'Sparrow.Server.Platform.Pal' threw an exception.
 ---> Sparrow.Utils.IncorrectDllException: librvnpal version might be invalid, missing or not usable on current platform. Initialization error could also be caused by missing 'Microsoft Visual C++ 2015 Redistributable Package' (or newer). It can be downloaded from https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads. Arch: Arm64, OSDesc: Microsoft Windows 10.0.22635
 ---> System.BadImageFormatException: An attempt was made to load a program with an incorrect format. (0x8007000B)
   at System.Runtime.InteropServices.NativeLibrary.LoadLibraryByName(String libraryName, Assembly assembly, Nullable`1 searchPath, Boolean throwOnError)
   at Sparrow.Utils.DynamicNativeLibraryResolver.Resolver(String libraryName, Assembly assembly, Nullable`1 dllImportSearchPath) in C:\repos\ravendb\src\Sparrow\Utils\DynamicNativeLibraryResolver.cs:line 70
   at System.Runtime.InteropServices.NativeLibrary.LoadLibraryCallbackStub(String libraryName, Assembly assembly, Boolean hasDllImportSearchPathFlags, UInt32 dllImportSearchPathFlags)
   at Sparrow.Server.Platform.Pal.rvn_get_pal_ver()
   at Sparrow.Server.Platform.Pal..cctor() in C:\repos\ravendb\src\Sparrow.Server\Platform\Pal.cs:line 24

I'll try it under WSL2.

AndyAyersMS commented 4 months ago

I can repro (I think) on my volterra under WSL2:

Starting to run 169
Starting to run 170
Xunit.Sdk.EqualException: Assert.Equal() Failure: Values differ
Expected: 1.7976931348623157E+308
Actual:   9.2233720368547758E+18
   at Xunit.Assert.Equal[T](T expected, T actual, IEqualityComparer`1 comparer) in /_/src/xunit.assert/Asserts/EqualityAsserts.cs:line 148
   at Xunit.Assert.Equal[T](T expected, T actual) in /_/src/xunit.assert/Asserts/EqualityAsserts.cs:line 83
   at SlowTests.Issues.RavenDB_14036.CanExtractStoredNumberFieldsUsingJsProjection(Options options) in /home/andya/repos/ravendb/test/SlowTests/Issues/RavenDB-14036.cs:line 133
   at Tryouts.Program.Main(String[] args) in /home/andya/repos/ravendb/test/Tryouts/Program.cs:line 49
AndyAyersMS commented 4 months ago

Above was with 8.0.4.

@EgorBo this also no longer repros for me if I drop in a locally built checked libclrjit.so. Am going to see if release does likewise.

It does.

Going to try an SPMI collection and then diff replay using the retail jit and the ones I've built locally.

AndyAyersMS commented 4 months ago

Three methods with diffs, all are

image

Impacted methods are

<WriteNumber>g__GuessNumberType|15_1
WriteNumber
WriteJsonValue

though without assembly info this is probably not enough to figure out where to disable opts (I can retry with a checked runtime and jit I suppose, if you all want to try crafting a workaround).

Looks like this is https://github.com/dotnet/runtime/issues/92201. The fix was ported to 8.0 in #100372 but hasn't made it into a release yet. Looks like it should appear in 8.0.5, due the middle of May.

redknightlois commented 4 months ago

We have a customer and our own test servers impacted by this. Is there any way we could get the pre-release servicing build in order for those systems to be upgraded?

jkotas commented 4 months ago

We do not publish official pre-release servicing builds. The servicing builds contain security fixes that we cannot disclose publicly before the release.

You may want to create your own build using the same docker image that is used for official builds https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/building/coreclr/linux-instructions.md#docker-images ,

JulieLeeMSFT commented 4 months ago

We do not publish official pre-release servicing builds. The servicing builds contain security fixes that we cannot disclose publicly before the release.

You may want to create your own build using the same docker image that is used for official builds https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/building/coreclr/linux-instructions.md#docker-images ,

@redknightlois, please let us know if either your own build or an official servicing build fixes the problem.

redknightlois commented 4 months ago

As soon as the official servicing build is out I will install and check.

redknightlois commented 3 months ago

Confirm 8.0.5 solves the issue.