dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

arm64: load 64bit constants from data section instead of 4 movz/k instructions #109428

Closed. EgorBo closed this 14 hours ago.

EgorBo commented 1 day ago

Experiment: I am just curious about the size wins and the performance impact.

static long Test() => 0xBAA293549432543L;
-movz    x0, #0x2543
-movk    x0, #0x4943 LSL #16
-movk    x0, #0x2935 LSL #32
-movk    x0, #0xBAA LSL #48
+ldr     x0, [@RWD00]
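
For reference, here is a minimal sketch (plain C#, illustration only, not JIT code) of how the 64-bit constant above splits into the four 16-bit immediates that the movz/movk sequence materializes:

using System;

const ulong value = 0x0BAA293549432543UL;

// Print the movz/movk sequence for the constant: each instruction carries a
// 16-bit chunk of the value and a shift of 0, 16, 32, or 48 bits.
for (int shift = 0; shift < 64; shift += 16)
{
    ushort chunk = (ushort)(value >> shift);
    string insn = shift == 0 ? "movz" : "movk";
    string lsl = shift == 0 ? "" : $" LSL #{shift}";
    Console.WriteLine($"{insn}    x0, #0x{chunk:X4}{lsl}");
}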

SPMI diffs:

[Screenshot: SPMI diff summary, 2024-11-01]

Note that the diffs don't take the data section into account (8 bytes per constant plus potential alignment padding, though a single slot can be shared by every use of that constant within a method), so on average this is still a size win.
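
As a rough back-of-the-envelope check (my own assumed numbers: 4-byte instructions, one shared 8-byte data slot per distinct constant, alignment padding ignored), total code + data bytes compare as follows:

using System;

const int instrSize = 4;   // every arm64 instruction is 4 bytes
const int dataSlot = 8;    // assumed 8-byte data-section slot per distinct constant

for (int uses = 1; uses <= 3; uses++)
{
    int movForm = uses * 4 * instrSize;          // movz + 3x movk at every use
    int ldrForm = uses * instrSize + dataSlot;   // one ldr per use + the shared slot
    Console.WriteLine($"{uses} use(s): movz/movk = {movForm} bytes, ldr + data = {ldrForm} bytes");
}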

Unfortunately, it doesn't make much sense for R2R/NativeAOT, since those mostly use relocatable constants and rarely need raw 64-bit constants.

dotnet-policy-service[bot] commented 1 day ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

EgorBo commented 1 day ago

@EgorBot -arm64 -profiler

using BenchmarkDotNet.Attributes;

public class Bencha
{
    static object obj = new MyClass();

    [Benchmark]
    public void Bench()
    {
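        // Presumably the interesting codegen is the type check: under the JIT,
        // the MyClass type handle is a raw 64-bit constant, so it is built with
        // movz/movk today and loaded with a single ldr under this PR.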
        if (obj is MyClass myClass)
            myClass.DoWork();
    }
}

public class MyClass {
    public virtual void DoWork() {}
}

kunalspathak commented 14 hours ago

@EgorBot -arm64 -profiler

So it seems this is 2x slower, looking at the benchmark results?

tannergooding commented 14 hours ago

So it seems this is 2x slower, looking at the benchmark results?

I imagine this is too small to actually be measured by BDN and is likely largely dependent on the hardware and surrounding code. We should probably loop in the folks at ARM for official guidance (cc. @TamarChristinaArm).

My guess is that some hardware will "fuse" neighboring movz/movk into a constant on the backend, while others will actually incur construction cost. On all hardware it will likely impact decoding bandwidth.

Conversely, loading from method-local memory has its own downsides, since that memory page is marked "executable". Ignoring that downside, however, the constant will likely be in the L1 data cache and incur roughly a 4-cycle load latency, unless the hardware has an optimization to cache such recent loads in spare registers from the register file (as some x64 chips do).

My guess is it's mostly a wash, and the right choice comes down to whether we're optimizing for size.

EgorBo commented 14 hours ago

@EgorBot -arm64 -profiler

using BenchmarkDotNet.Attributes;

public class Bencha
{
    static object obj = new MyClass();

    [Benchmark]
    public void Bench()
    {
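        // Same check repeated six times: presumably this amplifies the cost of
        // materializing the type-handle constant, and it also lets the ldr form
        // reuse one shared data-section slot across all six sites.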
        if (obj is MyClass myClass1)
            myClass1.DoWork();
        if (obj is MyClass myClass2)
            myClass2.DoWork();
        if (obj is MyClass myClass3)
            myClass3.DoWork();
        if (obj is MyClass myClass4)
            myClass4.DoWork();
        if (obj is MyClass myClass5)
            myClass5.DoWork();
        if (obj is MyClass myClass6)
            myClass6.DoWork();
    }
}

public class MyClass {
    public virtual void DoWork() {}
}

EgorBo commented 14 hours ago

Yeah, I am not planning to go further with this PR, but it might be interesting to see how the performance differs. E.g., for jump stubs we use ldr instead of movz/movk.