Tests failing on Windows

JuliaLinearAlgebra / BLISBLAS.jl

BLIS-pendant of MKL.jl

MIT License

11 stars 4 forks source link

Tests failing on Windows #15

Closed carstenbauer closed 1 month ago

carstenbauer commented 1 month ago

See https://github.com/JuliaLinearAlgebra/BLISBLAS.jl/actions/runs/9282233656

carstenbauer commented 1 month ago

Version 0.9 of the JLL also doesn't seem to work: https://github.com/JuliaLinearAlgebra/BLISBLAS.jl/pull/13#issuecomment-2136802561

jd-foster commented 1 month ago

I can't recall now whether I tested blis_jll v0.9 directly on windows, but v0.8 has worked before. So this is awkward....

The failure on Julia 1.8 is known already - this was a "notorious" bug that required updating GCC 12 itself - https://github.com/JuliaLang/julia/pull/50135 https://github.com/JuliaLinearAlgebra/BLIS.jl/issues/23 Is it necessary to test on Julia 1.8?

Will need to look more closely at the issues in Julia 1.10 and nightly, which are a different type of error, involving the cblas compatibility layer in BLIS, i.e. cdotc_64_ calling bli_cdotv_zen3_ref which may be a fallback function for the CI machine.

eschnett commented 1 month ago

I looked at the header files in blis.v1.0.0.x86_64-w64-mingw32.tar.gz. They contain the lines

#if 0
#define BLIS_ENABLE_COMPLEX_RETURN_INTEL
#else
#define BLIS_DISABLE_COMPLEX_RETURN_INTEL
#endif

indicating that it was configured as expected, using the GFortran complex return convention.

I also looked at the disassembled function zdotc_64_. It has in the beginning the statement

30995451b: 49 89 cc                     movq    %rcx, %r12

which saves the first argument (%rcx) in register %r12. Near the end there are the statements

3099545b9: 4c 89 e0                     movq    %r12, %rax
3099545bc: 41 0f 11 04 24               movups  %xmm0, (%r12)

which copy %r12 to %rax, which holds the return value. That is, this function returns either an integer or a pointer. It also stores two double-precision floating-point values into the address pointed to by %r12.

This looks as if this function was using the Intel Fortran complex return convention.

If that is correct then BLIS would have been miscompiled, either due to of an error in its configuration management or due to an error in our Yggdrasil recipe.

imciner2 commented 1 month ago

Looking at this a bit, I think maybe the Windows x64 calling convention may actually be using the Intel-style convention for returning a complex number, so I think GCC is then actually implicitly converting the given code to do the Intel return format when compiling on Windows.

Looking at the only "official" calling convention document that I can find: https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170, the return values are handled like this:

A scalar return value that can fit into 64 bits, including the m64 type, is returned through RAX. Non-scalar types including floats, doubles, and vector types such as [m128](https://learn.microsoft.com/en-us/cpp/cpp/m128?view=msvc-170), __m128i, __m128d are returned in XMM0. The state of unused bits in the value returned in RAX or XMM0 is undefined.

User-defined types can be returned by value from global functions and static member functions. To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. It must also have no user-defined constructor, destructor, or copy assignment operator. It can have no private or protected non-static data members, and no non-static data members of reference type. It can't have base classes or virtual functions. And, it can only have data members that also meet these requirements. (This definition is essentially the same as a C++03 POD type. Because the definition has changed in the C++11 standard, we don't recommend using std::is_pod for this test.) Otherwise, the caller must allocate memory for the return value and pass a pointer to it as the first argument. The remaining arguments are then shifted one argument to the right. The same pointer must be returned by the callee in RAX.

So it may be that it is treating a complex number as a struct or vector of doubles, so it then uses the pointer return method.

Another reference I found was Raymond Chen's blog here. There are two posts that might be relevant. In https://devblogs.microsoft.com/oldnewthing/20040114-00/?p=41053, he says

The return value is placed in rax. If the return value is larger than 64 bits, then a secret first parameter is passed which contains the address where the return value should be stored.

So since a complex would be greater than 64 bits, then I think it would be the secret first parameter method (aka. what is called Intel here).

There is also https://devblogs.microsoft.com/oldnewthing/20101222-00/?p=11943, where he says

Although the Microsoft C compiler supports a calling convention called fortran, that’s just what the calling convention is called; its relationship with the FORTRAN programming language is only coincidental. The fortran keyword is now just an old-fashioned synonym for __stdcall.

Which sounds to me like this will be the calling convention that is used normally. He then further says:

Functions which return COMPLEX or CHARACTER() are internally rewritten as subroutines where the location to store the return value is passed as a hidden first parameter. (This is analogous to how C returns large structures.)

Unfortunately, I can't find any definitive source that talks about this in more detail, and my spelunking in the GCC source code didn't lead to anything understandable about how the Windows ABI looks like here.

imciner2 commented 1 month ago

Follow-up, looking at the LLVM source code seems to suggest that Windows uses the indirect return for a complex type (e.g. the hidden argument).

Inside the Windows ABI info it has (https://github.com/llvm/llvm-project/blob/46c05dfb6c98870f8416eeb9bf787d54ac806b12/clang/lib/CodeGen/Targets/X86.cpp#L3303)

  if (RT || Ty->isAnyComplexType() || Ty->isMemberPointerType()) {
    // MS x64 ABI requirement: "Any argument that doesn't fit in 8 bytes, or is
    // not 1, 2, 4, or 8 bytes, must be passed by reference."
    if (Width > 64 || !llvm::isPowerOf2_64(Width))
      return getNaturalAlignIndirect(Ty, /*ByVal=*/false);

    // Otherwise, coerce it to a small integer.
    return ABIArgInfo::getDirect(llvm::IntegerType::get(getVMContext(), Width));
  }

And for the complex double, the Width will be 128 - so it will use an indirect reference for the variable.

Does this call work through OpenBLAS? I wonder what calling convention it follows on Windows.

staticfloat commented 1 month ago

I'm using the following test script:

using BLISBLAS, LinearAlgebra

println("Float64:")
a = ones(Float64, 4096) .+ 1im
@show BLAS.dotc(a, a)

println("Float32:")
a = ones(Float32, 4096) .+ 1im
@show BLAS.dotc(a, a)
println("Done!")

On my machine, the Float64 version works just fine, but the Float32 one breaks. If I manually force LBT to use the "normal" complex return style (instead of the "hidden argument" style), OR to forgo CBLAS divergence normalization, everything just works (you can do this easily with this LBT branch). The fact that I get the correct answer both ways is interesting; it looks to me like BLIS has implemented their "argument style" return values in such a way that the correct answer is simultaneously available in both the first argument and the return value. (This can be seen in the zdotc_64_ disassembly that @eschnett posted above; the value is available in both *($rcx) and $rax).

I've spent almost the whole day on this, digging into the assembly, debugging LBT and am currently very confused. To trace out the sequence of events, first Julia calls cblas_cdotc_sub64_ in LBT, which does not immediately forward to BLIS, instead it first goes to our CBLAS shim lbt_cblas_cdotc_sub64_ which then calls LBT's own cdotc_64_. Why does it do this? Because it has determined that BLIS is "CBLAS-divergent" which is a fancy way of saying that the library has properly-mangled Fortran symbols (e.g. cdotc_64_, but does not have properly-mangled CBLAS symbols (e.g. it does not have cblas_cdotc_sub64_, it only has cblas_cdotc_sub). I first came across this in MKL where there are both ILP64 and LP64 symbols in the same library, so we can't just use the non-prefixed names willy-nilly, they might be the wrong interface!

So instead, we re-translate back to the fortran names, which shouldn't cause any issue, we should simply be able to wrap a few arguments in references and be on our way. When we call cdotc_64_ we still don't jump straight to BLIS, because we've also decided that BLIS does argument-style complex returns, so we hit our cmplxret_cdotc_64_ adapter next. Finally, from there, we jump into BLIS at the cdotc_64_ symbol.

jd-foster commented 1 month ago

Thanks for your efforts all! Copying over key information from Slack, which conforms with above:

The call is running through LBT and when calling cdotc, where it’s calling through cmplxret_cdotc_64_. I understand BLIS by default uses the gfortran interface (returning a ComplexF32 value) rather than passing through an argument as the Intel style does, so it’s crashing on this incompatibility. This doesn’t show up for linux or macOS.

Stacktrace from the CI on this issue:

 Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffeeb560590 -- bli_cdotv_zen3_ref at C:\Users\runneradmin\.julia\artifacts\9b9fc411ec291c3a58b3853ecd815fc9beab13ee\bin\libblis.dll (unknown line)
in expression starting at C:\hostedtoolcache\windows\julia\1.10.3\x64\share\julia\stdlib\v1.10\LinearAlgebra\test\blas.jl:33
bli_cdotv_zen3_ref at C:\Users\runneradmin\.julia\artifacts\9b9fc411ec291c3a58b3853ecd815fc9beab13ee\bin\libblis.dll (unknown line)
cdotc_64_ at C:\Users\runneradmin\.julia\artifacts\9b9fc411ec291c3a58b3853ecd815fc9beab13ee\bin\libblis.dll (unknown line)
cmplxret_cdotc_64_ at /workspace/srcdir/libblastrampoline/src\complex_return_style_adapters.c:92
lbt_cblas_cdotc_sub64_ at /workspace/srcdir/libblastrampoline/src\cblas_adapters.c:83
dotc at C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\blas.jl:362 [inlined]
dotc at C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\blas.jl:395

On Windows:

julia> using blis_jll

julia> LinearAlgebra.BLAS.lbt_forward(blis_jll.blis; clear=true, verbose=true, suffix_hint="64_")
Generating forwards to C:\Users\~\.julia\artifacts\9b9fc411ec291c3a58b3853ecd815fc9beab13ee\bin\libblis.dll (clear: 1, verbose: 1, suffix_hint: '64_')
 -> Autodetected symbol suffix "64_"
 -> Autodetected interface ILP64 (64-bit)
 -> Autodetected argument-passing complex return style
 -> Autodetected gfortran calling convention
 -> Autodetected CBLAS-divergent library!
 - [2732] complex(cdotc_64_)
 - [2733] complex(cdotu_64_)
 - [4439] complex(zdotc_64_)
 - [4440] complex(zdotu_64_)
 - [2547] cblas(cblas_cdotc_sub64_)
 - [2549] cblas(cblas_cdotu_sub64_)
 - [2695] cblas(cblas_zdotc_sub64_)
 - [2697] cblas(cblas_zdotu_sub64_)
 - [2588] cblas(cblas_ddot64_)
 - [2655] cblas(cblas_sdot64_)
Processed 4949 symbols; forwarded 167 symbols with 64-bit interface and mangling to a suffix of "64_"
167

This is on macOS (same environment as above):

julia> LinearAlgebra.BLAS.lbt_forward(blis_jll.blis; clear=true, verbose=true, suffix_hint="64_")
Generating forwards to /Users/fos08b/.julia/artifacts/e3f67ac3e3d64cc5c2d61619eb826bc28f3d0a62/lib/libblis.4.0.0.dylib (clear: 1, verbose: 1, suffix_hint: '64_')
 -> Autodetected symbol suffix "64_"
 -> Autodetected interface ILP64 (64-bit)
 -> Autodetected normal complex return style
 -> Autodetected gfortran calling convention
 -> Autodetected CBLAS-divergent library!
 - [2547] cblas(cblas_cdotc_sub64_)
 - [2549] cblas(cblas_cdotu_sub64_)
 - [2695] cblas(cblas_zdotc_sub64_)
 - [2697] cblas(cblas_zdotu_sub64_)
 - [2588] cblas(cblas_ddot64_)
 - [2655] cblas(cblas_sdot64_)
Processed 4949 symbols; forwarded 167 symbols with 64-bit interface and mangling to a suffix of "64_"
167

So there seems to be a difference with the complex autodetection of cdotc, ... zdotu, which are also precisely the functions covered by the BLIS complex-return option in https://github.com/flame/blis/pull/434

jd-foster commented 1 month ago

FYI, this logic in BLIS is implemented from here: https://github.com/flame/blis/blob/5cbec6503de335b3b63fa5d4f388fddd3aff2b61/frame/compat/bla_dot.c#L89-L147 https://github.com/flame/blis/blob/5cbec6503de335b3b63fa5d4f388fddd3aff2b61/frame/include/bli_gentfunc_macro_defs.h#L89-L94 and https://github.com/flame/blis/blob/master/frame/compat/cblas/f77_sub/f77_dot_sub.c

staticfloat commented 1 month ago

So it appears to me that the BLIS folks also implemented cblas_cdotc_sub in terms of cdotc_: https://github.com/flame/blis/blob/5cbec6503de335b3b63fa5d4f388fddd3aff2b61/frame/compat/cblas/f77_sub/f77_dot_sub.c#L46-L60

jd-foster commented 1 month ago

The top question I have is why is autodetection different on windows to linux/mac systems.

staticfloat commented 1 month ago

The top question I have is why is autodetection different on windows to linux/mac systems.

My best guess is that it's a confusion between the ABI matching that I'm trying to do in LBT, and the special corners of the Windows x64 ABI that Ian pointed out above.

Functions which return COMPLEX or CHARACTER() are internally rewritten as subroutines where the location to store the return value is passed as a hidden first parameter. (This is analogous to how C returns large structures.)

I wrote the following test code:

#include <stdio.h>
#include <complex.h>

__attribute__((noinline)) complex double foo(float x) {
        return 1.0 + x*I;
}

int main() {
        volatile float x = 1.0;
        complex double z = foo(x);
        printf("%f + %fi\n", creal(z), cimag(z));
        return 0;
}

I compiled and disassembled it:

$ gcc -O3 -o test.exe test.c
$ objdump.exe --disassemble=main test.exe
0000000140005ef0 <main>:
   140005ef0:   48 83 ec 48             sub    $0x48,%rsp
   140005ef4:   e8 f5 b5 ff ff          call   1400014ee <__main>
   140005ef9:   f3 0f 10 05 17 11 00    movss  0x1117(%rip),%xmm0        # 140007018 <.rdata+0x18>
   140005f00:   00
   140005f01:   48 8d 4c 24 20          lea    0x20(%rsp),%rcx
   140005f06:   f3 0f 11 44 24 3c       movss  %xmm0,0x3c(%rsp)
   140005f0c:   f3 0f 10 4c 24 3c       movss  0x3c(%rsp),%xmm1
   140005f12:   e8 39 b5 ff ff          call   140001450 <foo>
   140005f17:   f2 0f 10 54 24 28       movsd  0x28(%rsp),%xmm2
   140005f1d:   f2 0f 10 4c 24 20       movsd  0x20(%rsp),%xmm1
   140005f23:   48 8d 0d d6 10 00 00    lea    0x10d6(%rip),%rcx        # 140007000 <.rdata>
   140005f2a:   66 49 0f 7e d0          movq   %xmm2,%r8
   140005f2f:   66 48 0f 7e ca          movq   %xmm1,%rdx
   140005f34:   e8 c7 b4 ff ff          call   140001400 <printf.constprop.0>
   140005f39:   31 c0                   xor    %eax,%eax
   140005f3b:   48 83 c4 48             add    $0x48,%rsp
   140005f3f:   c3                      ret

$ objdump.exe --disassemble=foo test.exe
0000000140001450 <foo>:
   140001450:   66 0f ef c0             pxor   %xmm0,%xmm0
   140001454:   f3 0f 59 c1             mulss  %xmm1,%xmm0
   140001458:   48 89 c8                mov    %rcx,%rax
   14000145b:   f3 0f 5a c9             cvtss2sd %xmm1,%xmm1
   14000145f:   f3 0f 5a c0             cvtss2sd %xmm0,%xmm0
   140001463:   f2 0f 58 05 a5 5b 00    addsd  0x5ba5(%rip),%xmm0        # 140007010 <.rdata+0x10>
   14000146a:   00
   14000146b:   66 0f 14 c1             unpcklpd %xmm1,%xmm0
   14000146f:   0f 11 01                movups %xmm0,(%rcx)
   140001472:   c3                      ret

It looks to me like we're both passing on the stack, and in registers, both to foo and back to main. Additionally, if I use complex float instead of complex double, we only pass through the registers.

staticfloat commented 1 month ago

The fact that LBT works for complex double and not for complex float is really interesting. My current thought is that I am applying an inappropriate workaround for this complex return style on a complex float function when I don't actually need to. In the morning, I will check the ABIs of Linux and macOS and whatnot and see if the rules there are different; if it turns out that complex float can be register-passed only on Windows, that would explain this discrepancy.

imciner2 commented 1 month ago

Because it has determined that BLIS is "CBLAS-divergent" which is a fancy way of saying that the library has properly-mangled Fortran symbols (e.g. cdotc64, but does not have properly-mangled CBLAS symbols (e.g. it does not have cblas_cdotcsub64, it only has cblas_cdotc_sub).

Hmm, yes, we don't appear to be patching BLIS to change any of the CBLAS calls to have a 64 suffix. Our patch labelled "cblas" in Yggdrasil is actually patching the calls that CBLAS is making to normal blas to have the suffix. This would be annoying to do though because all the CBLAS functions are just straight definitions and aren't formed using any macros to build the names.

it looks to me like BLIS has implemented their "argument style" return values in such a way that the correct answer is simultaneously available in both the first argument and the return value.

I get the feeling this is a quirk of the compiler, actually. When actually doing the indirect return method, BLIS has a void return type.

My current thought is that I am applying an inappropriate workaround for this complex return style on a complex float function when I don't actually need to.

My reading of the Windows x86_64 ABI suggests that complex float won't be an indirect return, because it is 64 bits (the condition in the LLVM compiler for generating this indirect return has the strict greater-than check for 64 bit width). So I think that it should treat it as a normal return instead.

I will check the ABIs of Linux and macOS and whatnot and see if the rules there are different; if it turns out that complex float can be register-passed only on Windows, that would explain this discrepancy.

At least for the Linux ABI, it appears it will be returned in registers for both float and double. LLVM implements the classification for complex numbers here: https://github.com/llvm/llvm-project/blob/46c05dfb6c98870f8416eeb9bf787d54ac806b12/clang/lib/CodeGen/Targets/X86.cpp#L1942, and it says it uses SSE style behavior for both float and double element types. The SSE style behavior will use the vector registers to hold the data. (The formal specification of this is in https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf, Section 3.2.3).

So it looks like the difference is probably that Windows mixes the two return styles (double uses implicit arguments, float uses registers), whereas Linux uses registers for both. Does LBT do the complex return classification on a per-function basis, or is it done globally for the library a single time?

staticfloat commented 1 month ago

Alright, I've got a PR up to LBT that solves this by adding a new complex return style, one that will create adapters for only complex double-returning functions (like zdotc) but not for complex float-returning functions (like cdotc). My test code (and the BLISBLAS.jl test suite) pass when using it, but you need a small patch to Julia to make LinearAlgebra happy with the new LBT info code.

This should all appear in Julia v1.12, and is potentially backportable as there's few changes to Julia itself, although I did touch quite a bit of LBT, so we'll have to wait and see if those changes are all good.

staticfloat commented 1 month ago

Once https://github.com/JuliaLang/julia/pull/54791 is merged, julia nightly should "just work" here.

staticfloat commented 1 month ago

Just to close the loop, I re-ran the tests on main and can confirm that while Julia v1.10 still segfaults, the latest nightly does not.