OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.43k stars 1.51k forks source link

Support building with flang on windows #4768

Open h-vetinari opened 5 months ago

h-vetinari commented 5 months ago

OpenBLAS already added flang support, but I don't think this is being tested on windows? While reviving the old effort to build conda-forge's openblas with flang, I originally ran into some parsing issue with flang 18.

Luckily, with a flang 19 built from main (already built for debugging something else, so I thought I'd try), it seems that particular issue is gone. 🥳

However, I first encountered some CMake detection issues:

-- Found OpenMP_C: -Xclang -fopenmp (found version "5.1")
CMake Error at D:/bld/openblas_1719527807775/_build_env/Library/share/cmake-3.29/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
-- Configuring incomplete, errors occurred!
  Could NOT find OpenMP_Fortran (missing: OpenMP_Fortran_FLAGS
  OpenMP_Fortran_LIB_NAMES)

After iteratively figuring out (also re-encountering #3069 again along the way) that I needed to add (something like)

    -DOpenMP_Fortran_FLAGS=-fopenmp ^
    -DOpenMP_Fortran_LIB_NAMES=libomp ^
    -DOpenMP_libomp_LIBRARY=-llibomp ^
    -DOpenMP_C_FLAGS=-fopenmp ^
    -DOpenMP_C_LIB_NAMES=libomp ^

I then ran into what looks like a regular compilation error:

[...]
[3643/19184] Building Fortran object CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\stplqt.f.obj
[3644/19184] Building Fortran object CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\stplqt2.f.obj
[3645/19184] Building Fortran object CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\stpmlqt.f.obj
[3646/19184] Building Fortran object CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\ssytrd_2stage.f.obj
[3647/19184] Building Fortran object CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\ssytrd_sb2st.F.obj
FAILED: CMakeFiles/LAPACK_OVERRIDES.dir/lapack-netlib/SRC/ssytrd_sb2st.F.obj 
%BUILD_PREFIX%\Library\bin\flang.exe -I%SRC_DIR%\lapack-netlib\SRC -I%SRC_DIR%\lapack-netlib\LAPACKE\include -fopenmp -fopenmp -ffixed-line-length-72 -o CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\ssytrd_sb2st.F.obj -c CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\ssytrd_sb2st.F-pp.f
error: Semantic errors in CMakeFiles\LAPACK_OVERRIDES.dir\lapack-netlib\SRC\ssytrd_sb2st.F-pp.f
D:\\bld\\openblas_1719536014536\\work\\lapack-netlib\\SRC\\ssytrd_sb2st.F:237:11: error: Cannot read module file for module 'omp_lib': Source file 'omp_lib.mod' was not found
        use omp_lib
            ^^^^^^^
martin-frbg commented 5 months ago

right, flang on Windows appears to be lagging behind the Linux/Unix version. If there is no omp_lib module provided by LLVM, I suggest you open an issue with them (unless already known/documented). This may also be the reason why you needed to set a bunch of cmake variables manually

martin-frbg commented 5 months ago

but ISTR broken OpenMP support in LLVM on Windows is a known problem, and your "flang 19" is an unstable snapshot

h-vetinari commented 5 months ago

but ISTR broken OpenMP support in LLVM on Windows is a known problem

Do you have a link?

and your "flang 19" is an unstable snapshot

Yes, 19.1.0rc1 is only expected in about a month. The problem was that all flang 18 builds were broken for SciPy, and I wanted to test/ensure that things work with flang 19 (early enough for potentially necessary fixes to land), hence why I built from main, as noted in the OP.

martin-frbg commented 5 months ago

somewhere in the discussion in #3973 I think, but it could be that it was broken once, worked for a while and is now broken again.

h-vetinari commented 5 months ago

It seems upstream intends to support it. I'm trying to rebuild as necessary to test that hypothesis.

Meanwhile, I've had one passing run without OpenMP, however, the logs got spammed so badly with warnings (~500MB), that I cannot really check them. After running again with warnings ignored, I get:

97% tests passed, 4 tests failed out of 120
Errors while running CTest
Output from these tests are in: D:/bld/openblas_1719572917500/work/build/Testing/Temporary/LastTest.log

Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Total Test time (real) = 9104.76 sec

The following tests FAILED:
      5 - sblas3 (Failed)
      8 - dblas3 (Failed)
     11 - cblas3 (Failed)
     15 - zblas3 (Failed)

The runtime especially of complex-valued procedures is off the charts. That's also the part that depends on LLVM's compiler-rt (instead of MSVC's runtime); not sure if that plays a role somehow. An excerpt:

 14/120 Test  #15: zblas3 ......................................................................................***Failed    5.13 sec
        Start  16: zblas3_3m
 15/120 Test  #16: zblas3_3m ...................................................................................   Passed    0.75 sec
        Start  17: REAL_LAPACK_linear_equation_routines
 16/120 Test  #14: zblas2 ......................................................................................   Passed  202.79 sec
        Start  18: COMPLEX_LAPACK_linear_equation_routines
 17/120 Test  #17: REAL_LAPACK_linear_equation_routines ........................................................   Passed  2937.14 sec
        Start  19: DOUBLE_PRECISION_LAPACK_linear_equation_routines
 18/120 Test  #18: COMPLEX_LAPACK_linear_equation_routines .....................................................   Passed  4390.90 sec
        Start  20: COMPLEX16_LAPACK_linear_equation_routines
 19/120 Test  #19: DOUBLE_PRECISION_LAPACK_linear_equation_routines ............................................   Passed  3293.06 sec
        Start  21: SINGLE-DOUBLE_PRECISION_LAPACK_prototype_linear_equation_routines
 20/120 Test  #21: SINGLE-DOUBLE_PRECISION_LAPACK_prototype_linear_equation_routines ...........................   Passed  178.72 sec
        Start  22: Testing_COMPLEX-COMPLEX16_LAPACK_prototype_linear_equation_routines
 21/120 Test  #22: Testing_COMPLEX-COMPLEX16_LAPACK_prototype_linear_equation_routines .........................   Passed  793.80 sec
        Start  23: Testing_REAL_LAPACK_RFP_prototype_linear_equation_routines
 22/120 Test  #23: Testing_REAL_LAPACK_RFP_prototype_linear_equation_routines ..................................   Passed   57.25 sec
        Start  24: Testing_DOUBLE_PRECISION_LAPACK_RFP_prototype_linear_equation_routines
 23/120 Test  #24: Testing_DOUBLE_PRECISION_LAPACK_RFP_prototype_linear_equation_routines ......................   Passed  115.00 sec
        Start  25: Testing_COMPLEX_LAPACK_RFP_prototype_linear_equation_routines
 24/120 Test  #25: Testing_COMPLEX_LAPACK_RFP_prototype_linear_equation_routines ...............................   Passed  338.11 sec
        Start  26: Testing_COMPLEX16_LAPACK_RFP_prototype_linear_equation_routines
 25/120 Test  #26: Testing_COMPLEX16_LAPACK_RFP_prototype_linear_equation_routines .............................   Passed  283.97 sec
        Start  27: SNEP:_Testing_Nonsymmetric_Eigenvalue_Problem_routines
 26/120 Test  #27: SNEP:_Testing_Nonsymmetric_Eigenvalue_Problem_routines ......................................   Passed    1.00 sec
        Start  28: SSEP:_Testing_Symmetric_Eigenvalue_Problem_routines
 27/120 Test  #28: SSEP:_Testing_Symmetric_Eigenvalue_Problem_routines .........................................   Passed    0.70 sec
        Start  29: SSE2:_Testing_Symmetric_Eigenvalue_Problem_routines
 28/120 Test  #29: SSE2:_Testing_Symmetric_Eigenvalue_Problem_routines .........................................   Passed    1.89 sec
        Start  30: SSVD:_Testing_Singular_Value_Decomposition_routines
 29/120 Test  #30: SSVD:_Testing_Singular_Value_Decomposition_routines .........................................   Passed    0.71 sec
        Start  31: SSEC:_Testing_REAL_Eigen_Condition_Routines
 30/120 Test  #31: SSEC:_Testing_REAL_Eigen_Condition_Routines .................................................   Passed    0.60 sec
        Start  32: SSEV:_Testing_REAL_Nonsymmetric_Eigenvalue_Driver
h-vetinari commented 5 months ago

FWIW, the time to run ctest -j2 on our current builds is:

100% tests passed, 0 tests failed out of 120

Total Test time (real) = 6305.40 sec

So flang takes about 50% more time.

h-vetinari commented 5 months ago

on a rerun of the flang-built openblas (without openmp), I now get:

100% tests passed, 0 tests failed out of 120

Total Test time (real) = 8278.32 sec

My suspicion is that the CI agents switching between different CPU types randomly is the cause for the pass or fail. If necessary, I can try to validate that hypothesis.

martin-frbg commented 5 months ago

maybe alternating between avx512 and non-avx512 , common problem with azure-ci and ISTR mmuetzel had noted problems with win-llvm and avx512 previously

martin-frbg commented 5 months ago

omp mod problem could also be missing include path to modules in llvm install path. will try to look at your logs later if able

h-vetinari commented 5 months ago

It fails with the following instructions found for the CPU of the CI agent (according to numpy):

    "found": [
      "SSSE3",
      "SSE41",
      "POPCNT",
      "SSE42",
      "AVX",
      "F16C",
      "FMA3",
      "AVX2",
      "AVX512F",
      "AVX512CD",
      "AVX512_SKX"
    ],
    "not found": [
      "AVX512_CLX",
      "AVX512_CNL",
      "AVX512_ICL"
    ]

I managed to fix the omp_lib.mod problem, the openmp-enabled build now fails with:

[17746/19184] Linking Fortran shared library lib\openblas.dll
FAILED: lib/openblas.dll lib/Release/openblas.lib 
C:\Windows\system32\cmd.exe /C "C:\Windows\system32\cmd.exe /C "%BUILD_PREFIX%\Library\bin\cmake.exe -E __create_def %SRC_DIR%\build\CMakeFiles\openblas_shared.dir\.\exports.def %SRC_DIR%\build\CMakeFiles\openblas_shared.dir\.\exports.def.objs --nm=%BUILD_PREFIX%\Library\bin\llvm-nm.exe && cd %SRC_DIR%\build" && %BUILD_PREFIX%\Library\bin\cmake.exe -E vs_link_dll --intdir=CMakeFiles\openblas_shared.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\mt.exe --manifests  -- C:\PROGRA~1\LLVM\bin\lld-link.exe /nologo @CMakeFiles\openblas_shared.rsp  /out:lib\openblas.dll /implib:lib\Release\openblas.lib /pdb:lib\openblas.pdb /dll /version:0.3 /machine:x64 /INCREMENTAL:NO /DEBUG /OPT:REF /OPT:ICF  /DEF:CMakeFiles\openblas_shared.dir\.\exports.def  -libpath:"D:/bld/openblas_1719644456116/_build_env/Library/lib" -libpath:"D:/bld/openblas_1719644456116/_build_env/Library/lib/clang/19/lib/windows"  && cd ."
LINK: command "C:\PROGRA~1\LLVM\bin\lld-link.exe /nologo @CMakeFiles\openblas_shared.rsp /out:lib\openblas.dll /implib:lib\Release\openblas.lib /pdb:lib\openblas.pdb /dll /version:0.3 /machine:x64 /INCREMENTAL:NO /DEBUG /OPT:REF /OPT:ICF /DEF:CMakeFiles\openblas_shared.dir\.\exports.def -libpath:D:/bld/openblas_1719644456116/_build_env/Library/lib -libpath:D:/bld/openblas_1719644456116/_build_env/Library/lib/clang/19/lib/windows /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: warning: ignoring unknown argument '-lpthreads'
lld-link: error: undefined symbol: omp_get_max_threads
martin-frbg commented 5 months ago

The lpthreads argument error is probably bogus. Do you see -fopenmp or -lomp in the build log ? Do you see omp_get_max_threads in nm or dumpbin output of the llvm-provided libomp ?

h-vetinari commented 5 months ago

Passing configuration (non-openmp):

    "found": [
      "SSSE3",
      "SSE41",
      "POPCNT",
      "SSE42",
      "AVX",
      "F16C",
      "FMA3",
      "AVX2"
    ],
    "not found": [
      "AVX512F",
      "AVX512CD",
      "AVX512_SKX",
      "AVX512_CLX",
      "AVX512_CNL",
      "AVX512_ICL"
    ]

So indeed some AVX512 issue seems likely

h-vetinari commented 5 months ago

Do you see -fopenmp or -lomp in the build log ?

Yeah, I'm even passing that explicitly in -DOpenMP_Fortran_FLAGS=-fopenmp. To be honest, I have no idea where the -lpthreads is coming from.

martin-frbg commented 5 months ago

seen some google hits suggesting that it is simply fallout from one of cmake's standard configuration check scripts and can be ignored. but omp_get_max_threads should definitely be in llvm's libomp (or at least used to be in earlier versions)

h-vetinari commented 5 months ago

Looking closer, I'm almost certain this is a question of the wrong library path being picked:

-libpath:"D:/bld/openblas_1719644456116/_build_env/Library/lib" -libpath:"D:/bld/openblas_1719644456116/_build_env/Library/lib/clang/19/lib/windows"

Both of these are pointing to the build environment, not the host where openmp is actually present. Perhaps the path is constructed relative to clang/flang?

h-vetinari commented 4 months ago

Two good news, two bad news

Good news #1

I managed to build openblas in conda-forge with flang 19 and the following patch:

commit 4cab116c4e811f977c816f3027df57d1fba3bd0c
Author: H. Vetinari <h.vetinari@gmx.com>
Date:   Mon Jul 1 09:05:00 2024 +1100

    explicitly link to OpenMP

    bump CMake minimum to adopt CMP0074 for checking OpenMP_ROOT, see
    https://cmake.org/cmake/help/latest/policy/CMP0074.html

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 096ca88b3..ef505f70d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,7 +2,7 @@
 ## Author: Hank Anderson <hank@statease.com>
 ##

-cmake_minimum_required(VERSION 2.8.5)
+cmake_minimum_required(VERSION 3.12)

 project(OpenBLAS C ASM)

@@ -100,6 +100,10 @@ endif()

 message(WARNING "CMake support is experimental. It does not yet support all build options and may not produce the same Makefiles that OpenBLAS ships with.")

+if (USE_OPENMP)
+  find_package(OpenMP REQUIRED)
+endif ()
+
 include("${PROJECT_SOURCE_DIR}/cmake/utils.cmake")
 include("${PROJECT_SOURCE_DIR}/cmake/system.cmake")

@@ -256,6 +260,15 @@ if (${CMAKE_SYSTEM_NAME} MATCHES "AIX|Android|Linux|FreeBSD|OpenBSD|NetBSD|Drago
   endif()
 endif()

+if (USE_OPENMP)
+  if(BUILD_STATIC_LIBS)
+    target_link_libraries(${OpenBLAS_LIBNAME}_static OpenMP::OpenMP_C)
+  endif()
+  if(BUILD_SHARED_LIBS)
+    target_link_libraries(${OpenBLAS_LIBNAME}_shared OpenMP::OpenMP_C)
+  endif()
+endif()
+
 # Seems that this hack doesn't required since macOS 11 Big Sur
 if (APPLE AND BUILD_SHARED_LIBS AND CMAKE_HOST_SYSTEM_VERSION VERSION_LESS 20)
   set (CMAKE_C_USE_RESPONSE_FILE_FOR_OBJECTS 1)

This is to fix the linkage errors for symbols like omp_get_max_threads mentioned above. I'm not claiming that the patch is directly upstreamable, but if you want I can open a PR.

Bad news #1

Finding LLVM's OpenMP correctly using CMake's find_package(OpenMP) is apparently fraught. Even using the OpenMP_ROOT work-around noted here, the Fortran-bits do not get picked up correctly, requiring a manual override:

    REM not picked up by `find_package(OpenMP)` for some reason
    set "CMAKE_EXTRA=-DOpenMP_Fortran_FLAGS=-fopenmp -DOpenMP_Fortran_LIB_NAMES=libomp"
    set "FFLAGS=%FFLAGS% -I%LIBRARY_PREFIX%\include"

Good news #2

The test suite is extraordinarily faster than without openmp:

Total Test time (real) =  98.31 sec

That's roughly 85x faster than the result we get for llvm-flang-built OpenBLAS without OpenMP (which is probably too good to be true, in the sense that the pthreads variant is probably being pessimized somewhere/somehow), and still ~65x faster than the current pthreads-based production builds; c.f. also https://github.com/conda-forge/openblas-feedstock/issues/160

Bad news #2

The same kind of failures as with pthreads occur on agents with AVX512 instructions:

97% tests passed, 4 tests failed out of 120

The following tests FAILED:
      5 - sblas3 (Failed)
      8 - dblas3 (Failed)
     11 - cblas3 (Failed)
     15 - zblas3 (Failed)
martin-frbg commented 1 month ago

@h-vetinari how did you fix the omp_lib.mod problem back then ? I get the impression that this module is genuinely missing from the llvm 19.1.2and llvm-openmp packages currently on conda-forge (and same when specifying the path conda-forge/label/llvm_dev)

martin-frbg commented 1 month ago

err, I have belatedly found your openmp-feedstock PR 134 that got merged three weeks ago - has no new package been released since then, or do I need a specific incancantation to install it from conda-forge ?

h-vetinari commented 1 month ago

has no new package been released since then, or do I need a specific incancantation to install it from conda-forge ?

During the rc-phase the .mod files were just part of llvm-openmp itself, but we ended up separating those out into a separate output called llvm-openmp-fortran. If you specify that in the host environment (or just install it if you're working locally), you should be good (our compiler activation automatically does -I%LIBRARY_PREFIX%\include).

martin-frbg commented 1 month ago

Thanks, got it installed now. Adding the include paths to the FFLAGS still seems to be necessary to get it picked up, will see how the build goes...

h-vetinari commented 1 month ago

FYI: If you install flang_win-64 as opposed to flang, you get the "activated" compiler that does set FFLAGS and a few other things.

martin-frbg commented 1 month ago

hmm, that appears to break the build for me

h-vetinari commented 1 month ago

hmm, that appears to break the build for me

Here's what we do in the flang activation: https://github.com/conda-forge/flang-activation-feedstock/blob/main/recipe/activate.bat

If you just install flang (+lld), you can always set these flags manually if you'd like.

martin-frbg commented 1 month ago

Thank you - actually not sure if what I'm seeing now is a problem with flang or the underlying Windows installation (this is a brand new AMD "strix point" laptop that shows some idiosyncrasies).Both ninja and makefile drivers suddenly fail by trying to link random objects that have not been compiled yet. System time appears to be sane though - I'll probably try reinstalling everything starting from miniforge...

martin-frbg commented 1 month ago

Turns out I had a stray -I in my FFLAGS that caused the special effects 😭

martin-frbg commented 1 month ago

So I can confirm that OpenMP does not appear to cause additional accuracy errors with LLVM19.1 on Windows anymore. @h-vetinari would you like to prepare a PR for your OpenMP-related fixes in CMakeLists.txt ? The minimum_required has already been adjusted in the meantime. After that, only the linking of xerbla in the tests needs fixing, and the cause of the unresolved __imp_... symbols when doing static linking. (The latter may well be a purely local problem with my installation of the VS2022 community edition)

h-vetinari commented 1 month ago

@h-vetinari would you like to prepare a PR for your OpenMP-related fixes in CMakeLists.txt ?

happy to

After that, only the linking of xerbla in the tests needs fixing, and the cause of the unresolved __imp_... symbols when doing static linking.

Not sure if we're talking about the same problems, but I managed to build lapack (including tests & xerbla shenanigans) with flang. Needed two patches to CBLAS/testing/* though.

martin-frbg commented 1 month ago

thanks. I cannot reproduce the __impl linking problems anymore so they may have been another aspect of my initially broken setup.

martin-frbg commented 2 days ago

Meh, thought we had everything working with the conda-forge build of LLVM19 but the simple test code from #4886 crashes on my new Zen5 laptop unless I compile OpenBLAS for Haswell.