conda-forge / intel_repack-feedstock

A conda-smithy repository for intel_repack.
BSD 3-Clause "New" or "Revised" License

Regressions in MKL 2025.0 #83

Open h-vetinari opened 2 weeks ago

h-vetinari commented 2 weeks ago

In addition to the question of whether mkl now really requires __glibc >=2.28 on linux, I tested MKL 2025.0 against the test suite from netlib lapack, and it seems there are some substantial test failures.

The reason I'm almost certain that this is unrelated to the switch to flang is that MKL 2024.2 + flang only has the following failures (logs):

97% tests passed, 3 tests failed out of 95

Total Test time (real) =  35.58 sec

The following tests FAILED:
      1 - LAPACK-xlintsts_stest_in (Failed)
     22 - LAPACK-xlintstd_dtest_in (Failed)
     57 - LAPACK-xlintstz_ztest_in (Failed)

The errors roughly look as follows:

Intel oneMKL ERROR: Parameter 1 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 2 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 3 was incorrect on entry to ZGEMM .

Perhaps this is related to some linkage issue? Was something changed w.r.t. the compiler setup for MKL 2025.0 that could have affected the symbol names?

CC @ZzEeKkAa @Alexsandruss @oleksandr-pavlyk @isuruf

oleksandr-pavlyk commented 1 week ago

Only the SYCL components of MKL need glibc 2.28, as it is required by the DPC++ runtime.

I defer to @mkrainiuk for the remaining issues.

h-vetinari commented 1 week ago

The simplest upgrade runs into constraints with libhwloc, see here (not the fault of this feedstock per se, but it means we cannot test).

That constraint was fixed, and the same 75/95 failures now appear even without any change to the compilers (logs).

@mkrainiuk, please advise what's going on here or how we can fix it.

mkrainiuk commented 1 week ago

Looks like oneMKL might have some API changes; adding @sknepper for confirmation. Another potential problem might be that the compilation and linking against oneMKL are not correct (e.g. the test was built with the -DMKL_ILP64 flag but linked against the LP64 oneMKL interface library). Could someone help me get the exact build logs with the compilation and link lines? Unfortunately I can't find this information in the log of the failed step from https://github.com/conda-forge/blas-feedstock/pull/128 ...
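For illustration, a minimal C sketch of that mismatch (an illustrative assumption, not taken from the failing build): a caller compiled with -DMKL_ILP64 passes 64-bit size arguments, so if it is then linked against the LP64 interface library, the routine misreads them.

```c
/* Illustrative only: an ILP64-compiled caller. If this object is linked
 * against the LP64 interface library (mkl_intel_lp64 instead of
 * mkl_intel_ilp64), each 64-bit size argument is misread as 32-bit and
 * the call typically aborts with
 * "Intel oneMKL ERROR: Parameter x was incorrect on entry to ...".
 * Example (illustrative) build: icx -DMKL_ILP64 demo.c
 *   -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core */
#include <mkl.h>

int main(void) {
    MKL_INT n = 2;              /* 64-bit with -DMKL_ILP64, 32-bit otherwise */
    double a[4] = {1, 0, 0, 1}; /* 2x2 identity, column-major */
    double b[4] = {1, 2, 3, 4};
    double c[4] = {0, 0, 0, 0};
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    return 0;
}
```

Linked consistently (ILP64 caller with the ILP64 interface library, or LP64 caller with the LP64 one), the same source runs cleanly.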

h-vetinari commented 1 week ago

Thanks for the response!

Could someone help me get the exact build logs with the compilation and link lines? Unfortunately I can't find this information in the log of the failed step from conda-forge/blas-feedstock#128 ...

In the blas metapackage we only build the tests from https://github.com/Reference-LAPACK/lapack/ and run them against the various BLAS implementations. The MKL packages themselves aren't built in conda-forge; they're only repackaged, so I cannot offer logs for that. Presumably those should be available somewhere Intel-internally?

e.g. the test was built with the -DMKL_ILP64 flag but linked against the LP64 oneMKL interface library

Not sure if my info here is incorrect or out of date, but didn't MKL previously build both ILP64 & LP64 symbols into the same library?
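One hedged way to check that from the outside is to probe which entry points a given MKL DLL actually exports; a sketch for Windows, where the failures occur (the export names below are plausible candidates, not a verified list):

```c
/* Sketch under assumptions, not from this thread: probe which
 * integer-interface entry points an MKL DLL exports. Export names can
 * vary by version and case convention, hence the multiple candidates. */
#include <stdio.h>
#include <windows.h>

int main(void) {
    /* mkl_rt.2.dll is the DLL named later in this thread; adjust as needed. */
    HMODULE h = LoadLibraryA("mkl_rt.2.dll");
    if (!h) { fprintf(stderr, "could not load mkl_rt.2.dll\n"); return 1; }
    const char *names[] = { "dgemm", "DGEMM", "dgemm_64",
                            "cblas_dgemm", "cblas_dgemm_64" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
        printf("%-16s %s\n", names[i],
               GetProcAddress(h, names[i]) ? "exported" : "missing");
    FreeLibrary(h);
    return 0;
}
```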

sknepper commented 1 week ago

That constraint was fixed, and the same 75/95 failures now appear even without any change to the compilers (logs).

In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari?

As Maria said, these "Parameter x was incorrect on entry to" errors often relate to incorrect configuration of the LP64/ILP64 interfaces.

Selected domains provide API extensions with the _64 suffix (for example, SGEMM_64) for supporting large data arrays in the LP64 library, which enables the mixing of data types in one application. Are you using the LP64 or ILP64 interface library?
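As a hedged illustration of that extension (assuming a oneMKL recent enough that mkl.h declares cblas_dgemm_64; not verified against this feedstock), both integer widths can be mixed in a single LP64-linked translation unit:

```c
/* Sketch of the _64-suffix extension described above: in an LP64 build
 * the plain routine takes 32-bit integers, while the _64 variant takes
 * MKL_INT64, so both widths can coexist in one application. */
#include <mkl.h>

void mixed_widths(const double *a, const double *b, double *c, int n) {
    MKL_INT   n32 = n;   /* 32-bit sizes, standard LP64 entry point */
    MKL_INT64 n64 = n;   /* 64-bit sizes, usable beyond 2^31 - 1 */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n32, n32, n32, 1.0, a, n32, b, n32, 0.0, c, n32);
    cblas_dgemm_64(CblasColMajor, CblasNoTrans, CblasNoTrans,
                   n64, n64, n64, 1.0, a, n64, b, n64, 0.0, c, n64);
}
```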

h-vetinari commented 1 week ago

In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari?

Yes, the linux issue has been resolved in #84; all the remaining problems are on windows.

Are you using the LP64 or ILP64 interface library?

So far we haven't (to my knowledge) been actively distinguishing which integer model we use for MKL (though we do for OpenBLAS, for example). So the answer is probably whatever Reference-LAPACK (3.9 and 3.11, respectively) does by default on windows.

How would I set this correctly? Just define -DMKL_LP64=1 or -DMKL_ILP64=1, respectively? Has the default for this somehow changed in MKL 2025.0?

ZzEeKkAa commented 1 week ago

Maybe not a direct answer, but there is a tool from Intel to figure out the proper linker arguments: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html

h-vetinari commented 1 week ago

Thanks. This suggests linking mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib

So far, we've only needed to point to mkl_rt.2.dll, which is what we've been using as the backend behind the reference-LAPACK interface (the interface we consistently compile against, allowing users to choose or exchange the actual BLAS implementation in their environments).

Is that no longer sufficient?
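For reference, a minimal sketch of how the integer interface can be pinned when dispatching through mkl_rt, based on Intel's documented mkl_set_interface_layer control (whether this applies to the setup here is an assumption):

```c
/* Sketch assuming the single-dynamic-library dispatcher (mkl_rt): the
 * LP64/ILP64 choice can be pinned at runtime, either with this call or
 * with the MKL_INTERFACE_LAYER environment variable, before the first
 * computational MKL routine runs; otherwise mkl_rt uses its default. */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    /* Returns the layer actually in effect, or -1 on failure. */
    int layer = mkl_set_interface_layer(MKL_INTERFACE_LP64);
    printf("interface layer in effect: %d\n", layer);
    return 0;
}
```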

sknepper commented 1 week ago

One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

h-vetinari commented 1 week ago

I think Azure Pipelines has various CI agents in its pool, but most are Intel AFAIK (Skylake-X or so). OTOH, the fact that it's exactly reproducible across 4+ runs means that either it's independent of the CPU architecture, or it's happening on all of the agents we happened to draw.

napetrov commented 1 week ago

One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

In general, based on experience with those pipelines, you can expect around a 90/10 Intel/AMD ratio.