h-vetinari opened this issue 2 weeks ago
Only the SYCL components of MKL need 2.28, as it is needed by the DPC++ runtime.
Deferring to @mkrainiuk for the remaining issues.
> the simplest upgrade runs into constraints with libhwloc, see here (not the fault of this feedstock per se, but we cannot test)
That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).
@mkrainiuk, please advise what's going on here or how we can fix it.
Looks like oneMKL might have some API changes, adding @sknepper for confirmation.
Another potential problem might be that the compilation and linkage with oneMKL are incorrect (e.g. the test was built with the `-DMKL_ILP64` flag but it used the LP64 oneMKL interface library). Could someone help me get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of the failed step from https://github.com/conda-forge/blas-feedstock/pull/128 ...
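For illustration, here is a minimal sketch (not from the feedstock; it assumes the standard oneMKL C headers) of the kind of call where such a mismatch bites: with `-DMKL_ILP64`, `MKL_INT` below becomes a 64-bit integer, and if the binary is then linked against the LP64 interface library, every dimension argument is read at the wrong width, producing exactly the "Parameter x was incorrect on entry to ..." errors.

```c
/* Minimal sketch (not from the feedstock): a Fortran-style BLAS call
 * whose integer arguments are MKL_INT. Compiled with -DMKL_ILP64,
 * MKL_INT is 64-bit; if the binary is nonetheless linked against the
 * LP64 interface library, the parameter checker reads 4-byte integers
 * out of 8-byte slots and reports
 * "Parameter x was incorrect on entry to DGEMM". */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    MKL_INT n = 2;              /* 32- or 64-bit, depending on -DMKL_ILP64 */
    double a[4] = {1, 0, 0, 1}; /* 2x2 identity, column-major */
    double b[4] = {1, 2, 3, 4};
    double c[4] = {0, 0, 0, 0};
    double one = 1.0, zero = 0.0;
    /* All integer arguments must match the interface chosen at link time. */
    dgemm("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    printf("c = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]); /* expect b */
    return 0;
}
```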
Thanks for the response!
> could someone help me get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of the failed step from conda-forge/blas-feedstock#128 ...
In the blas metapackage we only build the tests from https://github.com/Reference-LAPACK/lapack/ and run them against the various BLAS implementations. The MKL packages themselves aren't built in conda-forge; they're only repackaged, so I cannot offer logs on that. Presumably those should be available somewhere Intel-internally?
> e.g. the test was built with the `-DMKL_ILP64` flag but it used the LP64 oneMKL interface library
Not sure if my info there is incorrect or out of date, but didn't MKL use to build both ILP64 & LP64 symbols into the same library?
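One way to check this empirically would be to probe the DLL's exports directly. A sketch using plain Win32 calls (not anything from the feedstock; the symbol names probed are guesses and may need adjusting to the actual export naming):

```c
/* Sketch: probe which GEMM entry points mkl_rt.2.dll actually
 * exports, to see whether LP64 and ILP64/_64 symbols coexist in the
 * same library. The probed names are guesses, not confirmed exports. */
#include <stdio.h>
#include <windows.h>

int main(void) {
    HMODULE h = LoadLibraryA("mkl_rt.2.dll");
    if (!h) { fprintf(stderr, "cannot load mkl_rt.2.dll\n"); return 1; }
    const char *names[] = { "dgemm", "DGEMM", "dgemm_64", "DGEMM_64" };
    for (int i = 0; i < 4; i++)
        printf("%-10s %s\n", names[i],
               GetProcAddress(h, names[i]) ? "exported" : "missing");
    FreeLibrary(h);
    return 0;
}
```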
> That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).
In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari?
As Maria said, these "Parameter x was incorrect on entry to" errors often relate to incorrect configuration of the LP64/ILP64 interfaces.
Selected domains provide API extensions with the _64 suffix (for example, SGEMM_64) for supporting large data arrays in the LP64 library, which enables the mixing of data types in one application. Are you using the LP64 or ILP64 interface library?
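If I read the passage above correctly, the _64 variants let an otherwise-LP64 application make individual 64-bit-integer calls. A hedged sketch of what that mixing would look like, assuming C prototypes (`dgemm_64`, `MKL_INT64`) analogous to the SGEMM_64 example from the docs:

```c
/* Sketch of mixing integer widths in one LP64 application via the
 * _64-suffix extension described above. Assumes oneMKL exposes
 * dgemm_64 taking MKL_INT64 (64-bit) integer arguments alongside the
 * regular LP64 dgemm; verify against the installed headers. */
#include <mkl.h>

void mixed_calls(double *a, double *b, double *c) {
    double one = 1.0, zero = 0.0;

    MKL_INT n32 = 2;   /* regular LP64 call: 32-bit integer arguments */
    dgemm("N", "N", &n32, &n32, &n32, &one,
          a, &n32, b, &n32, &zero, c, &n32);

    MKL_INT64 n64 = 2; /* _64 variant: 64-bit integers, same library */
    dgemm_64("N", "N", &n64, &n64, &n64, &one,
             a, &n64, b, &n64, &zero, c, &n64);
}
```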
> In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari?
Yes, the Linux issue has been resolved in #84; all the remaining problems are on Windows.
> Are you using the LP64 or ILP64 interface library?
So far we haven't been actively distinguishing (that I know of) which integer model we use for MKL (though we do for OpenBLAS, for example). So the answer is probably whatever Reference-LAPACK (3.9 resp. 3.11) does by default on Windows.
How would I be able to set this correctly? Just define `-DMKL_LP64=1` resp. `-DMKL_ILP64=1`? Has the default for this changed in MKL 2025.0 somehow?
Maybe not a direct answer, but there is a tool from Intel to figure out the proper linker arguments: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html
Thanks. This suggests linking `mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib`.
So far, we've only needed to point to `mkl_rt.2.dll`, which is what we've been using as the backend behind the reference-LAPACK interface (which is what we use consistently to compile against, allowing users to choose resp. exchange the actual BLAS implementation in their environments). Is that no longer sufficient?
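For debugging, it may be worth noting that the single dynamic library lets the integer interface be chosen at runtime. A minimal sketch; my understanding (worth double-checking against the 2025.0 docs) is that LP64 is the default when nothing is set:

```c
/* Sketch: when linking only against mkl_rt, the integer interface and
 * threading layer are selected at runtime. Calling this before any
 * other MKL function pins the interface explicitly instead of relying
 * on the default (LP64, as far as I know). The MKL_INTERFACE_LAYER
 * environment variable does the same without recompiling. */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    int layer = mkl_set_interface_layer(MKL_INTERFACE_ILP64);
    if (layer < 0)
        fprintf(stderr, "failed to set interface layer\n");
    else
        printf("active interface layer: %d\n", layer);
    return 0;
}
```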
One other thought I had: there are some known issues on Windows with AMD CPUs, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?
I think Azure Pipelines has various CI agents in their pool, but most are Intel AFAIK (Skylake X or so). OTOH, the fact that it's exactly reproducible across 4+ runs also means that it's either independent of the CPU architecture, or that it's happening on all of the agents we happened to draw.
> One other thought I had: there are some known issues on Windows with AMD CPUs, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?
In general, based on experience with those pipelines, you can expect roughly a 90/10 Intel/AMD ratio.
In addition to the question of whether MKL now really requires `__glibc >=2.28` on Linux, I tested MKL 2025.0 against the test suite from Netlib LAPACK, and it seems there are some substantial test failures:

- testing MKL 2025.0 against LAPACK 3.9.0 together with the switch to flang yields 75/95 failures (logs)
- testing MKL 2025.0 against LAPACK 3.11.0 (together with the switch to flang) also yields 75/95 failures (logs)

The reason I'm almost certain that this is unrelated to the switch to flang is that MKL 2024.2 + flang only has the following failures (logs):
The errors roughly look as follows:

Perhaps this is related to some linkage issue? Was something changed w.r.t. the compiler setup for MKL 2025.0 that could have affected the symbol names?
CC @ZzEeKkAa @Alexsandruss @oleksandr-pavlyk @isuruf