evalf / nutils

The nutils project
http://www.nutils.org/
MIT License

MKL 2024.0.0 Pardiso fails on Windows #843

Open gertjanvanzwieten opened 8 months ago

gertjanvanzwieten commented 8 months ago

With the release of Intel MKL version 2024.0.0 on Nov 14 2023, certain invocations of libmkl.pardiso fail on Windows with OSError: [WinError -1066598274] Windows Error 0xc06d007e. An example of failing unit tests can be found at https://github.com/evalf/nutils/actions/runs/7247258878/job/19740839287.

It is unclear what causes these failures. Since we don't have access to a Windows platform to investigate the issue locally, we resorted to placing an upper bound on the supported MKL version (#837, #835) until a better solution presents itself. We will keep this issue open until the bound can be removed.
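As a rough illustration of the upper bound described above, the following sketch checks whether a pip-installed `mkl` falls in the range reported as failing. The function names are hypothetical and not part of nutils, and the rule that only the major version and the Windows platform matter is an assumption drawn from the "<2024" bound mentioned in this thread.

```python
import sys
from importlib import metadata


def mkl_needs_upper_bound(version, platform=sys.platform):
    """Return True if this MKL version falls in the range reported as failing.

    Assumption: only the major version and the Windows platform matter,
    mirroring the "<2024" bound discussed in this issue.
    """
    major = int(version.split(".")[0])
    return platform == "win32" and major >= 2024


def installed_mkl_version():
    """Return the version of the pip-installed `mkl` distribution, or None."""
    try:
        return metadata.version("mkl")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_mkl_version()
    if version is None:
        print("mkl is not installed via pip")
    elif mkl_needs_upper_bound(version):
        print(f"mkl {version} on Windows: consider pinning mkl<2024")
    else:
        print(f"mkl {version}: not in the affected range")
```

This only reports the installed version; the bound itself would still be enforced through the dependency specification.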

gertjanvanzwieten commented 8 months ago

Hi @ghsilva, I noticed your message to PyPardiso https://github.com/haasad/PyPardiso/issues/58#issuecomment-1857227234 in which you offer help to get rid of the "<2024" workaround. We would also love to remove it and welcome any insights you might have on the matter. Additionally, perhaps our independent bindings can provide an extra data point to establish what causes the latest version to fail on Windows.

ghsilva commented 8 months ago

Hello @gertjanvanzwieten, in the case of PyPardiso it seems to fail on Windows because multiple copies of OpenMP are installed on the system. PyPardiso uses pip to install dependencies; is nutils also using pip, or something else (e.g. conda, conda-intel, conda-forge)? I will check our Windows distribution channels for Python to try to find the problem.
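One way to look for multiple OpenMP copies is to scan the relevant directories for runtime libraries. The sketch below is only an illustration: the helper name is hypothetical, and the file name patterns (Intel's `libiomp5*`, LLVM's `libomp*`, Microsoft's `vcomp*`) are assumptions about which runtimes might be present, not a diagnosis taken from this thread.

```python
import fnmatch
import os


def find_openmp_runtimes(roots, patterns=("libiomp5*", "libomp*", "vcomp*")):
    """Walk each root and collect files whose names look like an OpenMP runtime.

    The default patterns are assumptions about common runtime file names;
    adjust them for the system under study.
    """
    found = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Compare case-insensitively, since Windows file names are.
                if any(fnmatch.fnmatch(name.lower(), p) for p in patterns):
                    found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling `find_openmp_runtimes([sys.prefix])`, possibly extended with the entries of `PATH`, and finding more than one hit would suggest that several copies are available to the process, though not necessarily that more than one is actually loaded.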

gertjanvanzwieten commented 8 months ago

Hi @ghsilva, thanks for your feedback. Indeed, we use pip to install MKL for GitHub Actions, which is the only platform we have available to study the issue. As such, it is a bit difficult to establish what other libraries exist on the system. The strange thing is that there appear to be two failure modes: in some cases the linear system is solved fine but the WinError is raised during memory release (phase -1), while in other cases it is the factorization that fails (phase 12). Could the presence of multiple installed versions account for this?

ghsilva commented 8 months ago

@gertjanvanzwieten it doesn't sound like the same problem that we identified in PyPardiso; please take a look at my comment there. I am trying to reproduce the issue on my system, but by default "python -m build" in nutils does not use oneMKL. Could you please share the build command and how to run the tests, so that I can be sure they are using oneMKL on Windows? I noticed in the GitHub Actions that you are using oneMKL with one thread only. Is that specific to the test environment, or is it how you always want oneMKL to behave with nutils? If that is what nutils really needs, I will look into better ways to use sequential oneMKL as part of your library, and create a separate issue for it so that we do not misuse the issue we are tracking here.

gertjanvanzwieten commented 8 months ago

@ghsilva indeed the symptoms do seem to be a bit different in our case, so maybe we are looking at different issues both brought on by the new version. Thanks for looking into it.

There is no build command in nutils, you should be able to pip install and then run any of the example scripts, or python -m unittest to run the unit tests. Set NUTILS_MATRIX=mkl to be sure that the Pardiso solver is used.

In case MKL is installed through pip, we set the environment variable NUTILS_MATRIX_MKL_LIB to the directory that contains libmkl_rt, as this is not part of the loadlib search path. For GitHub Actions we use this script to detect and configure it: https://github.com/evalf/nutils/blob/master/devtools/gha/configure_mkl.py.
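For a local setup, the configuration described above could be approximated along the following lines. This is a sketch under stated assumptions and not a copy of the linked script: the helper names and the file name patterns are hypothetical, and only the two environment variable names come from this thread.

```python
import glob
import os
import sys


def find_mkl_rt(prefixes, patterns=("libmkl_rt*", "mkl_rt*")):
    """Return paths under the given prefixes whose file name matches a pattern.

    The patterns are assumptions covering Unix-style (libmkl_rt...) and
    Windows-style (mkl_rt...) names of the MKL runtime library.
    """
    hits = []
    for prefix in prefixes:
        for pattern in patterns:
            hits.extend(glob.glob(os.path.join(prefix, "**", pattern), recursive=True))
    return sorted(set(hits))


def mkl_environment(library_path):
    """Build the environment settings described above for one library file."""
    return {
        "NUTILS_MATRIX": "mkl",
        # Point at the directory that contains the library.
        "NUTILS_MATRIX_MKL_LIB": os.path.dirname(library_path),
    }


if __name__ == "__main__":
    candidates = find_mkl_rt([sys.prefix])
    if candidates:
        for key, value in mkl_environment(candidates[0]).items():
            print(f"{key}={value}")
    else:
        print("no MKL runtime library found under", sys.prefix)
```

The printed assignments can then be exported in the shell before running the example scripts or the unit tests.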

The single threaded configuration is unrelated and optional.

ghsilva commented 8 months ago

@gertjanvanzwieten I had the following setup that fails with PyPardiso and nutils:

$> conda create -n test_nutils
$> conda activate test_nutils
$> pip show mkl

Name: mkl
Version: 2024.0.0

$> python --version

Python 3.11.7

$> conda env config vars set NUTILS_MATRIX=mkl
$> conda env config vars set NUTILS_DEBUG=all
$> conda activate test_nutils
$> conda env config vars list

NUTILS_MATRIX = mkl
NUTILS_DEBUG = all

$> python -m build
$> pip install .\dist\nutils-9a13-py3-none-any.whl --force-reinstall

$> python -m unittest  # All tests fail, I believe because nutils could not load the mkl dll:

======================================================================
ERROR: nutils.matrix (unittest.loader._FailedTest.nutils.matrix)

ImportError: Failed to import test module: nutils.matrix
  File "C:\Users\GHSILVA\Documents\nutils\nutils\matrix_mkl.py", line 18, in
    raise BackendNotAvailable('the Intel MKL matrix backend requires libmkl to be installed (try: pip install mkl)')
...
nutils.matrix.BackendNotAvailable: the Intel MKL matrix backend requires libmkl to be installed (try: pip install mkl)

Installing python within conda is a workaround to fix the intel-openmp2024.* dll issue:

$> conda install python=3.10.11  # using the same version as the GitHub Actions

$> pip install treelog stringly bottombar appdirs nutils_poly psutil matplotlib

For the command below, you could install regular numpy instead; I do not believe the Intel extension to numpy was necessary for this to work:

$> python -m pip install -i https://pypi.anaconda.org/intel/simple numpy

Successfully installed intel-cmplr-lib-rt-2024.0.0 intel-openmp-2024.0.0 mkl-2024.0.0 mkl-service-2.4.0 mkl_fft-1.3.8 mkl_random-1.2.4 mkl_umath-0.1.1 numpy-1.24.4 six-1.16.0 tbb-2021.11.0 tbb4py-2021.11.0

$> python -m unittest

It completed the tests with the following message:


Ran 12440 tests in 684.368s

OK (skipped=152)

I will try to reproduce an environment closer to what you have in the GitHub Actions tests, as I am getting different error messages.

gertjanvanzwieten commented 8 months ago

(@ghsilva sorry for the late reply, I missed a notification)

If the (local) pip installation directory is not in LD_LIBRARY_PATH, then the library can alternatively be specified using the dedicated variable NUTILS_MATRIX_MKL_LIB=/path/to/libmkl_rt.so.2. This should fix the non-conda installation. Conda probably has its own way to set these paths, which explains the passing tests. Indeed, they pass on our end as well (on Linux), which makes the issue hard to study. The only failing platform so far is the Windows platform provided by GitHub Actions.
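To check a given installation before running the tests, something like the following sketch may help. It tests whether a directory is listed in LD_LIBRARY_PATH and whether a library file can be loaded at all. Both helper names are hypothetical, and the check uses plain `ctypes` rather than anything from nutils, so a successful load here does not guarantee that nutils will find the library the same way.

```python
import ctypes
import os


def in_search_path(directory, env=os.environ, variable="LD_LIBRARY_PATH"):
    """Return True if `directory` is one of the entries of the given variable."""
    entries = [e for e in env.get(variable, "").split(os.pathsep) if e]
    target = os.path.normpath(directory)
    return any(os.path.normpath(e) == target for e in entries)


def can_load(library_path):
    """Return (True, None) if the shared library loads, else (False, reason)."""
    try:
        ctypes.CDLL(library_path)
    except OSError as exc:
        # Raised both for a missing file and for unresolved dependencies.
        return False, str(exc)
    return True, None
```

For example, `can_load(os.environ["NUTILS_MATRIX_MKL_LIB"])` would show whether the configured file is loadable, assuming the variable holds the path to the library file itself as in the comment above.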

Thanks for your continued attention to this issue!