desihub / desiconda

Tools to install DESI dependencies from scratch.
BSD 3-Clause "New" or "Revised" License

add mkl_fft and llvmlite to software stack? #71

Open moustakas opened 2 months ago

moustakas commented 2 months ago

As part of a major performance-focused restructuring of https://github.com/desihub/fastspecfit that I'm doing with @jdbuhler, we would like to request that two additional packages be added to the DESI conda stack:

- mkl_fft
- llvmlite

However, these packages are not compatible with the currently pinned version of mkl=2020.0. There's a comment in https://github.com/desihub/desiconda/blob/main/conf/conda-pkgs.sh that says:

mkl=2020.0 because that is the last version that guarantees bitwise identical output for bitwise identical input

Is there a report, thread, or ticket which discusses this issue?

The latest version of mkl is 2023.1.0. Can we test the effect of upgrading this package? https://anaconda.org/anaconda/mkl/files
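One way to evaluate an upgrade is a run-to-run reproducibility check against the property the pin is meant to guarantee. A minimal sketch (an illustration, not the actual desiconda test suite or the reproducer linked below) using numpy's eigh:

```python
# Hedged sketch, not the actual desiconda test suite: check whether
# repeated linear-algebra calls return bitwise-identical output, which
# is the property the mkl=2020.0 pin is meant to guarantee.
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((500, 500))
m = a + a.T  # symmetric input, as eigh requires

w1, v1 = np.linalg.eigh(m)
w2, v2 = np.linalg.eigh(m)

# Under the pinned MKL these should match exactly; under a newer MKL's
# default settings they may differ at the round-off level.
print("eigenvalues identical:", np.array_equal(w1, w2))
print("eigenvectors identical:", np.array_equal(v1, v2))
```

Running this under a candidate environment built with a newer mkl would show whether the non-reproducibility returns.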

jdbuhler commented 2 months ago

On 8/6/2024 1:23 PM, Moustakas wrote:

As part of a major performance-focused restructuring of https://github.com/desihub/fastspecfit we would like to request that two additional packages be added to the DESI conda stack:

John -- I think llvmlite is already there as part of Numba. The other package I was hoping for besides mkl_fft is the Intel SVML library, which is installed via the package 'intel-cmplr-lib-rt'.

Jeremy

sbailey commented 2 months ago

Context on pinning mkl=2020.0: when porting to Cori, we found that numpy/scipy.linalg.eigh returned non-reproducible answers even when run back-to-back on the same input, in the same Python script, on the same machine. @lastephey traced this to an MKL issue and put together a reproducer at https://github.com/lastephey/eigh-mkl-bug. She reported it to Intel via our internal contacts on the NESAP team, and they filed internal Intel tickets about it, concluding that

Thank you for the detailed reproducer! I was able to reproduce the behavior, but this is not a bug. Intel MKL does not guarantee bit-wise identical results by default, as there may be a performance impact to do so. However, Intel MKL does offer a conditional numerical reproducibility feature that will provide reproducible results, subject to some limitations. For instance, there are some codepaths in Intel MKL that rely on aligned data, but if CNR mode is enabled then those codepaths are not taken regardless of data alignment.

If I enable CNR mode via export MKL_CBWR=AUTO, then the test passes.

You can read more about reproducibility:
https://software.intel.com/content/www/us/en/develop/articles/introduction-to-the-conditional-numerical-reproducibility-cnr.html
https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/obtaining-numerically-reproducible-results/getting-started-with-conditional-numerical-reproducibility.html

If run-to-run bit-wise reproducibility is needed for Python, then perhaps CNR mode should be enabled by default when using Intel MKL.

Their argument is basically that reproducibility at the machine round-off level is meaningless and we shouldn't worry about it, favoring performance gains instead. In practice, that makes testing a huge pain because the output always changes and you have to check in detail every time whether the change was bad or not.

While they were sorting that out, we found that mkl=2020.0 was the last version that did not have this problem, so we pinned it and haven't seriously explored the MKL_CBWR=AUTO option (its performance impact, whether it really fixes the problem in all cases, etc.). At some point this is going to bite us when our other required packages need a newer MKL, but we're not there yet.
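For reference, because MKL reads MKL_CBWR when the library is loaded, a Python process has to set it before numpy is first imported (the variable is a no-op for non-MKL BLAS builds). A hedged sketch of that ordering, reusing the same back-to-back eigh check:

```python
# Hedged sketch: MKL reads MKL_CBWR from the environment at library
# load, so it must be set before the first numpy import pulls in MKL.
# "AUTO" is one of the values documented in Intel's CNR guide; the
# variable has no effect on non-MKL BLAS builds of numpy.
import os
os.environ["MKL_CBWR"] = "AUTO"  # must precede the first numpy import

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 200))
m = a @ a.T  # symmetric input for eigh

w1, _ = np.linalg.eigh(m)
w2, _ = np.linalg.eigh(m)
# Per the Intel response above, with CNR enabled back-to-back runs
# should be bitwise identical.
print("bitwise identical:", np.array_equal(w1, w2))
```

In a batch environment the same effect can be had by exporting the variable in the job script, which avoids depending on import order inside the code.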

Regarding mkl_fft: in general I'm hesitant to bring in more required dependencies, and we're already in a bit of dependency hell with the desiconda packages, so there has to be a pretty strong case for the improvement offered by mkl_fft.

I also view MKL-specific dependencies as risky, since the future isn't guaranteed to be Intel/MKL-compatible. I wouldn't be surprised if some future machine used something like https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/ that wasn't 100% MKL compatible, and we ended up using some MKL alternative instead. Ideally that would be transparent at the numpy/scipy level, but if we find that we need to install MKL-specific dependencies and our code doesn't work when they aren't installed (vs. just running faster when they are), that's a warning flag for future maintainability.
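The "runs faster if installed, still works if not" distinction above can be kept explicit in code. A minimal sketch (an illustration of the pattern, not fastspecfit's actual implementation) that treats mkl_fft as an optional accelerator with a numpy.fft fallback:

```python
# Hedged sketch of the optional-dependency pattern: use mkl_fft when
# available, fall back to numpy.fft otherwise, so the stack never
# hard-requires an MKL-specific package.
import numpy as np

try:
    import mkl_fft  # optional accelerated backend; numpy.fft-compatible API

    def fft(x):
        return mkl_fft.fft(x)
except ImportError:

    def fft(x):
        return np.fft.fft(x)

x = np.ones(8)
y = fft(x)
print(y[0])  # DC component equals the sum of the input, here 8
```

Either branch yields the same answers up to round-off, so tests can run against whichever backend is present.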