firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

Broken MKL at runtime #1562

Closed angus-g closed 1 year ago

angus-g commented 4 years ago

I've managed to build Firedrake on our new HPC platform in Australia, gadi. I had intel-mkl/2019.3.199 loaded as an environment module during the install, which went cleanly. However, the test suite then hit a lot of errors. For example, running the ma-demo (just because it's the first failing test from pytest -x), I got:

Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so or libmkl_def.so.

Following https://stackoverflow.com/a/37160780/11838997, it seems like MKL isn't being linked in properly somewhere. Indeed, setting LD_PRELOAD=$MKLROOT/lib/intel64/libmkl_core.so:$MKLROOT/lib/intel64/libmkl_sequential.so allows me to run the demo as expected.
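
For anyone wanting to reuse the workaround, here is a minimal sketch of building that LD_PRELOAD value from the MKL root. The function name mkl_preload_list is made up for illustration, and the lib/intel64 layout is assumed to match the paths quoted above; other MKL installs may differ.

```shell
# Build the colon-separated LD_PRELOAD value used in the workaround above.
# mkl_preload_list is an illustrative helper, not part of MKL or Firedrake.
mkl_preload_list() {
    root="${1:?usage: mkl_preload_list MKLROOT}"
    printf '%s:%s\n' \
        "$root/lib/intel64/libmkl_core.so" \
        "$root/lib/intel64/libmkl_sequential.so"
}
```

It can then be applied to a single command rather than exported for the whole session, e.g. `LD_PRELOAD="$(mkl_preload_list "$MKLROOT")" python demo.py`, where demo.py stands in for whichever script is failing.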

As far as I can tell, the error cropped up from the u_solv.solve() line. Checking all the shared libraries in the cache directory with ldd, both expected MKL libs are present.
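
The ldd check can be scripted roughly as follows. This is only a sketch: check_mkl_links is a hypothetical helper, and the directory to scan is passed in as an argument because the location of the cache directory depends on the installation.

```shell
# Print the MKL-related ldd lines for every shared object under a directory.
# check_mkl_links is an illustrative helper; pass the directory to scan.
check_mkl_links() {
    dir="${1:?usage: check_mkl_links DIR}"
    find "$dir" -type f -name '*.so' | while IFS= read -r lib; do
        printf '%s\n' "$lib"
        # Lines ending in "not found" are libraries the loader cannot resolve.
        ldd "$lib" 2>/dev/null | grep -i 'mkl' \
            || printf '    (no MKL libraries listed)\n'
    done
}
```

Run it with the cache directory as its argument, after loading the same modules used at install time, so that ldd sees the same library search path as the failing job.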

dham commented 4 years ago

Hi @angus-g. It sounds like this is an MKL issue more than it's a Firedrake one, though having the solution documented here may well be useful to other users.

@dacreman has established git repositories in the Firedrake organisation that have build and run scripts for ARCHER and Isambard, two of the UK national supercomputers: https://github.com/firedrakeproject/isambard https://github.com/firedrakeproject/firedrake-archer

We would be very happy to host a similar repo for gadi if you think that would be helpful.

angus-g commented 4 years ago

Thanks @dham, it's certainly a system-specific issue. However, there's a lot on the Firedrake side that I'm not completely across yet, such as the dependency installation, code generation and loading, etc. I thought about trying to test PETSc itself, to see if I can replicate the error.

I think once we get a working build sorted out, having a similar repo for gadi would be really helpful!

ScottMacLachlan commented 4 years ago

I've just encountered this issue in a fresh install on niagara, Compute Canada's resource for "large parallel jobs", and the command-line fix above does seem to work there as well.

@angus-g : did you find a more elegant fix for this?

angus-g commented 4 years ago

Unfortunately not, but I've wrapped Firedrake in an environment module so this can be handled transparently (this shows just the way to activate/deactivate Firedrake, and then environment variables can be exported as usual):

source /opt/Modules/extensions/extensions.tcl
set-basedir -root /g/data/xd2/modules

if { [module-info mode load] || [module-info mode switch2] } {
        puts stdout "source $::basedir/bin/activate;"
} elseif { [module-info mode remove] && ![module-info mode switch3] } {
        puts stdout "deactivate;"
}

ScottMacLachlan commented 4 years ago

Just to complete the documentation of what works for me, I now execute export LD_LIBRARY_PATH=$MKLROOT/lib/intel64/:$LD_LIBRARY_PATH in my job scripts, after loading the MKL module and activating Firedrake. I think the real issue here is simply that the MKL module isn't setting the path correctly.
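
A slightly more defensive version of that export, as a sketch: prepend_mkl_libdir is a hypothetical helper name, and the lib/intel64 subdirectory is assumed from the path above. It avoids leaving a trailing colon when LD_LIBRARY_PATH starts out empty, since an empty entry is, as far as I know, treated by the loader as the current directory.

```shell
# Prepend the MKL library directory to LD_LIBRARY_PATH.
# prepend_mkl_libdir is an illustrative helper, not an MKL or Firedrake command.
prepend_mkl_libdir() {
    root="${1:?usage: prepend_mkl_libdir MKLROOT}"
    # Add the ":" separator only when LD_LIBRARY_PATH is already non-empty.
    LD_LIBRARY_PATH="$root/lib/intel64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}
```

In a job script this would be called as `prepend_mkl_libdir "$MKLROOT"` after the MKL module is loaded and the Firedrake environment is activated, matching the order described above.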

JDBetteridge commented 1 year ago

I'm closing this issue as it is quite old and seems to have a working solution. Feel free to reopen if you disagree.