Open akesandgren opened 7 years ago
And running this command ends up in an "infinite" lmod loop:
eb HDF5-1.8.17-intel-2017.01.eb --try-toolchain=intelcuda,2016.11
Debug log of that attached: inf-loop.zip
The /hpc2n/eb/modules/all/Core/icc/2017.1.132-GCC-5.4.0-2.26.lua module contains only one load/use:
load("GCCcore/5.4.0")
prepend_path("MODULEPATH", "/hpc2n/eb/modules/all/Compiler/intel/2017.1.132-GCC-5.4.0-2.26")
The GCCcore module only contains a "use":
prepend_path("MODULEPATH", "/hpc2n/eb/modules/all/Compiler/GCCcore/5.4.0")
@akesandgren the 2nd issue is really a different problem, so we should split this up in two issues?
I have no clue why the load icc would hang, we'll need to try and get Lmod debug output for it...
@akesandgren Try and see if this allows you to collect Lmod debug output without running into the problem with Lmod debug tripping up module show:
Change this:
if build_option('debug_lmod'):
to this:
if build_option('debug_lmod') and args[0] == 'load':
And then kill the hanging lmod -D python load process rather than cancelling the outer eb process, so EasyBuild captures the stderr/stdout output of the lmod command.
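The suggested change boils down to passing Lmod's -D flag only for load commands. A minimal sketch of that condition, assuming a hypothetical helper named build_lmod_cmd (this is not EasyBuild's actual code; only the condition and the "lmod -D python load" command shape come from this thread):

```python
def build_lmod_cmd(args, debug_lmod=False, lmod="lmod"):
    """Hypothetical helper: assemble an Lmod command line as a list.

    The -D (debug) flag is added only for 'load', so that other
    subcommands such as 'show' keep producing clean output.
    """
    cmd = [lmod]
    # Same condition as suggested above: debug enabled AND a 'load' command.
    if debug_lmod and args and args[0] == "load":
        cmd.append("-D")
    cmd.append("python")
    cmd.extend(args)
    return cmd


print(build_lmod_cmd(["load", "icc"], debug_lmod=True))
print(build_lmod_cmd(["show", "icc"], debug_lmod=True))
```

With debugging enabled, only the first command should carry -D; the show command is left untouched.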
Change it where?
Found it...
That change seems to have solved the "fails when using --debug-lmod" problem. It's getting further ahead this time.
Killing the lmod -D process results in the python process eating memory... LOTS of memory... 140GB of memory or thereabouts...
And errors from the eb process:
== sanity checking...
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/handlers.py", line 77, in emit
    self.doRollover()
  File "/usr/lib/python2.7/logging/handlers.py", line 142, in doRollover
    os.rename(self.baseFilename, dfn)
OSError: [Errno 13] Permission denied
Logged from file modules.py, line 651
Now up at 220GB, killing...
@boegel The debug log from that ended up at 9GB... still want it?
Only 33M bzip:ed
What happens if you do a dry run and keep the tweaked files (--disable-cleanup-tmpdir)? I think you might have a dependency loop being created from the try options.
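One way to check a set of tweaked easyconfigs for such a loop is a depth-first search over their dependency lists. A minimal sketch, assuming the dependencies have already been collected by hand into a plain dict; the function name find_cycle and the example data are illustrative only and are not part of EasyBuild:

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None.

    `deps` maps a name to the list of names it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: that is a loop
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for name in deps:
        if state.get(name, WHITE) == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Illustrative data only, not taken from the real easyconfigs.
no_loop = {"HDF5": ["Szip", "intelcuda"], "Szip": ["intelcuda"], "intelcuda": []}
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(no_loop))
print(find_cycle(loop))
```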
@akesandgren 33MB zipped is doable, although maybe you can try and kill the lmod process sooner? Letting it run for a couple of seconds should be enough...
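Letting a command run for only a couple of seconds while still keeping its output can be scripted. A minimal sketch using Python 3's standard subprocess module; the helper name run_with_deadline is made up for illustration, and the child process below is a stand-in for the hanging lmod command, not the real thing:

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds):
    """Run cmd, kill it after `seconds`, return (stdout, stderr, timed_out)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, err = proc.communicate(timeout=seconds)
        return out, err, False
    except subprocess.TimeoutExpired:
        # Kill only the child, then collect whatever it wrote so far.
        proc.kill()
        out, err = proc.communicate()
        return out, err, True


# Stand-in for a hanging command: writes one line to stderr, then sleeps.
child = [
    sys.executable, "-u", "-c",
    "import sys, time; print('debug line', file=sys.stderr, flush=True); time.sleep(60)",
]
out, err, timed_out = run_with_deadline(child, 2)
print(timed_out, repr(err))
```

Because only the child is killed, the parent still gets to read the output produced before the deadline, which keeps the captured log small.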
@ocaisa The dry-run generates 2 tweaked configs, HDF5 and Szip, but Szip for intelcuda-2016.11 already exists, as can be seen if I do eb -S 'Szip-intelcuda-2016.11'. Both of them look correct compared to the corresponding ones for the pure intel toolchain.
@boegel Here are logs and other stuff that remains from a --debug-lmod --debug run, where I kill lmod early enough to make the logs smaller. hdf5-debug_lmod.logs.zip
OK, but EB will use the tweaked easyconfig for Szip, not the existing one... this is being fixed in PR https://github.com/hpcugent/easybuild-framework/pull/2090
The tweaked Szip config is identical to the existing one except for the buildstats section, so that shouldn't be the problem, I hope.
If you directly try the tweaked HDF5 easyconfig (without --try- options) does it succeed?
Using the tweaked HDF5 config still causes an inf loop on lmod load
Then I suspect there is some problem with either your module naming scheme or the toolchain definition
Yes, very likely. We have HMNS, minimal-toolchain, and recursive-module-unload=True
There might well also be a bug in the intelcuda toolchain definition. It hasn't gotten much testing yet.
It's the "recursive-module-unload" that is the problem. Are you using it everywhere or just for toolchains? If you could see the lmod command being called I think you would figure it out. We only use recursive unload as an easyconfig option for compilers, not as a general option.
We use Lmod families for MPI and Compilers, having the hierarchy removes the need to enforce recursive unloads
We use recursive unload for everything, we like it the way our old tclModules worked.
In a hierarchy, that means that when I unload one module everything else becomes inactive (because the modules that extend the path also get unloaded). For this to work as you would expect, you probably need the PR https://github.com/hpcugent/easybuild-framework/pull/2091 that @bartoldeman is working on.
This is what it looks like here:
ml purge
ml intelcuda
ml

Currently Loaded Modules:
  1) snicenvironment (S)   3) iccifort/2017.1.132-GCC-5.4.0-2.26   5) GCCcore/5.4.0                     7) CUDA/8.0.44       9) imkl/2017.1.132
  2) systemdefault (S)     4) icc/2017.1.132-GCC-5.4.0-2.26        6) ifort/2017.1.132-GCC-5.4.0-2.26   8) impi/2017.1.132  10) intelcuda/2016.11

ml Szip
ml

Currently Loaded Modules:
  1) snicenvironment (S)   4) impi/2017.1.132                 7) GCCcore/5.4.0                       10) imkl/2017.1.132
  2) systemdefault (S)     5) intelcuda/2016.11               8) ifort/2017.1.132-GCC-5.4.0-2.26     11) Szip/2.1
  3) CUDA/8.0.44           6) icc/2017.1.132-GCC-5.4.0-2.26   9) iccifort/2017.1.132-GCC-5.4.0-2.26

ml -Szip
ml

Currently Loaded Modules:
  1) snicenvironment (S)   2) systemdefault (S)   3) CUDA/8.0.44   4) impi/2017.1.132   5) intelcuda/2016.11
And MODULEPATH is left with:
LMOD_DEFAULT_MODULEPATH=/hpc2n/eb/modules/all/Core:/hpc2n/eb/software/modulefiles/Linux:/hpc2n/eb/software/modulefiles/Core:/hpc2n/eb/software/lmod/lmod/modulefiles/Core
MODULEPATH=/hpc2n/eb/modules/all/MPI/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44/impi/2017.1.132:/hpc2n/eb/modules/all/Compiler/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44:/hpc2n/eb/modules/all/Core:/hpc2n/eb/software/modulefiles/Linux:/hpc2n/eb/software/modulefiles/Core:/hpc2n/eb/software/lmod/lmod/modulefiles/Core
That might be the root of the problem, or?
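To see at a glance what is left on top of the defaults, the two variables can be compared entry by entry. A small sketch using the values quoted above; the helper name extra_modulepath_entries is made up for illustration:

```python
def extra_modulepath_entries(modulepath, default_modulepath):
    """Return MODULEPATH entries that are not in the default path, in order."""
    defaults = set(default_modulepath.split(":"))
    return [p for p in modulepath.split(":") if p and p not in defaults]


# Values copied from the output quoted in this thread.
DEFAULT = (
    "/hpc2n/eb/modules/all/Core:/hpc2n/eb/software/modulefiles/Linux:"
    "/hpc2n/eb/software/modulefiles/Core:/hpc2n/eb/software/lmod/lmod/modulefiles/Core"
)
CURRENT = (
    "/hpc2n/eb/modules/all/MPI/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44/impi/2017.1.132:"
    "/hpc2n/eb/modules/all/Compiler/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44:"
    + DEFAULT
)

for entry in extra_modulepath_entries(CURRENT, DEFAULT):
    print(entry)
```

Only the two intel-CUDA directories remain on top of the defaults, while the Compiler/intel/2017.1.132-GCC-5.4.0-2.26 directory that the icc module prepends is no longer there.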
The "solution" in this case is to remove the
load("iccifort/2017.1.132-GCC-5.4.0-2.26")
load("imkl/2017.1.132")
lines from the Szip module.
This makes sense to me (is that what PR #2091 does?) since that Szip module can't be loaded unless the correct iccifort+imkl is already loaded.
Yes, that's what the PR does
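The effect of those two load lines under recursive unload can be modelled in a few lines. This is a toy model, not Lmod's actual logic: the helper name recursive_unload is made up, module names are shortened, and two of the edges (iccifort loading icc and ifort, ifort loading GCCcore) are assumptions rather than something shown in this thread:

```python
def recursive_unload(loaded, loads, target):
    """Unload `target` and, recursively, every module its modulefile loads.

    `loaded` is the ordered list of loaded modules; `loads` maps a module
    to the modules it loads. Returns the modules that remain loaded.
    """
    to_remove = set()
    pending = [target]
    while pending:
        mod = pending.pop()
        if mod in to_remove:
            continue
        to_remove.add(mod)
        pending.extend(loads.get(mod, []))
    return [m for m in loaded if m not in to_remove]


loads = {
    "Szip": ["iccifort", "imkl"],   # the two load lines quoted above
    "iccifort": ["icc", "ifort"],   # ASSUMPTION: not shown in this thread
    "icc": ["GCCcore"],             # from the icc modulefile quoted earlier
    "ifort": ["GCCcore"],           # ASSUMPTION: not shown in this thread
}
loaded = [
    "snicenvironment", "systemdefault", "CUDA", "impi", "intelcuda",
    "icc", "GCCcore", "ifort", "iccifort", "imkl", "Szip",
]
print(recursive_unload(loaded, loads, "Szip"))
```

Under these assumptions, unloading Szip leaves snicenvironment, systemdefault, CUDA, impi and intelcuda, the same five modules as in the ml output after ml -Szip.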
Ok, will do some tests with it...
What happens if you have two applications that have a shared dependency? When one gets unloaded, does the other cease to work correctly? I think Lmod may handle this correctly, but I'm pretty sure Tcl doesn't.
@ocaisa Lmod will mark the 2nd app that requires a dep that gets unloaded as inactive, afaik.
@akesandgren let me know if #2091 fixes your problem, since that would be a very good reason to bump the priority of #2091 and get it included for EB v3.1.0
With default HMNS it would still load imkl, since imkl sits on top of the hierarchy. I am doing tests too for #2091.
But including the load for imkl is fine, right? That wouldn't break things? It's just a dep, not a module that extends $MODULEPATH.
...
Unfortunately PR #2091 doesn't help with my problem: the module files end up identical and I still get an inf loop in lmod load during the HDF5 build with intelcuda.
Here is what is going on for Szip as far as I understand: first of all Szip has no additional dependencies so just the toolchain counts. The intelcuda toolchain module has these dependencies; the one with an x extend MODULEPATH with the default HMNS:
# required for use of iccifortcuda toolchain
'CUDA,icc,ifort': ('intel-CUDA', '%(icc)s-%(CUDA)s'),
); those modules do not have any direct dependencies that are listed here (only GCCcore and binutils), so #2091 does nothing.
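The mapping entry quoted above can be read as a name plus a version template. A minimal sketch of how such a template expands with plain Python %-formatting; the dict name TEMPLATES and the helper compiler_subdir are hypothetical and not EasyBuild's actual code, and the versions are the ones from the module list earlier in this thread:

```python
# Entry quoted in the thread: component names -> (name, version template).
TEMPLATES = {
    "CUDA,icc,ifort": ("intel-CUDA", "%(icc)s-%(CUDA)s"),
}


def compiler_subdir(key, versions):
    """Hypothetical helper: expand a template into '<name>/<version>'."""
    name, template = TEMPLATES[key]
    # Extra keys in `versions` (here: ifort) are simply ignored by %-formatting.
    return "%s/%s" % (name, template % versions)


versions = {
    "icc": "2017.1.132-GCC-5.4.0-2.26",
    "ifort": "2017.1.132-GCC-5.4.0-2.26",
    "CUDA": "8.0.44",
}
print(compiler_subdir("CUDA,icc,ifort", versions))
```

The result, intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44, matches the Compiler/intel-CUDA/... directory seen in the MODULEPATH output earlier.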
I think the solution would be to delete the "iccifort" dependency from the intelcuda toolchain easyconfig. The intel easyconfig does not list iccifort as a dependency (but for some reason does list GCCcore and binutils, which do get eliminated by #2091).
Removing iccifort from the intelcuda easyconfig seems to have solved the problem according to my initial test builds.
And when removing iccifort from the intelcuda dep list, I do not need #2091.
I'll rebuild the modules in my primary sw path with that change and see what happens...
Command used: eb HDF5-1.8.17-intel-2017.01.eb --try-toolchain=intelcuda,2016.11 --debug-lmod --debug
fails with: ERROR: Build of /scratch/eb-o1g7T4/tweaked_easyconfigs/HDF5-1.8.17-intelcuda-2016.11.eb failed (err: "build failed (first 300 chars): Failed to read mpath: /hpc2n/eb/modules/all: [Errno 2] No such file or directory: 'mpath: /hpc2n/eb/modules/all'")