easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0

eb fails when using --debug-lmod with lmod 6.5 #2094

Open akesandgren opened 7 years ago

akesandgren commented 7 years ago

Command used: eb HDF5-1.8.17-intel-2017.01.eb --try-toolchain=intelcuda,2016.11 --debug-lmod --debug

fails with: ERROR: Build of /scratch/eb-o1g7T4/tweaked_easyconfigs/HDF5-1.8.17-intelcuda-2016.11.eb failed (err: "build failed (first 300 chars): Failed to read mpath: /hpc2n/eb/modules/all: [Errno 2] No such file or directory: 'mpath: /hpc2n/eb/modules/all'")

akesandgren commented 7 years ago

debuglog.zip

akesandgren commented 7 years ago

And running this command ends up in an "infinite" lmod loop:

eb HDF5-1.8.17-intel-2017.01.eb --try-toolchain=intelcuda,2016.11

Debug log of that is attached: inf-loop.zip

akesandgren commented 7 years ago

The /hpc2n/eb/modules/all/Core/icc/2017.1.132-GCC-5.4.0-2.26.lua module contains only one load/use:

load("GCCcore/5.4.0")
prepend_path("MODULEPATH", "/hpc2n/eb/modules/all/Compiler/intel/2017.1.132-GCC-5.4.0-2.26")

The GCCcore module only contains a "use":

prepend_path("MODULEPATH", "/hpc2n/eb/modules/all/Compiler/GCCcore/5.4.0")

boegel commented 7 years ago

@akesandgren the 2nd issue is really a different problem, so should we split this up into two issues?

I have no clue why the load icc would hang, we'll need to try and get Lmod debug output for it...

boegel commented 7 years ago

@akesandgren Try and see if this allows you to collect Lmod debug output without running into the problem with Lmod debug tripping up module show:

Change this:

if build_option('debug_lmod'):

to this:

if build_option('debug_lmod') and args[0] == 'load':

And then kill the hanging lmod -D python load process rather than cancelling the outer eb process, so EasyBuild captures the stderr/stdout output of the lmod command.
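The change suggested above can be sketched in isolation as follows. This is a minimal illustration, not EasyBuild's actual code: the function name `build_modulecmd` and its parameters are hypothetical, and only the `lmod -D python load` ordering is taken from this thread. The idea is that the Lmod debug flag is added only for `load`, so that subcommands whose output gets parsed (such as `show`) stay free of debug output.

```python
def build_modulecmd(lmod_cmd, args, debug_lmod=False):
    """Return the command line to run (hypothetical helper, for illustration).

    Lmod's -D flag is only added when the subcommand is 'load', so that
    commands whose output is parsed (e.g. 'show') are not polluted.
    """
    cmd = [lmod_cmd]
    if debug_lmod and args and args[0] == 'load':
        cmd.append('-D')
    cmd.append('python')
    cmd.extend(args)
    return cmd


load_cmd = build_modulecmd('lmod', ['load', 'icc'], debug_lmod=True)
show_cmd = build_modulecmd('lmod', ['show', 'icc'], debug_lmod=True)
print(' '.join(load_cmd))
print(' '.join(show_cmd))
```

With debugging enabled, only the `load` command line contains `-D`; the `show` command line is unchanged.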

akesandgren commented 7 years ago

Change it where?

Found it...

akesandgren commented 7 years ago

That change seems to have solved the "fails when using --debug-lmod" problem. It's getting further ahead this time.

akesandgren commented 7 years ago

Killing the lmod -D process results in the python process eating memory... LOTS of memory... 140GB of memory or thereabouts...

And errors from the eb process:

== sanity checking...
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/handlers.py", line 77, in emit
    self.doRollover()
  File "/usr/lib/python2.7/logging/handlers.py", line 142, in doRollover
    os.rename(self.baseFilename, dfn)
OSError: [Errno 13] Permission denied
Logged from file modules.py, line 651

Now up at 220GB, killing...

akesandgren commented 7 years ago

@boegel The debug log from that ended up at 9GB... still want it?

Only 33 MB when bzipped.

ocaisa commented 7 years ago

What happens if you do a dry run and keep the tweaked files (--disable-cleanup-tmpdir)? I think you might have a dependency loop being created from the try options.


boegel commented 7 years ago

@akesandgren 33MB zipped is doable, although maybe you can try and kill the lmod process sooner? letting it run for a couple of seconds should be enough...

akesandgren commented 7 years ago

@ocaisa The dry run generates 2 tweaked configs, HDF5 and Szip, but Szip for intelcuda-2016.11 already exists, as can be seen if I do eb -S 'Szip-intelcuda-2016.11'. Both of them look correct compared to the corresponding ones for the pure intel toolchain.

akesandgren commented 7 years ago

@boegel Here are the logs and other files left over from a --debug-lmod --debug run, where I killed lmod early enough to keep the logs small: hdf5-debug_lmod.logs.zip

ocaisa commented 7 years ago

OK, but EB will use the tweaked easyconfig for Szip, not the existing one. This is being fixed in PR https://github.com/hpcugent/easybuild-framework/pull/2090

akesandgren commented 7 years ago

The tweaked Szip config is identical to the existing one except for the buildstats section, so that shouldn't be the problem, I hope.

ocaisa commented 7 years ago

If you directly try the tweaked HDF5 easyconfig (without --try- options) does it succeed?

akesandgren commented 7 years ago

Using the tweaked HDF5 config still causes an inf loop on lmod load

ocaisa commented 7 years ago

Then I suspect there is some problem with either your module naming scheme or the toolchain definition

akesandgren commented 7 years ago

Yes, very likely. We have HMNS, minimal-toolchain, and recursive-module-unload=True

There might well also be a bug in the intelcuda toolchain definition. It hasn't gotten much testing yet.

ocaisa commented 7 years ago

It's the "recursive-module-unload" that is the problem. Are you using it everywhere or just for toolchains? If you could see the lmod command being called I think you would figure it out. We only use recursive unload as an easyconfig option for compilers, not as a general option.

ocaisa commented 7 years ago

We use Lmod families for MPI and compilers; having the hierarchy removes the need to enforce recursive unloads.

akesandgren commented 7 years ago

We use recursive unload for everything; we like it the way our old tclModules worked.

ocaisa commented 7 years ago

In a hierarchy, that means that when I unload one module everything else becomes inactive (because the modules that extend the path also get unloaded). For this to work as you would expect, you probably need PR https://github.com/hpcugent/easybuild-framework/pull/2091 that @bartoldeman is working on.

akesandgren commented 7 years ago

This is what it looks like here:

ml purge
ml intelcuda
ml

Currently Loaded Modules:
  1) snicenvironment (S)
  2) systemdefault (S)
  3) iccifort/2017.1.132-GCC-5.4.0-2.26
  4) icc/2017.1.132-GCC-5.4.0-2.26
  5) GCCcore/5.4.0
  6) ifort/2017.1.132-GCC-5.4.0-2.26
  7) CUDA/8.0.44
  8) impi/2017.1.132
  9) imkl/2017.1.132
  10) intelcuda/2016.11

ml Szip
ml

Currently Loaded Modules:
  1) snicenvironment (S)
  2) systemdefault (S)
  3) CUDA/8.0.44
  4) impi/2017.1.132
  5) intelcuda/2016.11
  6) icc/2017.1.132-GCC-5.4.0-2.26
  7) GCCcore/5.4.0
  8) ifort/2017.1.132-GCC-5.4.0-2.26
  9) iccifort/2017.1.132-GCC-5.4.0-2.26
  10) imkl/2017.1.132
  11) Szip/2.1

ml -Szip
ml

Currently Loaded Modules:
  1) snicenvironment (S)
  2) systemdefault (S)
  3) CUDA/8.0.44
  4) impi/2017.1.132
  5) intelcuda/2016.11

And MODULEPATH is left with:

LMOD_DEFAULT_MODULEPATH=/hpc2n/eb/modules/all/Core:/hpc2n/eb/software/modulefiles/Linux:/hpc2n/eb/software/modulefiles/Core:/hpc2n/eb/software/lmod/lmod/modulefiles/Core

MODULEPATH=/hpc2n/eb/modules/all/MPI/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44/impi/2017.1.132:/hpc2n/eb/modules/all/Compiler/intel-CUDA/2017.1.132-GCC-5.4.0-2.26-8.0.44:/hpc2n/eb/modules/all/Core:/hpc2n/eb/software/modulefiles/Linux:/hpc2n/eb/software/modulefiles/Core:/hpc2n/eb/software/lmod/lmod/modulefiles/Core

Might that be the root of the problem?

akesandgren commented 7 years ago

The "solution" in this case is to remove the

load("iccifort/2017.1.132-GCC-5.4.0-2.26")
load("imkl/2017.1.132")

lines from the Szip module.

This makes sense to me (is that what PR #2091 does?) since that Szip module can't be loaded unless the correct iccifort+imkl is already loaded.
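The effect of those two load lines under recursive unloading can be reproduced with a toy model. This is only a sketch and not Lmod's actual algorithm: the dependency lists for iccifort and ifort are assumptions made for illustration, the sticky modules (snicenvironment, systemdefault) are left out, and `load`, `unload` and `session` are hypothetical helper names.

```python
# Toy model of recursive module unloading; NOT Lmod's real implementation.
# Which modules each module load()s. The entries for iccifort and ifort are
# assumptions; the others follow what is described in this thread.
LOADS = {
    "intelcuda": ["iccifort", "CUDA", "icc", "ifort", "impi", "imkl"],
    "iccifort": ["icc", "ifort"],
    "icc": ["GCCcore"],
    "ifort": ["GCCcore"],
    "Szip": ["iccifort", "imkl"],
}


def load(name, loaded, loads):
    """Load a module and everything it load()s."""
    if name in loaded:
        return
    for dep in loads.get(name, []):
        load(dep, loaded, loads)
    loaded.add(name)


def unload(name, loaded, loads):
    """Unload a module and, recursively, everything it load()ed."""
    if name not in loaded:
        return
    loaded.discard(name)
    for dep in loads.get(name, []):
        unload(dep, loaded, loads)


def session(loads):
    """Mimic: ml intelcuda; ml Szip; ml -Szip."""
    loaded = set()
    load("intelcuda", loaded, loads)
    load("Szip", loaded, loads)
    unload("Szip", loaded, loads)
    return loaded


with_loads = session(LOADS)
without_loads = session({**LOADS, "Szip": []})
print(sorted(with_loads))
print(sorted(without_loads))
```

In this model, unloading Szip while it still carries the two load lines also takes iccifort, icc, ifort, GCCcore and imkl with it, leaving only CUDA, impi and intelcuda, as in the session shown earlier. With the load lines removed from Szip, the toolchain modules stay loaded.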

ocaisa commented 7 years ago

Yes, that's what the PR does

akesandgren commented 7 years ago

Ok, will do some tests with it...

ocaisa commented 7 years ago

What happens if you have two applications that have a shared dependency? When one gets unloaded, does the other cease to work correctly? I think Lmod may handle this correctly, but I'm pretty sure Tcl doesn't.

boegel commented 7 years ago

@ocaisa Lmod will mark the 2nd app that requires a dep that gets unloaded as inactive, afaik.

boegel commented 7 years ago

@akesandgren let me know if #2091 fixes your problem, since that would be a very good reason to bump the priority of #2091 and get it included for EB v3.1.0

bartoldeman commented 7 years ago

With the default HMNS it would still load imkl, since imkl sits on top of the hierarchy. I am doing tests for #2091 too.

boegel commented 7 years ago

but including the load for imkl is fine, right, that wouldn't break things? it's just a dep, not a module that extends $MODULEPATH...

akesandgren commented 7 years ago

Unfortunately PR #2091 doesn't fix my problem: the module files end up identical and I still get an inf loop in lmod load during the HDF5 build with intelcuda.

bartoldeman commented 7 years ago

Here is what is going on for Szip as far as I understand: first of all, Szip has no additional dependencies, so just the toolchain counts. The intelcuda toolchain module has these dependencies; the ones marked with an x extend MODULEPATH with the default HMNS:

  1. iccifort
  2. CUDA x
  3. icc x
  4. ifort x
  5. impi x
  6. imkl

The "x" dependencies are eliminated (because hierarchical_mns.py has this:

    # required for use of iccifortcuda toolchain
    'CUDA,icc,ifort': ('intel-CUDA', '%(icc)s-%(CUDA)s'),

); those modules do not have any direct dependencies that are listed here (only GCCcore and binutils), so #2091 does nothing.
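To illustrate how the mapping quoted from hierarchical_mns.py relates to the MODULEPATH shown earlier in this thread, here is a minimal sketch. The dictionary name `COMP_NAME_VERSION_TEMPLATES` and the `Compiler/` prefix handling are assumptions made for this example; only the mapping entry itself and the version strings come from the thread.

```python
import os

# Mapping entry as quoted from hierarchical_mns.py; the variable name used
# here is an assumption for illustration.
COMP_NAME_VERSION_TEMPLATES = {
    # required for use of iccifortcuda toolchain
    'CUDA,icc,ifort': ('intel-CUDA', '%(icc)s-%(CUDA)s'),
}

# Versions of the loaded modules, taken from the module listing in this thread.
versions = {
    'icc': '2017.1.132-GCC-5.4.0-2.26',
    'CUDA': '8.0.44',
}

name, version_template = COMP_NAME_VERSION_TEMPLATES['CUDA,icc,ifort']
# Fill in the template with plain Python %-formatting.
subdir = os.path.join('Compiler', name, version_template % versions)
print(subdir)
```

The resulting subdirectory corresponds to the `Compiler/intel-CUDA/...` entry in the MODULEPATH above, which is why CUDA, icc and ifort together count as the modules that extend MODULEPATH.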

I think the solution would be to delete the "iccifort" dependency from the intelcuda toolchain easyconfig. The intel easyconfig does not list iccifort as a dependency (but for some reason does list GCCcore and binutils, which do get eliminated by #2091).

akesandgren commented 7 years ago

Removing iccifort from the intelcuda easyconfig seems to have solved the problem according to my initial test builds.

And with iccifort removed from the intelcuda dep list, I do not need #2091.

I'll rebuild the modules in my primary sw path with that change and see what happens...