EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

Teething issue with EESSI module file #694

Open ocaisa opened 2 months ago

ocaisa commented 2 months ago

So, it seems that Lmod can't search inside EESSI until the module is loaded:

alanc@~$ module purge
alanc@~$ module av

------------------------------------------------------------------ /cvmfs/software.eessi.io/init/modules -------------------------------------------------------------------
   EESSI/2023.06

-------------------------------------------------------------------- /home/alanc/EasyBuild_Git/EB_Devel --------------------------------------------------------------------
   Devel (S)    test2

--------------------------------------------------------------------- /usr/share/lmod/lmod/modulefiles ---------------------------------------------------------------------
   Core/lmod    Core/settarg (D)

  Where:
   S:  Module is Sticky, requires --force to unload or purge
   D:  Default Module

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

alanc@~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

alanc@~$ module load EESSI
EESSI/2023.06 loaded successfully

alanc@~$ module spider GROMACS

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  GROMACS: GROMACS/2024.1-foss-2023b
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
...

ocaisa commented 2 months ago

That may be expected, but it makes finding a particular version of a particular package a bit more complicated.

bedroge commented 2 months ago

I think that's the consequence of https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/init/modules/EESSI/2023.06.lua#L62 (but I agree that it's annoying...).

ocaisa commented 2 months ago

Hmm, this is tricky to solve. We want dynamic cache support but we don't want Lmod to try to update the existing cache. Is there a way to do this with the time limits on cache files?

ocaisa commented 2 months ago

Maybe dynamic cache support "just works". I'll give it a try.

ocaisa commented 2 months ago

No, I tried haveDynamicMPATH() and it didn't seem to do what I wanted.

boegel commented 2 months ago

What if we create a spider cache for all the EESSI/* modules?

But then Lmod has to be configured to know where to find it, I guess...

ocaisa commented 2 months ago

I'm not sure that is what we want: wouldn't Lmod then report the possibilities for every architecture? We may have to craft an overall solution with the help of the Lmod BDFL.

boegel commented 2 months ago

Let's tag him then: @rtmclay

rtmclay commented 2 months ago

With our modulefiles sitting on SSD, we are moving away from having system spider cache files. I don't know how well the caching works with CVMFS. So maybe you don't need system spider caches at all. This is something you guys need to check.

If you do need spider cache files then I would recommend having a spider cache file for each arch.

You might also be able to use the ideas in https://lmod.readthedocs.io/en/latest/350_community.html to handle the various arch's.

ocaisa commented 2 months ago

We do indeed have a spider cache per architecture, and we protect these paths with

if ( mode() ~= "spider" ) then
    prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))

The problem is that if you do a spider search for a software package, Lmod can only see the packages under the new module path after the module is loaded. I guess the issue is not really the protected MODULEPATH, but that LMOD_RC is only updated after the module is loaded.

EDIT: No, that doesn't seem to be it:

ocaisa@LAPTOP-O6HF2IKC:~$ module purge
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

ocaisa@LAPTOP-O6HF2IKC:~$ export LMOD_RC=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/.lmod/lmodrc.lua
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

ocaisa@LAPTOP-O6HF2IKC:~$ module load EESSI
EESSI/2023.06 loaded successfully
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS

------------------------------------ 
 GROMACS: GROMACS/2024.1-foss-2023b
------------------------------------
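
The experiment above can be made repeatable across architectures with a small helper that builds the lmodrc.lua path instead of typing it out. This is only a sketch: the path layout is copied from the export line in the session, the function names are made up for illustration, and EESSI_CVMFS_ROOT is an assumed override (not an EESSI variable) that exists only so the helper can be tried without CVMFS. As the session shows, pointing LMOD_RC at the file was not enough on its own to make module spider work.

```shell
# Hypothetical helper: build the path of the per-architecture lmodrc.lua,
# following the layout of the path exported in the session above.
eessi_lmodrc_path() {
    local version="$1"      # e.g. 2023.06
    local cpu_subdir="$2"   # e.g. x86_64/intel/skylake_avx512
    # EESSI_CVMFS_ROOT is an assumed override used only for testing.
    local root="${EESSI_CVMFS_ROOT:-/cvmfs/software.eessi.io}"
    printf '%s/versions/%s/software/linux/%s/.lmod/lmodrc.lua\n' \
        "$root" "$version" "$cpu_subdir"
}

# Report whether that file is readable before pointing LMOD_RC at it.
eessi_check_lmodrc() {
    local rc
    rc="$(eessi_lmodrc_path "$1" "$2")"
    if [ -r "$rc" ]; then
        echo "found: $rc"
    else
        echo "missing: $rc"
        return 1
    fi
}
```

For example, eessi_check_lmodrc 2023.06 x86_64/intel/skylake_avx512 reports whether the file used in the session exists on the current machine.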

ocaisa commented 2 months ago

Is it that, in our setup with a variable MODULEPATH, Lmod doesn't know how MODULEPATH gets extended? If so, defining LMOD_RC and using haveDynamicMPATH() might give us something that works.

EDIT: That didn't work for me either

rtmclay commented 2 months ago

Lmod only reports modules that are in the modulepath or can be found by walking the tree. This code:

if ( mode() ~= "spider" ) then
    prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))

prevents Lmod from knowing about eessi_module_path when spidering.

It seems to me that you hide modules or you show them. Can you explain exactly what you want with a simple module tree? If you are going to compute spider caches, why not provide them all the time?

ocaisa commented 2 months ago

Ok, I think I see the issue now. In our case the spider cache and the module path are both architecture dependent (so both could be considered to need haveDynamicMPATH()). The main problem is that we inform Lmod about that cache in the same module file that extends the module path, and that seems to be too late for Lmod to actually be aware of the cache.

To test this, I split the module file in two: the first does everything except add to the MODULEPATH, and the second loads the first and only sets the MODULEPATH (and is not protected). Both of these use haveDynamicMPATH() and they do seem to give the behaviour we want. The use of haveDynamicMPATH() does introduce a reliance on Lmod 8.7.4+ (which is relatively recent, but we can advise people how to work around it).

We could make a big fat spider cache, which would remove the need for the base module to be dynamic, but that probably has its own downsides.

EDIT:

This is not entirely correct as my session was messed up a little by me tweaking the module file on the fly. The EESSI module file is now:

-- Load all the EESSI environment settings from the matching base
-- module file (which is hidden), including identifying the
-- architecture and setting the appropriate Lmod spider cache to use
always_load(pathJoin('base', '.' .. myModuleVersion()))
-- Add the modulepaths we want
prepend_path("MODULEPATH", os.getenv("EESSI_MODULEPATH"))
prepend_path("MODULEPATH", os.getenv("EESSI_SITE_MODULEPATH"))
haveDynamicMPATH()
if mode() == "load" then
    LmodMessage("EESSI/" .. myModuleVersion() .. " loaded successfully")
end

with the general setup being

ocaisa@LAPTOP-O6HF2IKC:~$ module --show-hidden avail

--------------- /home/ocaisa/software-layer/init/modules ---------------
   base/.2023.06 (H)    EESSI/2023.06

  Where:
   H:  Hidden Module

Trying to search within this context I get:

ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

ocaisa@LAPTOP-O6HF2IKC:~$ module load base/.2023.06
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS

--------------------------------------------------------------------
  GROMACS: GROMACS/2024.1-foss-2023b
--------------------------------------------------------------------
    Description:
...

so it seems the cache file must be visible to Lmod when the module spider command is invoked; it cannot be added in the dynamic way we want.
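
When comparing the failing and working cases above, it helps to dump what Lmod has to work with at the moment module spider is invoked. The sketch below only reads LMOD_RC and MODULEPATH and checks whether each entry exists on disk; it does not call Lmod, and both function names are made up for illustration.

```shell
# Print each entry of a colon-separated list and whether it exists on disk.
list_path_entries() {
    local name="$1"
    local value="$2"
    echo "$name:"
    if [ -z "$value" ]; then
        echo "  (unset or empty)"
        return 0
    fi
    printf '%s\n' "$value" | tr ':' '\n' | while IFS= read -r entry; do
        # Skip empty fields produced by leading, trailing or doubled colons.
        [ -n "$entry" ] || continue
        if [ -e "$entry" ]; then
            echo "  [ok]      $entry"
        else
            echo "  [missing] $entry"
        fi
    done
}

# Show the state that decides what `module spider` can see.
lmod_spider_context() {
    list_path_entries LMOD_RC "${LMOD_RC:-}"
    list_path_entries MODULEPATH "${MODULEPATH:-}"
}
```

Running lmod_spider_context once after module purge and once after module load base/.2023.06 should make the difference between the two spider calls above visible.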

ocaisa commented 1 month ago

I was on a system where TCL modules were available, but I needed Lmod, so I tried to initialise it. This led to issues when running scripts (as their module tool was initialising itself on top of Lmod in the subshell), so I tried to create a bash function that can be called within scripts to check that EESSI and Lmod are available. When testing this on my local machine (which has Lmod), I saw:

alanc@~$ type module
module is a function
module () 
{ 
    local __lmod_my_status;
    local __lmod_sh_dbg;
    if [ -z "${LMOD_SH_DBG_ON+x}" ]; then
        case "$-" in 
            *v*x*)
                __lmod_sh_dbg='vx'
            ;;
            *v*)
                __lmod_sh_dbg='v'
            ;;
            *x*)
                __lmod_sh_dbg='x'
            ;;
        esac;
    fi;
    if [ -n "${__lmod_sh_dbg:-}" ]; then
        set +$__lmod_sh_dbg;
        echo "Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output" 1>&2;
    fi;
    eval "$($LMOD_CMD bash "$@")" && eval $(${LMOD_SETTARG_CMD:-:} -s sh);
    __lmod_my_status=$?;
    if [ -n "${__lmod_sh_dbg:-}" ]; then
        echo "Shell debugging restarted" 1>&2;
        set -$__lmod_sh_dbg;
    fi;
    return $__lmod_my_status
}
alanc@~$ echo $LMOD_CMD 
/usr/share/lmod/lmod/libexec/lmod
alanc@~$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash 
Lmod has detected the following error:  The following module(s) are unknown: "EESSI/2023.06"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "EESSI/2023.06"

Also make sure that all modulefiles written in TCL start with the string #%Module

alanc@~$ module av

---------------------------------------------------------- /cvmfs/software.eessi.io/versions/2023.06/init/modules ----------------------------------------------------------
   EESSI/2023.06

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

and doing a module reset gives a hint as to what goes wrong:

alanc@~$ module reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: /cvmfs/software.eessi.io/versions/2023.06/init/modules
Lmod has detected the following error:  The following module(s) are unknown: "EESSI/2023.06"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "EESSI/2023.06"

Also make sure that all modulefiles written in TCL start with the string #%Module

but I don't understand why this is happening. We must be missing something in the configuration of Lmod? @MaKaNu ?
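
For reference, the check function mentioned at the start of this comment could look roughly like the sketch below. It is not the function that was actually written: the name check_lmod_and_eessi is made up, and EESSI_CVMFS_REPO is an assumed override used only so the check can be tried without CVMFS. It relies on the fact, visible in the type module output above, that Lmod's module function calls $LMOD_CMD.

```shell
# Return 0 only if `module` looks like Lmod's and the EESSI repository is reachable.
# EESSI_CVMFS_REPO is an assumed override for testing, not an EESSI variable.
check_lmod_and_eessi() {
    local repo="${EESSI_CVMFS_REPO:-/cvmfs/software.eessi.io}"

    # Some module tool must be initialised in this (sub)shell.
    if ! type module >/dev/null 2>&1; then
        echo "no 'module' command in this shell" >&2
        return 1
    fi
    # Lmod's module function runs $LMOD_CMD, so it must point at an executable.
    if [ -z "${LMOD_CMD:-}" ] || [ ! -x "${LMOD_CMD}" ]; then
        echo "'module' is defined but LMOD_CMD is not a usable Lmod" >&2
        return 2
    fi
    # The EESSI repository must be mounted.
    if [ ! -d "$repo/versions" ]; then
        echo "EESSI repository not found under $repo" >&2
        return 3
    fi
    return 0
}
```

A script could then start with check_lmod_and_eessi || exit 1. Note that this does not detect another module tool re-initialising itself on top of Lmod later in a subshell.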

MaKaNu commented 1 month ago

My local test VM does not have Lmod installed locally, but I observed the following behavior:

$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
EESSI/2023.06 loaded successfully

Okay, so far so good. As expected, ml av now shows the EESSI architecture-specific modules as well as the init modules:

------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/haswell/modules/all -------------------------------------------------------------
   Abseil/20230125.2-GCCcore-12.2.0                   HMMER/3.4-gompi-2023a                               OpenEXR/3.2.0-GCCcore-13.2.0                     (D)
   Abseil/20230125.3-GCCcore-12.3.0            (D)    HPL/2.3-foss-2023b                                  OpenFOAM/v2312-foss-2023a
.
.
.
------------------------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/init/modules ------------------------------------------------------------------------------
   EESSI/2023.06 (L)

If I now try to reset:

$ ml reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
EESSI/2023.06 loaded successfully

So here it seems that our EESSI module loaded again and is not removed from $MODULEPATH. I am not sure if this is the behavior we had intended. On the other hand ml unload EESSI/2023.06 works as expected.

@ocaisa Is it enough to install lmod locally or do I also need TCL modules to reproduce your behavior?

ocaisa commented 1 month ago

I believe a local Lmod installation is enough to test this. I do have something in my .bashrc:

. /usr/share/lmod/lmod/init/bash
module use /home/alanc/EasyBuild_Git/EB_Devel

but I don't see how that would get triggered

ocaisa commented 1 month ago

I think the key is "Resetting modules to system default"... but how can we see what the system defaults are?

MaKaNu commented 1 month ago

but I don't see how that would get triggered

This might be the reason why the module EESSI/2023.06 is already available? In my scenario I just sourced the Lmod init, as you do in your .bashrc, but without loading any module. In that case ml av doesn't work, since MODULEPATH is not set.

So I checked what happens if I now load the EESSI module and repeat your steps: same result as before.

I come to the same conclusion, 'Resetting modules to system default' is the difficult part.

What might be a solution:

While writing I realized we set our MODULEPATH and do append. Maybe this is also an issue

ocaisa commented 1 month ago

I wonder: do we really need a default set if it is going to be problematic? In the initialisation scripts we can just explicitly load the module.

MaKaNu commented 1 month ago

So you mean instead we should skip the LMOD init? Yes, I think so.

ocaisa commented 1 month ago

I mocked something up for bash:

# Purge any modules before we start
if type module &> /dev/null; then
    module purge
fi

# Choose an EESSI version
EESSI_VERSION="${EESSI_VERSION:-2023.06}"

# Initialise Lmod for the shell
. /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/compat/linux/$(uname -m)/usr/share/Lmod/init/bash

# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
    module unuse "${MODULEPATH}"
fi

# Path to top-level module tree
module use /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/init/modules
module load EESSI/$EESSI_VERSION

MaKaNu commented 1 month ago

# Purge any modules before we start
if type module &> /dev/null; then
    module purge
fi

Seems fine for unloading all existing loaded modules.

# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
    module unuse "${MODULEPATH}"
fi

Not mixing sounds good, but this does not allow for resetting to whatever the user had active before.

If this is fine for the moment, we could address it first and find a solution for resetting to the default later. Since not being able to reset is not a regression, we could go ahead with this approach.

Further, we might want to extend our test cases.
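
One possible way to keep the "no mixing" behaviour while still being able to get back to what the user had before is to record MODULEPATH before it is unused. The sketch below is only an idea, not part of the mock-up: EESSI_SAVED_MODULEPATH and both function names are hypothetical. It restores entries with module use --append so that they keep their original relative order.

```shell
# Remember the user's MODULEPATH before the EESSI init replaces it.
# EESSI_SAVED_MODULEPATH is a hypothetical variable name, not used by EESSI.
eessi_save_modulepath() {
    # Only save once, so re-sourcing the init script keeps the first value.
    if [ -z "${EESSI_SAVED_MODULEPATH+x}" ]; then
        export EESSI_SAVED_MODULEPATH="${MODULEPATH:-}"
    fi
}

# Put the saved entries back, one `module use --append` per directory.
eessi_restore_modulepath() {
    local rest="${EESSI_SAVED_MODULEPATH:-}"
    local dir
    while [ -n "$rest" ]; do
        dir="${rest%%:*}"                 # text before the first colon
        if [ "$rest" = "$dir" ]; then
            rest=""                       # that was the last entry
        else
            rest="${rest#*:}"             # drop the entry and its colon
        fi
        if [ -n "$dir" ]; then
            module use --append "$dir"
        fi
    done
    unset EESSI_SAVED_MODULEPATH
}
```

In the mock-up, eessi_save_modulepath would be called just before the module unuse step, and a user could call eessi_restore_modulepath after unloading EESSI.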

MaKaNu commented 1 month ago

Also, this is logical in hindsight, but it surprised me:

$ module reset
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

This also means EESSI/2023.06 will not be unloaded.
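
Since the message above names LMOD_SYSTEM_DEFAULT_MODULES as the source of the system default, a quick way to see what module reset would restore is to print that variable one module per line. This is a small sketch with a made-up function name; as far as I know the variable is a colon-separated list, and the sketch also splits on commas and spaces in case those are used as separators.

```shell
# List the modules that `module reset` would restore, one per line.
# Reads LMOD_SYSTEM_DEFAULT_MODULES only; it does not call Lmod.
lmod_system_defaults() {
    if [ -z "${LMOD_SYSTEM_DEFAULT_MODULES:-}" ]; then
        echo "(none: LMOD_SYSTEM_DEFAULT_MODULES is empty)"
        return 0
    fi
    # Split on colon, comma and space, then drop empty lines.
    printf '%s\n' "$LMOD_SYSTEM_DEFAULT_MODULES" | tr ':, ' '\n\n\n' | sed '/^$/d'
}
```

Running this before and after sourcing an init script shows whether that script sets a default, which is a way to approach the earlier question of how to see what the system defaults are.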