Open ocaisa opened 2 months ago
That's maybe as expected, but it makes finding a particular version of a particular package a bit more complicated
I think that's the consequence of https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/init/modules/EESSI/2023.06.lua#L62 (but I agree that it's annoying...).
Hmm, this is tricky to solve. We want dynamic cache support but we don't want Lmod to try to update the existing cache. Is there a way to do this with the time limits on cache files?
Maybe dynamic cache support "just works", I'll give it a try
No, I tried haveDynamicMPATH() and it didn't seem to do what I wanted.
What if we create a spider cache for all the EESSI/* modules?
But then Lmod has to be configured to know where to find it, I guess...
Not sure that is what we want: wouldn't Lmod then report the possibilities for every architecture? We may have to craft an overall solution with the help of the Lmod BDFL.
Let's tag him then: @rtmclay
With our modulefiles sitting on SSD, we are moving away from having system spider cache files. I don't know how well the caching works with CVMFS. So maybe you don't need system spider caches at all. This is something you guys need to check.
If you do need spider cache files then I would recommend having a spider cache file for each arch.
You might also be able to use the ideas in https://lmod.readthedocs.io/en/latest/350_community.html to handle the various architectures.
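As a rough illustration of the "one spider cache per arch" suggestion, the dry-run sketch below only prints one cache-update command per architecture rather than running anything. The function name is hypothetical, the .lmod/cache and timestamp locations are assumptions, and the update_lmod_system_cache_files options (-d for the cache directory, -t for the timestamp file) should be checked against the Lmod version in use.

```bash
#!/usr/bin/env bash
# Dry run only: print the cache-update command for each architecture.
# Assumptions (not from this thread): the cache lives in <arch>/.lmod/cache
# and the timestamp file sits next to it.
print_spider_cache_commands() {
    local software_root="$1"; shift
    local arch arch_dir
    for arch in "$@"; do
        arch_dir="${software_root}/${arch}"
        # The modules/all layout matches the avail output shown in this thread.
        printf 'update_lmod_system_cache_files -d %s -t %s %s\n' \
            "${arch_dir}/.lmod/cache" \
            "${arch_dir}/.lmod/cache/timestamp" \
            "${arch_dir}/modules/all"
    done
}
```

For example, print_spider_cache_commands /cvmfs/software.eessi.io/versions/2023.06/software/linux x86_64/intel/haswell x86_64/intel/skylake_avx512 would print two commands, one per architecture.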
We do indeed have a spider cache per architecture, and we protect these paths with
if ( mode() ~= "spider" ) then
prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))
The problem is that if you do a spider search for a software package, Lmod can only see the packages under the new module path after the module is loaded. I guess the issue is not really the protected MODULEPATH, but that LMOD_RC is only updated after the module is loaded.
EDIT: No, that doesn't seem to be it:
ocaisa@LAPTOP-O6HF2IKC:~$ module purge
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error: Unable to find: "GROMACS".
ocaisa@LAPTOP-O6HF2IKC:~$ export LMOD_RC=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/.lmod/lmodrc.lua
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error: Unable to find: "GROMACS".
ocaisa@LAPTOP-O6HF2IKC:~$ module load EESSI
EESSI/2023.06 loaded successfully
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
------------------------------------
GROMACS: GROMACS/2024.1-foss-2023b
------------------------------------
Is it that, in our setup with a variable MODULEPATH, Lmod doesn't know how the MODULEPATH gets prepended to, so if we define LMOD_RC and use haveDynamicMPATH() we might get something that works?
EDIT: That didn't work for me either.
Lmod only reports modules that are in the modulepath or can be found by walking the tree. This code:
if ( mode() ~= "spider" ) then
prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))
prevents Lmod from knowing about eessi_module_path when spidering.
It seems to me that you hide modules or you show them. Can you explain exactly what you want with a simple module tree? If you are going to compute spider caches, why not provide them all the time?
Ok, I think I see the issue now. In our case the spider cache and the module path are architecture dependent (so both could be considered to need haveDynamicMPATH()). The main problem is that we inform Lmod about that cache in the same module file in which we extend the module path, and that seems to be too late for Lmod to actually be aware of the cache.
To test this, I split the module file in two: the first does everything except add to the MODULEPATH, the second loads the first and only extends the MODULEPATH (and is not protected). Both of these use haveDynamicMPATH() and they do seem to give the behaviour we want. The use of haveDynamicMPATH() does introduce a reliance on Lmod 8.7.4+ (which is relatively recent, but we can advise people how to work around it).
We could make a big fat spider cache, which would remove the need for the base module to be dynamic, but that probably has its own downsides.
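Since haveDynamicMPATH() ties us to Lmod 8.7.4 or newer, an init script could check the version before relying on it. A minimal sketch, assuming Lmod exports its version in LMOD_VERSION; both function names are hypothetical and only plain numeric dotted versions are handled:

```bash
#!/usr/bin/env bash
# Return 0 if dotted version $1 is >= dotted version $2 (numeric fields only).
version_ge() {
    local IFS=.
    local -a have=($1) want=($2)
    local i h w
    for ((i = 0; i < ${#have[@]} || i < ${#want[@]}; i++)); do
        h=${have[i]:-0}
        w=${want[i]:-0}
        # Force base 10 so fields such as "08" are not read as octal.
        if ((10#$h > 10#$w)); then return 0; fi
        if ((10#$h < 10#$w)); then return 1; fi
    done
    return 0
}

# Assumption: LMOD_VERSION is set by the Lmod initialisation in use.
lmod_supports_dynamic_mpath() {
    local v="${LMOD_VERSION:-}"
    [ -n "$v" ] && version_ge "$v" "8.7.4"
}
```

A script could then print a warning and fall back to explicitly loading the module when lmod_supports_dynamic_mpath fails.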
EDIT:
This is not entirely correct as my session was messed up a little by me tweaking the module file on the fly. The EESSI module file is now:
-- Load all the EESSI environment settings from the matching base
-- module file (which is hidden), including identifying the
-- architecture and setting the appropriate Lmod spider cache to use
always_load(pathJoin('base', '.' .. myModuleVersion()))
-- Add the modulepaths we want
prepend_path("MODULEPATH", os.getenv("EESSI_MODULEPATH"))
prepend_path("MODULEPATH", os.getenv("EESSI_SITE_MODULEPATH"))
haveDynamicMPATH()
if mode() == "load" then
LmodMessage("EESSI/" .. myModuleVersion() .. " loaded successfully")
end
with the general setup being
ocaisa@LAPTOP-O6HF2IKC:~$ module --show-hidden avail
--------------- /home/ocaisa/software-layer/init/modules ---------------
base/.2023.06 (H) EESSI/2023.06
Where:
H: Hidden Module
Trying to search within this context I get:
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error: Unable to find:
"GROMACS".
ocaisa@LAPTOP-O6HF2IKC:~$ module load base/.2023.06
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
--------------------------------------------------------------------
GROMACS: GROMACS/2024.1-foss-2023b
--------------------------------------------------------------------
Description:
...
so it seems the cache file must be visible to Lmod when the module spider command is invoked; it cannot be added in the dynamic way we want.
I was on a system where TCL modules were available, but I needed Lmod so I tried to initialise it. This led to issues when running scripts (as their module tool was initialising itself on top of Lmod in the subshell), so I tried to create a bash function that can be called within scripts to check that EESSI and Lmod are available. When testing this on my local machine (which has Lmod), I saw:
alanc@~$ type module
module is a function
module ()
{
local __lmod_my_status;
local __lmod_sh_dbg;
if [ -z "${LMOD_SH_DBG_ON+x}" ]; then
case "$-" in
*v*x*)
__lmod_sh_dbg='vx'
;;
*v*)
__lmod_sh_dbg='v'
;;
*x*)
__lmod_sh_dbg='x'
;;
esac;
fi;
if [ -n "${__lmod_sh_dbg:-}" ]; then
set +$__lmod_sh_dbg;
echo "Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output" 1>&2;
fi;
eval "$($LMOD_CMD bash "$@")" && eval $(${LMOD_SETTARG_CMD:-:} -s sh);
__lmod_my_status=$?;
if [ -n "${__lmod_sh_dbg:-}" ]; then
echo "Shell debugging restarted" 1>&2;
set -$__lmod_sh_dbg;
fi;
return $__lmod_my_status
}
alanc@~$ echo $LMOD_CMD
/usr/share/lmod/lmod/libexec/lmod
alanc@~$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
Lmod has detected the following error: The following module(s) are unknown: "EESSI/2023.06"
Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore_cache load "EESSI/2023.06"
Also make sure that all modulefiles written in TCL start with the string #%Module
alanc@~$ module av
---------------------------------------------------------- /cvmfs/software.eessi.io/versions/2023.06/init/modules ----------------------------------------------------------
EESSI/2023.06
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
and doing a module reset gives a hint as to what goes wrong:
alanc@~$ module reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: /cvmfs/software.eessi.io/versions/2023.06/init/modules
Lmod has detected the following error: The following module(s) are unknown: "EESSI/2023.06"
Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore_cache load "EESSI/2023.06"
Also make sure that all modulefiles written in TCL start with the string #%Module
but I don't understand why this is happening. We must be missing something in the configuration of Lmod? @MaKaNu ?
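For reference, the kind of check function mentioned above (verifying from within a script that Lmod, and not another module tool, is active and that EESSI is mounted) could be sketched as follows. The function name and return codes are hypothetical. The test on the function body relies on Lmod's module function referencing $LMOD_CMD, as in the type module output shown earlier:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if `module` is Lmod's shell function
# and the EESSI repository is visible.
# Return codes: 1 = no module function, 2 = not Lmod, 3 = EESSI missing.
eessi_lmod_available() {
    local eessi_root="${1:-/cvmfs/software.eessi.io}"

    if [ "$(type -t module 2>/dev/null)" != "function" ]; then
        echo "module is not defined as a shell function" >&2
        return 1
    fi
    # Another module tool may have redefined `module` on top of Lmod,
    # so inspect the function body rather than trusting the environment alone.
    if ! declare -f module | grep -q 'LMOD_CMD'; then
        echo "module function does not call LMOD_CMD; not Lmod" >&2
        return 2
    fi
    if [ -z "${LMOD_CMD:-}" ] || [ ! -x "${LMOD_CMD}" ]; then
        echo "LMOD_CMD is unset or not executable" >&2
        return 2
    fi
    if [ ! -d "${eessi_root}/versions" ]; then
        echo "EESSI repository not found under ${eessi_root}" >&2
        return 3
    fi
    return 0
}
```

A script would call eessi_lmod_available before any module command and bail out (or re-initialise Lmod) on a non-zero return.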
My local test VM does not have Lmod installed locally, but I observed the following behavior:
$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
EESSI/2023.06 loaded successfully
Okay, so far so good. ml av now shows, as expected, the EESSI arch modules and also the init modules:
------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/haswell/modules/all -------------------------------------------------------------
Abseil/20230125.2-GCCcore-12.2.0 HMMER/3.4-gompi-2023a OpenEXR/3.2.0-GCCcore-13.2.0 (D)
Abseil/20230125.3-GCCcore-12.3.0 (D) HPL/2.3-foss-2023b OpenFOAM/v2312-foss-2023a
.
.
.
------------------------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/init/modules ------------------------------------------------------------------------------
EESSI/2023.06 (L)
If I now try to reset:
$ ml reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
EESSI/2023.06 loaded successfully
So here it seems that our EESSI module was loaded again and is not removed from $MODULEPATH. I am not sure if this is the behavior we intended. On the other hand, ml unload EESSI/2023.06 works as expected.
@ocaisa Is it enough to install lmod locally or do I also need TCL modules to reproduce your behavior?
I believe a local Lmod installation is enough to test this. I do have something in my .bashrc:
. /usr/share/lmod/lmod/init/bash
module use /home/alanc/EasyBuild_Git/EB_Devel
but I don't see how that would get triggered
I think the key is "Resetting modules to system default" ...but how do we see what the system defaults are?
but I don't see how that would get triggered
This might be the reason why module EESSI/2023.06 is already available? In my scenario I just sourced the Lmod init the same way you did in your .bashrc but without loading any module. In that case ml av doesn't work since MODULEPATH is not set.
So I checked what happens if I now load the EESSI module and repeat your steps again: same result as before.
I come to the same conclusion, 'Resetting modules to system default' is the difficult part.
What might be a solution:
While writing I realized we set our MODULEPATH and do append. Maybe this is also an issue
Do we really need a default set? I wonder if it is going to be problematic. In the initialisation scripts we can just explicitly load the module.
So you mean instead we should skip the LMOD init? Yes, I think so.
I mocked something up for bash:
# Purge any modules before we start
if type module &> /dev/null; then
module purge
fi
# Choose an EESSI version
EESSI_VERSION="${EESSI_VERSION:-2023.06}"
# Initialise Lmod for the shell
. /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/compat/linux/$(uname -m)/usr/share/Lmod/init/bash
# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
module unuse "${MODULEPATH}"
fi
# Path to top-level module tree
module use /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/init/modules
module load EESSI/$EESSI_VERSION
# Purge any modules before we start
if type module &> /dev/null; then
module purge
fi
Seems fine for unloading all existing loaded modules.
# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
module unuse "${MODULEPATH}"
fi
Not mixing sounds good, but this does not allow for resetting to whatever the user had active before.
If this is fine for the moment, we could address it first and find a solution for resetting to the default in a later step. Since not being able to reset is not a regression, we could go ahead with this.
Further, we might want to extend our test cases.
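One possible way to keep the "no mixing" behaviour while still being able to get back to what the user had before is to remember the old MODULEPATH before removing it. A minimal sketch: the function names and the EESSI_SAVED_MODULEPATH variable are hypothetical and not part of Lmod or EESSI, and the entries are removed one at a time rather than passing the whole colon-separated string to module unuse:

```bash
#!/usr/bin/env bash
# Remember the current MODULEPATH, then remove each entry individually.
eessi_save_and_clear_modulepath() {
    [ -n "${MODULEPATH:-}" ] || return 0
    export EESSI_SAVED_MODULEPATH="${MODULEPATH}"   # hypothetical variable
    local -a entries
    local entry
    # Iterate over a copy, since `module unuse` modifies MODULEPATH itself.
    IFS=: read -r -a entries <<< "${MODULEPATH}"
    for entry in "${entries[@]}"; do
        if [ -n "${entry}" ]; then
            module unuse "${entry}"
        fi
    done
    return 0
}

# Put the saved entries back in their original order.
eessi_restore_modulepath() {
    [ -n "${EESSI_SAVED_MODULEPATH:-}" ] || return 0
    local -a entries
    local entry i
    IFS=: read -r -a entries <<< "${EESSI_SAVED_MODULEPATH}"
    # `module use` prepends, so walk the list backwards to preserve order.
    for ((i = ${#entries[@]} - 1; i >= 0; i--)); do
        entry="${entries[i]}"
        if [ -n "${entry}" ]; then
            module use "${entry}"
        fi
    done
    unset EESSI_SAVED_MODULEPATH
    return 0
}
```

In the mock-up above, eessi_save_and_clear_modulepath would replace the module unuse step, and eessi_restore_modulepath could be called after unloading EESSI. Whether this interacts well with module reset is untested.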
Also, this is logical in hindsight, but it surprised me:
$ module reset
The system default contains no modules
(env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
No changes in loaded modules
This also means EESSI/2023.06 will not be unloaded.
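To answer the earlier question of how to see what the system defaults are: the output above points at the LMOD_SYSTEM_DEFAULT_MODULES environment variable. A small sketch for inspecting it, assuming a colon-separated list (the function name is hypothetical):

```bash
#!/usr/bin/env bash
# Print the modules that `module reset` would restore, one per line.
# Assumption: LMOD_SYSTEM_DEFAULT_MODULES is a colon-separated list.
show_lmod_system_defaults() {
    local defaults="${LMOD_SYSTEM_DEFAULT_MODULES:-}"
    if [ -z "${defaults}" ]; then
        echo "LMOD_SYSTEM_DEFAULT_MODULES is empty: module reset has nothing to load"
        return 0
    fi
    local -a mods
    IFS=: read -r -a mods <<< "${defaults}"
    printf '%s\n' "${mods[@]}"
}
```

If the initialisation script were to set this variable to EESSI/$EESSI_VERSION, a reset would presumably reload EESSI rather than leave the session empty, which matches the behaviour seen on the test VM above.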
So, it seems that Lmod can't search inside EESSI until the module is loaded: