TACC / Lmod

Lmod: An Environment Module System based on Lua, Reads TCL Modules, Supports a Software Hierarchy
http://lmod.readthedocs.org

Desired but unexpected behavior? #524

Closed dorrellmw closed 2 years ago

dorrellmw commented 3 years ago

Context: I'm deploying software ABC for a cluster which has "cpu nodes" and "gpu nodes", and ABC needs separate builds for the two kinds of nodes.

I want the users to be able to module load ABC (or module load ABC/2.0.0) and automatically get the correct build for the system which is loading the module (yes, there are caveats to this, but that's not the point of this issue). I have found that this directory structure works:

ABC/
ABC/2.0.0.lua
ABC/.modulerc.lua
ABC/.cpu/
ABC/.cpu/2.0.0.lua
ABC/.gpu/
ABC/.gpu/2.0.0.lua

where ABC/2.0.0.lua has this content:


-- check for node type here and set nodeType variable accordingly

if (nodeType == "cpu") then load("ABC/.cpu/2.0.0") end
if (nodeType == "gpu") then load("ABC/.gpu/2.0.0") end

whatis("Name : ABC")
whatis("Version : 2.0.0")
whatis("Short description : Blah blah blah.")
help([[Blah blah blah.]])

and where ABC/.cpu/2.0.0.lua and ABC/.gpu/2.0.0.lua are the actual modules for the cpu and gpu builds of ABC. The remaining file, ABC/.modulerc.lua, is empty.
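
For illustration, the node-type check left as a comment in ABC/2.0.0.lua could be filled in along the following lines. This is only a sketch under stated assumptions: it supposes the site exports an environment variable named CLUSTER_NODE_TYPE (a hypothetical name, not something Lmod or the original post defines) with the value "cpu" or "gpu" on each node. It uses os.getenv together with Lmod's load and LmodError modulefile functions; any other site-specific test could be substituted.

```lua
-- ABC/2.0.0.lua (sketch): dispatch to the build that matches the current node.
-- ASSUMPTION: CLUSTER_NODE_TYPE is a hypothetical site-defined variable
-- holding "cpu" or "gpu"; it is not part of Lmod or of the original setup.
local nodeType = os.getenv("CLUSTER_NODE_TYPE") or "cpu"  -- default to the cpu build

if (nodeType == "cpu") then
   load("ABC/.cpu/2.0.0")
elseif (nodeType == "gpu") then
   load("ABC/.gpu/2.0.0")
else
   -- Abort the load with a message instead of silently loading nothing.
   LmodError("ABC: unknown node type '" .. nodeType .. "'")
end

whatis("Name : ABC")
whatis("Version : 2.0.0")
```

Using an explicit else branch means a typo in the variable's value produces an error at load time rather than an apparently successful load that sets up no build.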

When you module load ABC and then run module list, you get this:

Currently Loaded Modules:
  1) ABC/.cpu/2.0.0 (H)

  Where:
   H:  Hidden Module

Running module load ABC/2.0.0 or module load ABC/.cpu/2.0.0 yields the same result. If you now run module load ABC/.gpu/2.0.0, you get this:

The following have been reloaded with a version change:
  1) ABC/.cpu/2.0.0 => ABC/.gpu/2.0.0

and now module load ABC will switch them back. Everything works exactly as I'd like: I can tell users to module load ABC, and it will choose the right build transparently behind the scenes.

Is this normal behavior? Why does ABC/.modulerc.lua have to exist for it to work? Are there any corner cases where this fails?

Thanks!

EDIT: I've noticed a side effect that may or may not be harmless. After unloading the module, env | grep -i ABC gives this result:

__LMOD_REF_COUNT__LMFILES_=/path/to/modulefiles/ABC/2.0.0.lua:1
__LMOD_REF_COUNT_LOADEDMODULES=ABC/2.0.0:1
LOADEDMODULES=ABC/2.0.0
_LMFILES_=/path/to/modulefiles/ABC/2.0.0.lua

However, module unload ABC, module unload ABC/2.0.0, and module purge have no effect. As far as I can tell, there is no way to remove those final traces of ABC/2.0.0 (aside from directly changing the environment variables).

These tests were performed in version 8.2.7.

rtmclay commented 3 years ago

To answer your first group of questions: the empty .modulerc.lua causes Lmod to treat the two hidden modules as Name/Version/Version (N/V/V) files. You can read about that in the lmod.readthedocs.io pages. The point is that the short name is ABC for all three modulefiles, and the version can be 2.0.0, .cpu/2.0.0, or .gpu/2.0.0.

As far as I know there are no downsides to doing it this way except for the problem you have pointed out.

I will have to track down why the ref counts for the ABC/2.0.0 module persist even though the module has been removed.

rtmclay commented 3 years ago

By the way, you could change the name of the .cpu and .gpu modules to be something like: ABC_helper/{.cpu,.gpu} and have both ABC and ABC_helper loaded at the same time.
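
One possible reading of this suggestion, sketched below, is that the wrapper ABC/2.0.0.lua stays where it is and the per-node builds move to modulefiles named ABC_helper/.cpu.lua and ABC_helper/.gpu.lua. That file layout is an assumption, as is the CLUSTER_NODE_TYPE environment variable used to pick the build (a hypothetical site-defined name). Because the helper's short name would be ABC_helper rather than ABC, the wrapper and the helper no longer share a short name and can both stay loaded.

```lua
-- ABC/2.0.0.lua (sketch): wrapper that loads a separately named helper module.
-- ASSUMPTIONS: the builds live in ABC_helper/.cpu.lua and ABC_helper/.gpu.lua,
-- and CLUSTER_NODE_TYPE is a hypothetical site-defined variable ("cpu" or "gpu").
local nodeType = os.getenv("CLUSTER_NODE_TYPE") or "cpu"  -- default to the cpu build

if (nodeType == "gpu") then
   load("ABC_helper/.gpu")
else
   load("ABC_helper/.cpu")
end

whatis("Name : ABC")
whatis("Version : 2.0.0")
```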

rtmclay commented 3 years ago

Well, that was a subtle bug, but I found the problem. This bug only happens when two modulefiles share the same "sn" (short name). In this case it is "ABC". I have created a new version of Lmod (8.5.12) which solves this problem for me.

When you get a chance, please test Lmod 8.5.12. Thanks very much for the bug report!

rtmclay commented 2 years ago

O.K. to close this issue?

dorrellmw commented 1 year ago

I'm sorry for losing track of this issue, but thank you for resolving the bug!