TACC / Lmod

Lmod: An Environment Module System based on Lua, Reads TCL Modules, Supports a Software Hierarchy
http://lmod.readthedocs.org

Inconsistent TCL / Lua behavior within conditionals #618

Closed vanderwb closed 1 year ago

vanderwb commented 1 year ago

Describe the bug

Using Cray environment modules with Lmod, I've discovered an issue with TCL modules that have prepend/append statements inside conditionals. If a module is loaded by another module (e.g., a Cray "PrgEnv" module) and you then switch to another "PrgEnv" that loads the same module, the reference count for the variable prepended in the conditional is erroneously incremented, so unloading the module does not remove the value from the variable. Here is an example of the problematic code:

set pkg_enabled [ info exists env(PKG2_VAR) ]

if { $pkg_enabled } {
   prepend-path PKG1_VAR value
}

Note that the variable being tested is set by another module that is loaded as part of the "PrgEnv" collection.

To Reproduce

I have included a reproducer. Use run_lua.sh to see the Lua behavior and run_tcl.sh to see the TCL behavior. My results are below:

$ env -i LMOD_ROOT=$LMOD_ROOT USER=$USER ./run_lua.sh
Initial:          PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1
First remove:     PKG1_VAR=empty   __LMOD_REF_COUNT_PKG1_VAR=
After swap:       PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1
Second remove:    PKG1_VAR=empty   __LMOD_REF_COUNT_PKG1_VAR=
$ env -i LMOD_ROOT=$LMOD_ROOT USER=$USER ./run_tcl.sh
Initial:          PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1
First remove:     PKG1_VAR=empty   __LMOD_REF_COUNT_PKG1_VAR=
After swap:       PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:2
Second remove:    PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1

Expected behavior

I would expect both cases to follow the Lua-module behavior, in which the reference count stays at 1 when a module is unloaded and then reloaded during a module swap. Apparently this works for TCL modules when run using the original environment module command.
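
The outputs above are consistent with a reference-counted entry that is dropped only when its count reaches zero. The following is a toy model in plain Lua, not Lmod's implementation; the names counts, prepend, remove, and present are hypothetical and exist only for illustration:

```lua
-- Toy model of a reference-counted path entry (an assumption for
-- illustration only; this is NOT how Lmod is implemented internally).
local counts = {}

-- Adding a value increments its count.
local function prepend(value)
    counts[value] = (counts[value] or 0) + 1
end

-- Removing a value decrements its count; the entry disappears at zero.
local function remove(value)
    local n = (counts[value] or 0) - 1
    counts[value] = (n > 0) and n or nil
end

local function present(value)
    return counts[value] ~= nil
end

prepend("value")            -- count is 1, as in "Initial"
remove("value")             -- entry gone, as in "First remove"
prepend("value")
prepend("value")            -- count is 2, as in the TCL "After swap" line
remove("value")             -- count is 1, so the value is still present
print(present("value"))     --> true
```

With a count of 2 after the swap, a single removal leaves the value in place, which matches the "Second remove" line of the TCL run.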

Desktop (please complete the following information):

Modules based on Lua: Version 8.7.14  2022-11-01 10:59 -05:00
    by Robert McLay mclay@tacc.utexas.edu

Changes from Default Configuration
----------------------------------

Name                         Where Set  Default      Value
----                         ---------  -------      -----
LFS_VERSION                  D          1.6.3        1.8.0
LMOD_PACKAGE_PATH            D          nil          <empty>
LMOD_PAGER                   C          less         /usr/bin/less
LMOD_SYSTEM_DEFAULT_MODULES  E          __unknown__  crayenv/22.11:cray-pals/1.2.4:craype/2.7.19:cray-dsmml/0.2.2:cray-libsci/22.11.1.2:PrgEnv-cray/8.3.3:cce/15.0.0:cray-mpich/8.1.21:libfabric/1.15.0.0:cray-pmi/6.1.7:craype-x86-milan:craype-network-ofi
LMOD_SYSTEM_NAME             E          false        gust
LMOD_TCLSH                   C          tclsh        /glade/u/apps/common/22.08/spack/opt/spack/tcl/8.6.12/gcc/7.5.0/bin/tclsh
MODULEPATH_ROOT              E                       /glade/u/apps/gust/modules
PATH_TO_LUA                  C          lua          /glade/u/apps/gust/22.10/spack/opt/spack/lua/5.3.5/gcc/7.5.0/bin/lua

rtmclay commented 1 year ago

Thanks for creating a test case; these help a great deal in tracking down issues. However, your Lua modulefile pkg1/1.0.lua was flawed. You had:

if os.getenv("PKG2_VAR")  ~= "" then
    prepend_path("PKG1_VAR", "value")
end

That won't work, because os.getenv("VAR") returns nil if $VAR doesn't exist, and nil ~= "" is true. So your if statement was true even when PKG2_VAR was unset.

What you want is something like:

if ((os.getenv("PKG2_VAR") or "")  ~= "") then
    prepend_path("PKG1_VAR", "value")
end
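
The difference between the two tests can be checked in plain Lua outside Lmod. This is a minimal sketch: the helper names enabled_original and enabled_fixed are hypothetical, and each case value stands in for what os.getenv("PKG2_VAR") might return:

```lua
-- Plain Lua, runnable without Lmod; helper names are hypothetical.

-- Original test: nil ~= "" is true, so an unset variable passes.
local function enabled_original(value)
    return value ~= ""
end

-- Corrected test: map nil to "" before comparing.
local function enabled_fixed(value)
    return (value or "") ~= ""
end

-- Each case stands in for a possible result of os.getenv("PKG2_VAR").
local cases = {
    { label = "unset (nil)",  value = nil },
    { label = "empty string", value = "" },
    { label = "non-empty",    value = "anything" },
}

for _, c in ipairs(cases) do
    print(string.format("%-14s original=%-5s fixed=%s",
        c.label,
        tostring(enabled_original(c.value)),
        tostring(enabled_fixed(c.value))))
end
```

Only the unset case differs between the two helpers: the original test reports it as enabled, while the corrected test does not.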

If you do that, you get the same result with both Lua and TCL modulefiles:

$ env -i LMOD_ROOT=$LMOD_ROOT USER=$USER ./run_lua.sh
Initial:          PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1
First remove:     PKG1_VAR=empty   __LMOD_REF_COUNT_PKG1_VAR=
After swap:       PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:2
Second remove:    PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1

$ env -i LMOD_ROOT=$LMOD_ROOT USER=$USER ./run_tcl.sh
Initial:          PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1
First remove:     PKG1_VAR=empty   __LMOD_REF_COUNT_PKG1_VAR=
After swap:       PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:2
Second remove:    PKG1_VAR=value   __LMOD_REF_COUNT_PKG1_VAR=value:1

Please feel free to rework your test case to better illustrate the issue.

rtmclay commented 1 year ago

Any updates?

vanderwb commented 1 year ago

Hi Robert - I've been on PTO for a bit and am trying to catch up with the vendor responses on this issue. I'll update this ticket soon. Thanks for the replies.

vanderwb commented 1 year ago

And now our system has been down for over a week. Frustrating!

In any case, HPE has since come back and just said "our TCL modules aren't supported in Lmod", so I guess that is that. Since you've demonstrated that the behavior is consistent within Lmod for the TCL and Lua cases, things are probably working as they should, even though this particular TCL module from the Cray stack does not work in Lmod.

So in sum, we will just need to go down the supported path here according to the vendor recommendation. Thank you for taking a look - and feel free to close.

rtmclay commented 1 year ago

O.K. but if you ever get a chance, please submit a bug report on this issue.