cea-hpc / modules

Environment Modules: provides dynamic modification of a user's environment
http://modules.sourceforge.net/
GNU General Public License v2.0
668 stars 102 forks source link

profile.sh misdetects login shell #473

Closed mw-a closed 1 year ago

mw-a commented 1 year ago

Describe the bug

We're seeing error messages from OpenPBS jobs as generated by the reproducer further down. This happens for users whose login shell is /bin/tcsh but override their job to use /bin/bash for executing the job script. It can be traced to /etc/profile.d/modules.sh which sources /usr/share/Modules/init/tcsh because it wrongly detects the executing shell to be tcsh. Further investigation reveals this to be caused by the optimisation/fix introduced in #173 to use the $BASH variable to make detection more efficient and robust. Unfortunately this runs afoul of two very specific circumstances in our scenario:

To Reproduce

Steps to reproduce the behavior:

  1. Create a user with login shell tcsh:
$ getent passwd $USER
testuser:*:237:1120:test user:/home/testuser:/bin/tcsh
  1. Have them execute the following minimal reproducer command:
$ echo echo '$BASH' | bash -c 'exec -a -bash /bin/bash'
/usr/share/Modules/init/tcsh: eval: line 3: syntax error near unexpected token `alias'
/usr/share/Modules/init/tcsh: eval: line 3: `if ( $?histchars && $?prompt ) alias module 'set _histchars = $histchars; unset histchars; set _prompt=$prompt:q; set prompt=""; eval "`/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl tcsh \!*:q`"; set _exit="$status"; set histchars = $_histchars; unset _histchars; set prompt=$_prompt:q; unset _prompt; test 0 = $_exit' ;'
/usr/share/Modules/init/tcsh: line 15: syntax error near unexpected token `alias'
/usr/share/Modules/init/tcsh: line 15: `      if ( $?histchars && $?prompt ) alias module 'set _histchars = $histchars; unset histchars; set _prompt=$prompt:q; set prompt=""; eval `/usr/share/Modules/libexec/modulecmd-compat tcsh \!*`; set _exit="$status";  set histchars = $_histchars; unset _histchars; set prompt=$_prompt:q; unset _prompt; test 0 = $_exit' ;'
/bin/tcsh

No modulerc or modulefiles are involved in this problem because it's the initialisation process that's failing.

Expected behavior

The shell should execute the command without error messages.

Error and debugging information

It could be argued whether bash's or PBS's behaviour are strictly correct here. On the other hand, changing their established behaviour, particularly that of bash, is likely to break other things. So any ideas how to work around this problem on the modules level would be very much appreciated.

Modules version and configuration

$ module --version
Modules Release 4.5.2 (2020-07-30)
$ module config --dump-state
Modules Release 4.5.2 (2020-07-30)

- Config. name ---------.- Value (set by if default overridden) ---------------
advanced_version_spec     0
auto_handling             0
avail_indepth             1
avail_report_dir_sym      1
avail_report_mfile_sym    1
collection_pin_version    0
collection_target         <undef>
color                     never
colors                    hi=1:db=2:se=2:er=91:wa=93:me=95:in=94:mp=1;94:di=94:al=96:sy=95:de=4:cm=92
contact                   root@localhost
extended_default          0
extra_siteconfig          <undef>
home                      /usr/share/Modules (env-var)
icase                     never
ignored_dirs              CVS RCS SCCS .svn .git .SYNC .sos
implicit_default          1
locked_configs
ml                        1
pager                     /usr/bin/less -eFKRX
rcfile                    <undef>
run_quarantine            <undef>
search_match              starts_with
set_shell_startup         0
silent_shell_debug        <undef>
siteconfig                /etc/environment-modules/siteconfig.tcl
tcl_ext_lib               /usr/lib64/libtclenvmodules.so
term_background           dark
unload_match_order        returnlast
verbosity                 normal
wa_277                    0

- State name -----------.- Value ----------------------------------------------
cmdline                   /usr/share/Modules/libexec/modulecmd.tcl tcsh config --dump-state --no-pager
extra_siteconfig_loaded   0
force                     0
is_stderr_tty             1
is_win                    0
machine                   x86_64
os                        Linux 4.18.0-372.9.1.el8.x86_64
pager_started             0
paginate                  0
path_separator            :
rc_loaded                 <undef>
siteconfig_loaded         1
shell                     tcsh
shelltype                 csh
subcmd                    config
subcmd_args               --dump-state
tcl_ext_lib_loaded        1
tcl_version               8.6.8
term_columns              158

- Env. variable --------.- Value ----------------------------------------------
LOADEDMODULES
MODULEPATH                /app/modulefiles:/app/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
MODULEPATH_modshare       /usr/share/modulefiles:1:/usr/share/Modules/modulefiles:1:/etc/modulefiles:1
MODULESHOME               /usr/share/Modules
MODULES_CMD               /usr/share/Modules/libexec/modulecmd.tcl
MODULES_USE_COMPAT_VERSION 0

Additional context

xdelaruelle commented 1 year ago

Many thanks for your detailed report and for the reproducer.

As spotted, issue comes from the following lines of sh profile script:

https://github.com/cea-hpc/modules/blob/b5c3c30477fee50c1ef51b62f7a56ead4c70f88c/init/profile.sh.in#L5-L6

Code may be fixed by restraining the possibilities when detecting that the BASH variable is set. If BASH is found set, I assume we need to source either sh or bash version of the Modules initialization script. So I think the following code change should fix the issue:

--- orig    2022-09-13 06:55:10.807140403 +0200
+++ /etc/profile.d/modules.sh   2022-09-13 06:55:42.119445769 +0200
@@ -1,7 +1,11 @@
 # get current shell name by querying shell variables or looking at parent
 # process name
 if [ -n "${BASH:-}" ]; then
-   shell=${BASH##*/}
+   if [ "${BASH##*/}" = 'sh' ]; then
+      shell='sh'
+   else
+      shell='bash'
+   fi
 elif [ -n "${ZSH_NAME:-}" ]; then
    shell=$ZSH_NAME
 else
mw-a commented 1 year ago

Thanks for the quick response and sorry for not responding as quickly.

I have now tested the proposed fix in our setup and can confirm that it fixes the problem with the minimal reproducer as well as for PBS jobs. Thank you!

lzaoral commented 1 year ago

@mw-a, thank you for this detailed report! @xdelaruelle, thank you for letting me know that you have created a fix for this bug (rhbz#1815047)!

I've already fixed it in CentOS Stream 8 & 9 so this will be fixed in upcoming RHEL releases.