Closed: hamad12a closed this issue 2 years ago
Do you see any output at all? Since v1.24.0 of the extension, terminals that exit within 5 seconds should remain open, in case the command failed to execute. Also, is `srun` globally available, or is it something added (to your `PATH`) in a profile script? Depending on your OS and default shell, some profile scripts might not run, which would make your command (or environment variables used by your command) unavailable.
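A quick way to check whether a profile script is what puts `srun` on the `PATH` is to compare a non-login shell with a login shell on the remote host. This is only a sketch; `srun` here stands for whichever command the task fails to find:

```shell
# Compare the PATH a non-login shell sees with the PATH a login shell sees.
# Run these on the remote host; 'srun' is the command under suspicion.
echo "non-login PATH: $(bash -c 'echo $PATH')"
echo "login PATH:     $(bash -lc 'echo $PATH')"
# If the command only resolves in the login shell, a profile script adds it:
bash -lc 'command -v srun' || echo "srun not on PATH even in a login shell"
```

If the two `PATH` values differ and the command only resolves under `bash -lc`, the extension's shell is probably skipping the profile script that defines it.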
No, the command `srun` is NOT an alias or a function; it is a Slurm command that is available globally (on the server). The terminal remains open, but the expected output is absent. Running the task opens a terminal with no output, i.e.,
Connecting to xxxx@host...
Terminal will be reused by tasks, press any key to close it.
Additionally, when I log in directly in real time with the terminal, I don't find the expected output either.
Another log captured by the SSH FS extension reports,
[DEBUG] Starting shell for xxxx@host.: cd "/home/xxxx/workplace"; srun -p short ~/workplace/go ~/workplace/NN/Sn130.txt
[DEBUG] Terminal session closed: {"code":0,"status":"open"}
However, replacing the previous command with only `sinfo` yields the expected output; note that this last command takes about 1 second to execute, while the first, lengthier command takes a few seconds.
> Additionally, when I login directly in real-time with the terminal I don't find the expected output either.
I assume you mean a "regular" remote SSH terminal created by the extension instead of a task-specific one?
That `Terminal session closed` log statement — is it logged basically immediately after the `Starting shell` line? That last line indicates that the remote SSH server reported that the process exited successfully. The only other issue I can think of is a misconfigured (non-loaded) environment; I don't know if `srun` expects certain environment variables to be set.
Perhaps setting the debug level to the highest might help? A quick glance at the documentation shows that it only logs errors by default, but perhaps it's silently exiting for some unexpected reason that it doesn't count as an error. Since `sinfo` yields output, I doubt it's an issue with the extension silently discarding the output.
The remote terminal created by the extension is similar to the one shown in the extension's screenshot: https://github.com/SchoofsKelvin/vscode-sshfs/raw/master/media/shell-tasks.png
The `Terminal session closed` log statement isn't logged immediately; it takes more than one second to appear.
Now, here are the logs that appear once the highest debug level is enabled:
> Executing task: test task2 <
Connecting to xxx@host...
slurmstepd: debug level = 6
slurmstepd: debug: IO handler started pid=18714
slurmstepd: starting 1 tasks
slurmstepd: task 0 (18719) started 2022-03-23T11:55:36
slurmstepd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
slurmstepd: debug: Sending launch resp rc=0
slurmstepd: debug: mpi type = (null)
slurmstepd: debug: Using mpi/none
slurmstepd: debug: task_p_pre_launch: affinity jobid 11696.0, task:0 bind:8448
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_STACK : max:inf cur:inf req:8388608
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
slurmstepd: debug2: _set_limit: RLIMIT_CORE : max:inf cur:inf req:0
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_NPROC : max:128499 cur:128499 req:4096
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:131072 req:1024
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
slurmstepd: task 0 (18719) exited with exit code 0.
slurmstepd: debug: task_p_post_term: affinity 11696.0, task 0
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug: step_terminate_monitor_stop signaling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 18714
slurmstepd: debug: Waiting for IO
slurmstepd: debug: Closing debug channel
Terminal will be reused by tasks, press any key to close it.
On the other hand, while working interactively, I see the output just after the line `slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615`, as in:
slurmstepd: debug level = 6
slurmstepd: debug: IO handler started pid=18860
slurmstepd: starting 1 tasks
slurmstepd: task 0 (18865) started 2022-03-23T12:03:35
slurmstepd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
slurmstepd: debug: Sending launch resp rc=0
slurmstepd: debug: mpi type = (null)
slurmstepd: debug: Using mpi/none
slurmstepd: debug: task_p_pre_launch: affinity jobid 11701.0, task:0 bind:8448
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_STACK : max:inf cur:inf req:8388608
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
slurmstepd: debug2: _set_limit: RLIMIT_CORE : max:inf cur:inf req:0
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_NPROC : max:128499 cur:128499 req:4096
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:131072 req:1024
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
*********** CAS= 4 ********************
2*J= 0 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 0.00000 E= -2.36388
2*J= 4 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 1.24227 E= -1.12161
2*J= 8 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 2.03323 E= -0.33065
2*J= 12 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 2.25854 E= -0.10534
2*J= 16 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 2.35658 E= -0.00730
2*J= 20 T-TZ=-1 COUL=0 N= 1 P=0 2*M= 0 C= 0 EXC= 2.45258 E= 0.08870
slurmstepd: task 0 (18865) exited with exit code 0.
slurmstepd: debug: task_p_post_term: affinity 11701.0, task 0
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug: step_terminate_monitor_stop signaling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 18860
slurmstepd: debug: Waiting for IO
slurmstepd: debug: Closing debug channel
My reply looks lengthy; however, for the sake of debugging, one should report all events. Another remark: after launching the task remotely, as usual, I go to check the output file interactively, and I find that an error occurred,
Wed Mar 23 12:06:56 CET 2022
compute27
/home/xxxx/bin/antoine.out: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
Wed Mar 23 12:06:56 CET 2022
The file `antoine.out` is the executable called by the script `go`, which is included in the task command in question. By contrast, this kind of error is not reported while working interactively.
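A common way to diagnose this kind of failure (a sketch; the path to `antoine.out` is taken from the error above, while the search roots are guesses) is to ask the dynamic loader which libraries it cannot resolve on the compute node:

```shell
# On the compute node: list unresolved shared libraries of the executable.
ldd /home/xxxx/bin/antoine.out | grep 'not found'
# Then locate where libgfortran.so.5 actually lives (search roots are guesses):
find /opt /usr -name 'libgfortran.so.5*' 2>/dev/null
# Whatever directory turns up must be visible to the loader (e.g. via
# LD_LIBRARY_PATH) in the shell where the job actually runs.
```

If `ldd` reports the library as `not found` only on the compute node, the difference is in the environment the job's shell inherits there, not in the executable itself.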
The last remark: I tried to open a shell session and THEN execute the previous task command instead; what I mean is replacing the previous task command with,
"command":"srun --slurmd-debug=verbose -p short --pty /bin/bash -l"
This opens an interactive shell in the task terminal; on this occasion I have full control over what I execute. Now I run the previous task command interactively,
/home/xxxx/workplace/go /home/xxxx/workplace/NN/Sn130
The result is the same as before; an error occurred during execution and was reported in the output file:
Wed Mar 23 12:50:53 CET 2022
compute27
/home/xxxx/bin/antoine.out: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
Wed Mar 23 12:50:53 CET 2022
(You may be confused about the output file; interactively, after a successful execution, I obtain both the output file and that file's stdout printed on the shell's screen. I hope this isn't a big issue.)
I am writing another reply, but it might be deleted if what I'm saying doesn't make sense.
Why is a `module` task command reported as `bash: module: command not found`? I want to load Slurm's modules to be able to run `srun`.
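For context, `module` is normally not a binary but a shell function defined by an init script sourced from the login-shell profile (on OpenHPC-style clusters this is commonly `/etc/profile.d/lmod.sh` or `/etc/profile.d/modules.sh`; the exact path is an assumption and varies by site). A shell that skipped the profile scripts therefore has no `module` at all. A minimal check:

```shell
# A login shell that ran the profile scripts should know 'module' as a function:
bash -lc 'type module' || echo "module is not defined in this shell"
# If it is missing, sourcing the modules init script by hand usually fixes it
# (path is a guess; check /etc/profile.d/ on your cluster):
# . /etc/profile.d/modules.sh
```

This also explains why `module` works in a manually opened terminal but not in the shell the task spawns: only the former ran the profile scripts that define the function.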
I think this is the problem that caused this whole issue to be opened; as you said earlier, the command `srun` is part of, or depends on, some modules which should be loaded. Consequently, running the executable throws `error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory`.
To come to a conclusion, one should find out what the difference is between opening a remote shell with the extension using the commands `/bin/bash -l` and `srun -p short --pty /bin/bash -l`, and opening it manually by executing the same commands one by one.
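One way to answer that question (a sketch that must be run on the cluster; the partition name `short` is taken from this issue) is to dump and diff the environment each kind of shell ends up with:

```shell
# Dump the environment of a login shell on the head node...
bash -lc 'env | sort' > head.env
# ...and of a login shell spawned by srun on a compute node:
srun -p short bash -lc 'env | sort' > compute.env
# Variables that differ (PATH, LD_LIBRARY_PATH, MODULEPATH, ...) are suspects:
diff head.env compute.env | grep -E 'PATH|MODULE'
```

Any variable present on the head node but missing from the `srun`-spawned shell is a candidate for the "works interactively, fails in the task" behaviour.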
Looks like it's indeed an environment/profile issue. This SO question mentions `LD_LIBRARY_PATH` in the comments, similar to the answers on this question that was specifically for SSH.
I'll be closing this ticket for now. If it's an issue with the extension (perhaps an issue with it not being a login shell or similar), feel free to comment and reopen it.
Indeed, the problem was the absence of `LD_LIBRARY_PATH`, which wasn't exported in the spawned shell, basically because the task command `srun -p short` runs jobs on nodes other than the head node. After putting the one-line command `export LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib:/opt/ohpc/pub/compiler/gcc/8.3.0/lib64` in the `.bash_profile` file, and not the `.bashrc` file, it worked normally.
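For the record, the reason `.bash_profile` works while `.bashrc` does not: `srun ... --pty /bin/bash -l` (like the extension's `/bin/bash -l`) starts a *login* shell, which reads `~/.bash_profile` but not necessarily `~/.bashrc`. A sketch of the relevant lines (the export path is copied from the fix above; the `.bashrc`-sourcing line is a common convention, not something from this issue):

```shell
# ~/.bash_profile — read by login shells such as 'bash -l' spawned by srun:
export LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib:/opt/ohpc/pub/compiler/gcc/8.3.0/lib64
# Many setups also source ~/.bashrc here so both shell types stay consistent:
if [ -f ~/.bashrc ]; then . ~/.bashrc; fi
```

Putting the export only in `.bashrc` would leave it invisible to the login shells that Slurm spawns on the compute nodes.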
My task is like the following,
Executing the task doesn't wait to receive the output from the host; by contrast, in a real-time shell session, executing such a command takes some time before echoing the result.