SchoofsKelvin / vscode-sshfs

Extension for Visual Studio Code: File system provider using SSH
GNU General Public License v3.0

Why does the task end while using Slurm commands? #330

Closed: hamad12a closed this issue 2 years ago

hamad12a commented 2 years ago

My task is defined as follows:

"tasks": {
    "version": "2.0.0",
    "tasks": [
        {
            "label": "test task",
            "type": "ssh-shell",
            "host": "hostname",
            "command": "srun -p short ~/workplace/go ~/workplace/NN/Sn130.txt",
            "problemMatcher": []
        }
    ]
}

Executing the task doesn't wait to receive the output from the host; by contrast, in a real interactive shell session the same command runs for some time and then echoes its result.

SchoofsKelvin commented 2 years ago

Do you see any output at all? Since v1.24.0 of the extension, terminals that exit within 5 seconds should remain open, in case the command failed to execute. Also, is srun globally available, or is it something added to your PATH in a profile script? Depending on your OS and default shell, some profile scripts might not run, meaning your command (or environment variables it relies on) might not be available.
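
A quick way to check (a sketch, assuming bash on the remote host and that you can also SSH in manually) is to compare what a non-login shell sees versus a login shell:

# Non-login, non-interactive shell: close to what a task command runs in
ssh xxxx@host 'type srun; echo $PATH'

# Login shell: profile scripts (~/.bash_profile, /etc/profile.d/*) are sourced first
ssh xxxx@host 'bash -lc "type srun; echo \$PATH"'

If srun only shows up in the second case, a profile script is what adds it to your PATH.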

hamad12a commented 2 years ago

No, the srun command is NOT an alias or a function; it is a Slurm command that is available globally (on the server). The terminal remains open, but the expected output is absent. Running the task opens a terminal without output, i.e.:


Connecting to xxxx@host...

Terminal will be reused by tasks, press any key to close it.

Additionally, when I log in directly in a real-time terminal session, I don't find the expected output either.

Another log captured by the SSH FS extension reports:

[DEBUG] Starting shell for xxxx@host.: cd "/home/xxxx/workplace"; srun -p short ~/workplace/go ~/workplace/NN/Sn130.txt
[DEBUG] Terminal session closed: {"code":0,"status":"open"}

However, replacing the previous command with just sinfo yields the expected output. Note that sinfo takes about 1 second to execute, while the lengthy srun command takes a few seconds.
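
One way to rule out the output being produced but lost (a sketch; the log filename here is arbitrary) is to redirect everything the job writes to a file and inspect it after the task exits:

srun -p short ~/workplace/go ~/workplace/NN/Sn130.txt > ~/srun-task.log 2>&1
cat ~/srun-task.log

If the file stays empty, the job itself produced nothing, rather than the terminal discarding it.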

SchoofsKelvin commented 2 years ago

Additionally, when I log in directly in a real-time terminal session, I don't find the expected output either.

I assume you mean a "regular" remote SSH terminal created by the extension instead of a task-specific one?

That Terminal session closed log statement: is it logged basically immediately after the Starting shell line? That last line indicates that the remote SSH server reported that the process exited successfully. The only other issue I can think of is a misconfigured (not fully loaded) environment; I don't know whether srun expects certain environment variables to be set.

Perhaps setting the debug level to the highest might help? A quick glance at the documentation shows that srun only logs errors by default, but perhaps it's silently exiting "successfully" for some unexpected reason that it doesn't count as an error. Since sinfo yields output, I doubt the extension is silently discarding the output.
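
For example (a sketch; both flags are standard srun options), the task command could be changed to something like:

"command": "srun -vv --slurmd-debug=verbose -p short ~/workplace/go ~/workplace/NN/Sn130.txt"

Here -vv raises srun's own verbosity and --slurmd-debug=verbose forwards the slurmd daemon's logging for the step to your terminal.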

hamad12a commented 2 years ago

The remote terminal created by the extension is similar to the one in the captured screenshot from the extension: https://github.com/SchoofsKelvin/vscode-sshfs/raw/master/media/shell-tasks.png. The Terminal session closed log statement isn't logged immediately; it takes more than one second to appear.

Here are the logs that appear once the highest debug level is enabled:

> Executing task: test task2 <

Connecting to xxx@host...
slurmstepd: debug level = 6
slurmstepd: debug:  IO handler started pid=18714
slurmstepd: starting 1 tasks
slurmstepd: task 0 (18719) started 2022-03-23T11:55:36
slurmstepd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
slurmstepd: debug:  Sending launch resp rc=0
slurmstepd: debug:  mpi type = (null)
slurmstepd: debug:  Using mpi/none
slurmstepd: debug:  task_p_pre_launch: affinity jobid 11696.0, task:0 bind:8448
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
slurmstepd: debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_NPROC  : max:128499 cur:128499 req:4096
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:131072 req:1024
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615

slurmstepd: task 0 (18719) exited with exit code 0.
slurmstepd: debug:  task_p_post_term: affinity 11696.0, task 0
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug:  step_terminate_monitor_stop signaling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 18714
slurmstepd: debug:  Waiting for IO
slurmstepd: debug:  Closing debug channel

Terminal will be reused by tasks, press any key to close it.

On the other hand, while working interactively I see the output just after the line slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615:

slurmstepd: debug level = 6
slurmstepd: debug:  IO handler started pid=18860
slurmstepd: starting 1 tasks
slurmstepd: task 0 (18865) started 2022-03-23T12:03:35
slurmstepd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
slurmstepd: debug:  Sending launch resp rc=0
slurmstepd: debug:  mpi type = (null)
slurmstepd: debug:  Using mpi/none
slurmstepd: debug:  task_p_pre_launch: affinity jobid 11701.0, task:0 bind:8448
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
slurmstepd: debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: RLIMIT_NPROC  : max:128499 cur:128499 req:4096
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
slurmstepd: debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:131072 req:1024
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
slurmstepd: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
  *********** CAS=  4  ********************
 2*J=  0 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   0.00000 E=    -2.36388
 2*J=  4 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   1.24227 E=    -1.12161
 2*J=  8 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   2.03323 E=    -0.33065
 2*J= 12 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   2.25854 E=    -0.10534
 2*J= 16 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   2.35658 E=    -0.00730
 2*J= 20 T-TZ=-1 COUL=0 N=  1 P=0 2*M= 0 C= 0  EXC=   2.45258 E=     0.08870

slurmstepd: task 0 (18865) exited with exit code 0.
slurmstepd: debug:  task_p_post_term: affinity 11701.0, task 0
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug:  step_terminate_monitor_stop signaling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 18860
slurmstepd: debug:  Waiting for IO
slurmstepd: debug:  Closing debug channel

My reply looks lengthy; however, for the sake of debugging one should report all events. Another remark: after launching the task remotely, I went, as usual, to check the output file interactively, and I found that an error had occurred:

Wed Mar 23 12:06:56 CET 2022
compute27
/home/xxxx/bin/antoine.out: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
Wed Mar 23 12:06:56 CET 2022

The file antoine.out is the executable called by the go script used in the task command in question. By contrast, this kind of error is not reported while working interactively.

A last remark: I tried to open a shell session first and THEN execute the previous task command; that is, I replaced the previous task command with:

"command":"srun --slurmd-debug=verbose -p short --pty /bin/bash -l"

This opens an interactive shell in the task terminal, where I have full control over what I execute. Now I run the previous task command interactively:

/home/xxxx/workplace/go /home/xxxx/workplace/NN/Sn130

The result is the same as before; an error occurred during execution and was reported in the output file:

Wed Mar 23 12:50:53 CET 2022
compute27
/home/xxxx/bin/antoine.out: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
Wed Mar 23 12:50:53 CET 2022

(You may be confused about the output file: interactively, after a successful execution, I obtain both the output file and its contents printed to the shell's stdout. I hope that's not a big issue.)

hamad12a commented 2 years ago

I am writing another reply, though it might be deleted if what I'm saying doesn't make sense. Why is a module task command reported as bash: module: command not found? I want to load the Slurm-related modules to be able to run srun. I think this is the problem that caused this whole issue to be opened; as you said earlier, the srun command is part of, or depends on, some modules which should be loaded. Consequently, running the executable throws error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory. To come to a conclusion, one should find out what the difference is between opening a remote shell with the extension using the commands /bin/bash -l and srun -p short --pty /bin/bash -l, and opening one manually and executing the same commands one by one.
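
A sketch of how the two environments could be compared (assuming bash; module is normally a shell function defined by a profile script, e.g. one under /etc/profile.d/):

# Run these both in the extension's terminal and in a manual SSH session:
type module                                   # shell function vs. "command not found"
echo $LD_LIBRARY_PATH                         # empty vs. populated
shopt -q login_shell && echo login || echo non-login

Whichever side reports module: command not found never sourced the profile script that defines it.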

SchoofsKelvin commented 2 years ago

Looks like it's indeed an environment/profile issue. This SO question mentions an LD_LIBRARY_PATH fix in the comments, similar to the answers on this other question that was specifically about SSH.
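
If loading the right profile isn't an option, a possible workaround (a sketch; the library directory below is a placeholder) is to export the variable in the task command itself, since srun by default propagates the caller's environment to the job:

"command": "export LD_LIBRARY_PATH=/path/to/gfortran/libs:$LD_LIBRARY_PATH; srun -p short ~/workplace/go ~/workplace/NN/Sn130.txt"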

I'll be closing this ticket for now. If it turns out to be an issue with the extension (perhaps with it not spawning a login shell or similar), feel free to comment and reopen it.

hamad12a commented 2 years ago

Indeed, the problem was that LD_LIBRARY_PATH wasn't exported in the spawned shell, essentially because the task command srun -p short runs jobs on nodes other than the head node. After adding the one-line command export LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib:/opt/ohpc/pub/compiler/gcc/8.3.0/lib64 to the .bash_profile file (and not the .bashrc file), it worked normally.
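
For reference, the resulting line (paths taken from the cluster's OpenHPC layout mentioned above):

# ~/.bash_profile is sourced by login shells, which is what the extension's /bin/bash -l spawns
export LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib:/opt/ohpc/pub/compiler/gcc/8.3.0/lib64

Since srun copies the submitting shell's environment over to the compute nodes, exporting the variable in the login profile is enough for the job to find libgfortran.so.5.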