garlick opened 2 years ago
Duplicate of #2801.
Note: forwarding currently works in TOSS 3 for the reasons described in #2801: ssh forwarding on our clusters allows connections over the cluster-local network, and since `DISPLAY` is copied from the submission environment, it works almost by accident (as long as you don't log out of the ssh session providing the tunnel on the login node).
In TOSS 4 (on fluke at least) sshd no longer binds to a routable network port, and `DISPLAY` is set to `localhost:<screen>.0` (sshd may bind only to the localhost address, or it may be using a unix domain socket; I didn't look into it in detail yet). Therefore, I don't think the solution above will work going forward. We may have to look into how to set up port forwarding or X11 tunneling back to login nodes for Flux jobs.
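For context, the mapping between a `DISPLAY` value and the TCP port an X server listens on is fixed: display `N` corresponds to port `6000+N` when TCP connections are enabled. A minimal sketch of the arithmetic (the `DISPLAY` value below is a made-up example):

```shell
# Sketch: derive the X server TCP port from a DISPLAY of the form
# localhost:<display>.<screen>. Display N maps to TCP port 6000+N.
# The value here is a hypothetical example, not from a real session.
DISPLAY="localhost:12.0"
display=${DISPLAY#*:}             # strip the host part  -> "12.0"
port=$(( ${display%.*} + 6000 ))  # display number + 6000 -> 6012
echo "$port"
```

This is the same arithmetic the proof-of-concept prolog below uses to pick the port to forward.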
I'm actually unsure this is something that needs support directly in Flux, at least at this early stage; it may be more of a general site configuration issue.
There are two possible solutions:

1. `DISPLAY` exported to jobs "just works" (same configuration as TOSS 3)
2. Set up the correct proxy and `xauth` for each job

Below is a working proof of concept that sets up the correct proxy and xauth. It can be run by hand for now, but the idea is that something like this could be run from a job prolog with the job's user credentials (e.g. under `runuser`).
There are a lot of assumptions in this implementation (all of which are currently true on fluke):

- `HOSTNAME` in the environment array submitted by the job is the hostname of the node from which the job was submitted. However, `HOSTNAME` is not in the list of frequently set or other environment variables listed in POSIX.1-2017. (`HOSTNAME` may be a bashism.) We probably need a solution to #2875 to solve this problem in general.
- `DISPLAY` is set in the submission environment.
- `DISPLAY` is set to `localhost` or nothing (equivalent to `localhost`, I think?). In the case where `DISPLAY` is set to a hostname or routable address, this prolog would likely not be required.

There are probably some other caveats as well. Individual sites should evaluate this solution in their environment and possibly re-implement a site-local solution.
This script, if used from a job prolog, should probably have a corresponding epilog script that kills off the background `ssh` process providing the proxy and removes the xauth cookie from the user's Xauthority file.
```bash
#!/bin/bash

job_getenv()
{
    if test -z "$FLUX_JOB_ENV"; then
        FLUX_JOB_ENV=$(flux job info $FLUX_JOB_ID jobspec \
            | jq .attributes.system.environment)
    fi
    echo $FLUX_JOB_ENV | jq -r .$1
}

host=$(job_getenv HOSTNAME)
DISPLAY=$(job_getenv DISPLAY)

if test -z "$host" -o -z "$DISPLAY"; then
    echo >&2 "HOSTNAME or DISPLAY not set in environment of job. Aborting.."
    exit 0
fi

displayhost=${DISPLAY%:*}
if ! test "$displayhost" = "localhost" -o -z "$displayhost"; then
    echo >&2 "DISPLAY hostname is not empty or localhost"
    exit 0
fi

display=${DISPLAY#*:}
port=$((${display%.*}+6000))

# Forward local X11 port to login host
ssh -4 -fN -L ${port}:localhost:${port} ${host}

# Add xauth from host
xauth add $DISPLAY . $(ssh $host xauth list $DISPLAY | awk '{ print $3 }')

# vi: ts=4 sw=4 expandtab
```
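A corresponding epilog, as suggested above, might look something like the following sketch. This is untested and makes assumptions: a real epilog would recover `DISPLAY` from the jobspec the same way the prolog's `job_getenv` does (a fixed placeholder value is used here), the prolog's `ssh` command line is assumed to be matchable with `pkill -f`, and `xauth remove` is assumed to be the right way to drop the cookie.

```bash
#!/bin/bash
# Sketch of a matching epilog (an assumption, not tested on a real cluster):
# undo what the prolog did -- kill the background ssh proxy and remove the
# xauth cookie from the user's Xauthority file.
DISPLAY=localhost:10.0   # placeholder; recover from the jobspec in practice

display=${DISPLAY#*:}
port=$(( ${display%.*} + 6000 ))

# Kill the background ssh proxy started by the prolog, matching the
# -L forwarding spec on its command line.
pkill -f -- "-L ${port}:localhost:${port}" || true

# Remove the cookie the prolog added for this display.
command -v xauth >/dev/null && xauth remove "$DISPLAY" || true

echo "cleaned up X11 proxy on port $port"
```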
I don't think that I ever heard back from our ISSOs about option 1. I'll bring that back up with them.
I was looking at running this from the prolog on fluke. I run it as the user with `sudo -u \#${FLUX_JOB_USERID} ...` in the prolog script that is run by `perilog-run`. It appears to do what I expect, but the `perilog-run` process doesn't complete unless I kill the background `ssh` process, so my jobs never really start. Is there a way that I could run this script so that the prolog completes?

Would `nohup sudo -u \#${FLUX_JOB_USERID} ...` get the job done?
That doesn't fix it. To clarify, the background `ssh` process starts up and the script completes. So, the only thing still running is the background `ssh` process. The problem appears to be that the background `ssh` process is keeping the `flux-imp run prolog` process on the management node (rank 0) and, by extension, the `flux-perilog-run` process that spawns it, from completing.
The only other thing I can think of is that the background `ssh` process is holding the stdout/err file descriptors open. Does running ssh with `>/dev/null 2>&1` help? I had mistakenly thought that `-f` did this for us, but perhaps not. (`nohup` only seems to redirect stdout/err if the current file descriptors point to a terminal, which may be why that doesn't help here.)
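The effect described here can be reproduced without ssh or Flux at all: a backgrounded child that inherits stdout keeps the pipe open, so whatever is reading the script's output blocks until the child exits. A generic sketch (the `slow` function is just a stand-in for the background `ssh` proxy, not anything from the prolog above):

```shell
# Generic demonstration: a background child holding inherited stdout keeps
# a pipeline reader waiting; redirecting its stdout/stderr releases it.
slow() { sleep 2; }   # stands in for the backgrounded ssh proxy

start=$(date +%s)
( slow & ) | cat                       # cat waits on slow's copy of the pipe
t_inherited=$(( $(date +%s) - start ))

start=$(date +%s)
( slow >/dev/null 2>&1 & ) | cat       # cat sees EOF immediately
t_redirected=$(( $(date +%s) - start ))

echo "inherited fds: ~${t_inherited}s, redirected fds: ~${t_redirected}s"
```

This is the mechanism behind the fix: adding `>/dev/null 2>&1` to the backgrounded `ssh` releases the file descriptors the prolog runner is waiting on.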
Yes! Redirecting stdout/err to `/dev/null` fixes it. Thanks @grondo.
@ryanday36 reminded us about this use case in this week's project meeting.
The use case is e.g.:

```
flux mini alloc -n1 xterm
```

For reference, Slurm users have options described in this Slurm FAQ entry.