Closed DrDaveD closed 5 years ago
Hi @DrDaveD, regarding the `/dev/pts` issue with user namespaces: the problem seems to be that the `newinstance,ptmxmode` mount options are not allowed within a user namespace. To start a discussion, a possible solution: drop `newinstance,ptmxmode` when using a user namespace, and in that case, instead of creating the symlink `/dev/ptmx -> /dev/pts/ptmx`, just bind-mount the host `/dev/ptmx` into the container. Terminal allocation would then work from the container, sharing pty indices with the host but without exposing other already-allocated terminals. Thoughts?

Regarding terminal allocation for each shell/exec/run/start command, I totally agree; I noticed that during OCI engine development. That will be fixed, and `/dev/console` will be bind-mounted too.
@cclerget That appears to be the only option. However, seeing that new kernels have support for that ( https://github.com/torvalds/linux/commit/e98d41370392dbc3e94c8802ce4e9eec9efdf92e#diff-6d314687c035fb9da87d65b2b9ffb698L281 ) and that RHEL is not unwilling to backport this to RHEL 7 ( https://bugzilla.redhat.com/show_bug.cgi?id=1522992 ) this should probably only be done as a fallback.
If a future RHEL 7 release (and maybe RHEL 8) supports mounting devpts without `fakeroot`, we should take advantage of that (distros with modern kernels already have it).
For some tools like `sshd`, there's sadly still https://bugzilla.mindrot.org/show_bug.cgi?id=2813 , so it won't help with those until the tools are adapted. But it would be a start.
I think that binding /dev/ptmx from the host might have the same effect and there's no good reason to try to recreate it or to wait for the RHEL 7 bug fix. It will take a little testing I guess.
> I think that binding /dev/ptmx from the host might have the same effect
If it has, I'm for it. It's just curious that a kernel feature exists, so I wonder if we are either missing something (may become visible in testing) or there is a security issue (which I also do not see, though).
@DrDaveD @olifre After some tests with CentOS 7.4: a devpts mount, no matter which options are passed, always returns "invalid argument" within a user namespace. I also tested on Ubuntu with a ptmx bind mount, and that doesn't work either.
Hi @cclerget, I tried mounting devpts without "newinstance,ptmxmode" as a non-root user in a user namespace on CentOS7: I'm also still not able to mount. I believe the kernel patch @olifre mentioned is necessary for this to work.
@DrDaveD, @olifre, if one bind-mounts /dev/ptmx from the host, and the host devpts is not accessible as /dev/pts in the container, any ptys newly allocated by applications in the container won't be visible. For example, running sshd or xterm in the container would fail because these processes wouldn't be able to open the slave (/dev/pts/X) devices that devpts creates for them (via /dev/ptmx): the /dev/pts/X device files would not exist. Bind-mounting the host /dev/pts would expose all host pty devices, which is undesirable with -C. That is why a new instance of devpts is being mounted in the container.
I agree it would be best to use a new pty when running with -C.
> I agree it would be best to use a new pty when running with -C.
Ok! So the only ways to go then would be to use `fakeroot` mode, i.e. map to uid and gid of 0 in the container, or to wait for the fix, correct?
If somebody wants to support my point in the RedHat bugtracker, let me know the mail address of your bugtracker account and I can CC you, then you should be able to access it. I've bumped it with a comment just today, I can let you all know here if there are any updates on it.
@olifre `fakeroot` won't help; `unshare -U -r -m mount -t devpts devpts /test` returns "invalid argument" on CentOS 7. There is nothing that can be done on the Singularity side; the kernel fix is definitely required.
@cclerget, you need to pass `newinstance` to the unshare mount above. The below works for me on CentOS 7.4:

```
[chris@c7vm ~]$ unshare -U -r -m
[root@c7vm ~]# id
uid=0(root) gid=0(root) groups=0(root),65534(nfsnobody)
[root@c7vm ~]# mount devpts test -tdevpts -omode=0620,newinstance,ptmxmode=0666,noexec,nosuid
[root@c7vm ~]# ls test
ptmx
```
Newer kernels got rid of the newinstance option (it is the default in newer kernels, so that parameter is ignored). Mounting a devpts without newinstance in older kernels is somewhat dangerous, as it can override the mount options of the original devpts mount. I believe that is why mounting devpts without that option isn't permitted in a user/mount namespace.
@olifre also confirmed that --fakeroot works in #1186. I can't seem to find the --fakeroot option in 2.6.0... I guess it was never merged into master, or was removed?
@DrDaveD, @olifre, @cclerget,
There are two somewhat separate issues in this ticket:
@hollowec thanks for the good summary of where we're at.
Just to make doubly-sure that bind-mounting /dev/pts/ptmx from the host won't work, I made a quick hack to a 2.6.0-with-underlay that created that bind mount point and then ran it with -u -c --ipc --pid -B /dev/pts/ptmx. Then I attempted to allocate a pty with mypty and it not only didn't show the device in /dev/pts, it failed with a Permission denied error. /dev/pts/ptmx became owned by the nobody user and group (65534). So that's for sure out.
I also looked at the 3.x master code for where the namespace uid/gid mapping is done and found that /proc/$$/uid_map is set up immediately after the call to unshare(CLONE_NEWUSER), in the same function, which is actually in low-level C (cmd/starter/c/starter.c). The user_namespaces man page says that uid_map may only be written to once. So singularity would have to go through quite a heroic & messy effort to support pseudo-tty allocation with unprivileged user namespaces on RHEL 7.x before RHEL backports the improvement, in order to set up this one device before mapping the uid.
This isn't a show-stopper for widespread use of unprivileged singularity on RHEL 7.x with htcondor, is it? If I understand correctly, it means that condor_ssh_to_job won't work, but I don't know how vital that is.
Assuming it is not a show-stopper: in summary, what we need from singularity-3 for this issue is to not die when devpts cannot be set up, and to allocate one pseudo-tty from the host for the container user when singularity is run from a tty.
In fact, the "heroic effort" I talked about isn't even possible. The unshare -r option sets up the uid_map so that your unprivileged user id is mapped to zero, but uid_map may only be written once, so it can't be changed again. The default before uid_map is set maps every uid to 65534. Even with a devpts newinstance created from that fake root, pseudo-tty allocation gets Permission denied.
A single pseudo-tty allocation from the host is probably not even possible as an unprivileged user after unshare has been run. Probably the allocation needs to be done before unshare.
Hi @DrDaveD,
As you mentioned, lack of pty allocation capability is not a showstopper for most people using HTCondor. It only affects those using "condor_submit -interactive", or condor_ssh_to_job, as you mentioned.
Is ptmxmode=0666 set in your flags for the new instance mount of devpts?
Also, /dev/ptmx and /dev/pts/ptmx on the CentOS host system are effectively the same device, but /dev/ptmx will have different permissions. You can try bind-mounting /dev/ptmx directly instead of /dev/pts/ptmx. Most applications use /dev/ptmx (not /dev/pts/ptmx) when allocating a pty.
Yes, I used the same flags as you except made the mount point /dev/pts. mypty still gave Permission denied.
I tried again just now and noticed that /dev/ptmx was still owned by 65534 even though /dev/pts/ptmx wasn't. So I did mount --bind /dev/pts/ptmx /dev/ptmx to eliminate that. mypty still gave Permission denied.
I didn't know about the different permissions of /dev/ptmx and /dev/pts/ptmx, but singularity symlinks /dev/pts/ptmx to /dev/ptmx, and I had tried temporarily making /dev/pts/ptmx mode 666 when testing bind mounting with the hacked singularity. So now I also tried, as you suggested, to use -B /dev/ptmx with my hacked singularity-2.6.0 (with the bind mount point there instead of at /dev/pts/ptmx). This time it gave an error 2 (No such file or directory) instead of 13 (Permission denied). (Now I'm not sure if I might have gotten that error before and not noticed that the number had changed from 13.) With strace I can see it opening /dev/ptmx successfully, but then it fails after a call to statfs. The posix_openpt source code shows it checking whether it is a devpts-type filesystem and otherwise returning ENOENT, so that's what happened. If I change it to just open /dev/ptmx directly, then it fails on the grantpt with error 2 because it can't open the new /dev/pts/N device. So it is just as you said it would be: it's not usable inside the container because of the missing /dev/pts devices.
Thanks @hollowec! Ok, so that works as expected with fake root within a user namespace: `singularity shell -c -u --fakeroot dir`. I should have tried this first :/ ...
> This isn't a show-stopper for widespread use of unprivileged singularity on RHEL 7.x with htcondor, is it?
It is for us, and probably other sites with many local users. My reply in the following is a bit off-topic, but since you raised the point, let me explain why. Anybody not interested, please ignore the rest of this comment ;-).
One huge advantage of containerisation is that users can choose their environment, and one large advantage of workload management systems is that they take care of scheduling jobs to a large set of heterogeneous resources. So the way to go is to combine both: expose heterogeneous resources with containers to users, ideally granting them the very same environment to develop their code and to actually run the jobs. The only clean way to do that is to give them access to the very same environment (=containers) for both development and execution. Many users can't be (and should not be!) bothered with setting up their own container, or even choosing it. They want to choose the "HEP-ready" runtime environment matching their use case, which will usually be SL6 with HEPOSlibs, or CentOS7 with HEPOSlibs. So what we do at our site is offer them a ClassAd attribute in their HTCondor Job Description file, with which they can choose between those. They can then either submit an interactive job (to do development, including graphical applications) or a batch job, and get the very same environment for both use cases. With an interactive job, they can access large nodes with 128 GB of memory and 56 cores, if needed.
Furthermore, we have users from other fields (theory, applied physics, optics, condensed matter, hadron physics, numerics...) who may want to use something more modern instead of those legacy OSes. They can choose Ubuntu 18.04, which matches what we give them on their centrally managed desktop machines.
These heterogeneous use cases are very common at T2/T3 sites and University or institute clusters. For all these, there are three ways to go:
So yes, for us it is a show-stopper. But it's not as bad as it seems, for several reasons:
One way is to have `sshd` run outside of the container. While I love that there is a working singularity support layer right now in HTCondor, contributed by @bbockelm, it has its flaws due to the way HTCondor does things right now. This can only be fixed by restructuring in HTCondor, so that will take some time. Ideally, `sshd` will run outside and talk to the inside of the container by joining the namespaces via `nsenter`. So what do we do right now? We run privileged on a platform which fully supports user namespaces, to support running sshd in the containers. That's a risk we have to take to support a modern setup with many local users. A larger site (T2, T1) can probably afford to go another way, since they are running more homogeneous grid jobs, and they just push local users out to institute or University clusters, or to T3s. Smaller sites with a large variety of users running HTCondor can profit hugely from using interactive jobs with containers.
So in short, I think this feature is critical, but there are several ways to solve it - and CentOS 8 is not too far away. I'm still anxious about the OpenSSH guys not responding anymore, but maybe HTCondor will surprise me and implement something sooner than expected.
@DrDaveD : And thanks for supporting my point in the RHEL bugtracker, that's much appreciated!
The discussion at https://bugzilla.mindrot.org/show_bug.cgi?id=2813 has got some drive again.
A workaround has been found: not having the group `tty` in the `/etc/group` file mapped into the container prevents `sshd` from enforcing that the `pts` devices are owned by the `tty` group.
I am unsure if this workaround would also help with other tools (see e.g. https://github.com/hpc/charliecloud/issues/67 ).
Hi @DrDaveD,
> The default before uid_map is set makes every uid map to 65534. Even with that devpts newinstance created from a fake root, pseudo-tty allocation gets Permission denied.
The above is failing with a devpts newinstance due to mypty's call to grantpt(). All grantpt() does is perform a chown() and chmod() of the slave pts device. devpts is effectively supposed to do this for you without needing the grantpt() call. grantpt() is attempting to chown() the pts device to the tty group, which will fail when running unprivileged (that is also why gid=tty isn't passed as a devpts mount option when running unprivileged). Some applications like xterm don't care if grantpt() fails, or don't call it. xterm gives a warning under CentOS7, but still executes normally. I removed the call to grantpt() in mypty and it functions:

```
[chris@c7vm ~]$ id
uid=1000(chris) gid=1000(chris) groups=1000(chris),10(wheel)
[chris@c7vm ~]$ unshare -U -r -m --ipc --pid -f
[root@c7vm ~]# id
uid=0(root) gid=0(root) groups=0(root),65534(nfsnobody)
[root@c7vm ~]# mkdir -p ./dev/pts
[root@c7vm ~]# mount -tdevpts -orw,nosuid,noexec,newinstance,ptmxmode=0666,mode=0620 devpts ./dev/pts
[root@c7vm ~]# ln -sfn /dev/pts/ptmx ./dev/ptmx
[root@c7vm ~]# touch ./dev/null
[root@c7vm ~]# touch ./dev/tty
[root@c7vm ~]# mount -B /dev/null ./dev/null
[root@c7vm ~]# mount -B /dev/tty ./dev/tty
[root@c7vm ~]# mount -R ./dev /dev
[root@c7vm ~]# ls /dev
null  ptmx  pts  tty
[root@c7vm ~]# ls /dev/pts
ptmx
[root@c7vm ~]# ./mypty
total 0
crw-rw-rw- 1 root root 5, 2 Dec  3 11:50 ptmx
total 0
crw--w---- 1 root root 136, 0 Dec  3 11:51 0
crw-rw-rw- 1 root root 5, 2 Dec  3 11:50 ptmx
The slave side is named : /dev/pts/0
[root@c7vm ~]# echo $DISPLAY
:0
[root@c7vm ~]# xterm
xterm: cannot load font '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
xterm: Cannot chown /dev/pts/0 to 0,5: Invalid argument
```

... but xterm still starts.
I can't test with Singularity 2.6.0 on CentOS 7, because it seems --fakeroot didn't make it into that release.
If the container doesn't have the tty group, as @olifre suggested, it's possible grantpt() won't fail. Another possibility would be to make the tty group gid in the container the same as the user's gid, but that starts getting messy.
You're correct that the uid_map file can only be written to once, so I think the kernel fix will definitely be needed to allow a non-root user in a user namespace to mount devpts on CentOS7. @olifre, feel free to add me as a CC to your RH bugzilla report (you can find my address in the contributors file).
@DrDaveD @hollowec Ok, so to summarize for the `--contain` case: trap the `/dev/pts` mount error and mount `/dev/pts/X` into the container instead, without `/dev/ptmx` because that won't make sense. For the missing `/dev/console`: create `/dev/console` and bind mount it with target `/proc/self/fd/0` (if it's a terminal).

@DrDaveD After thinking more about it, for terminal allocation with `--contain` there are a few considerations. While I agree it's necessary for correct isolation, in Singularity's case it would break some use cases and lead to issues/complexity unless singularity gains an I/O stream handler to handle command pipes, for example, or just with the use case you mentioned. Also, if one shell spawns multiple singularity images from the same tty:

```
singularity exec -c image prog1 &
singularity exec -c image prog2 &
```

The first one allocates a terminal and replaces the current terminal with the new one; then the second command executes and overrides the current terminal (just created by the first command) with its own for the current shell session, and consequently the first terminal and its output/error streams are lost. For those isolation use cases it may be safer and more flexible to do this from the calling program: you can ask for a new terminal for each exec instance, pass the slave side to singularity, and keep terminal control with the master side in your program. That avoids extra complexity like passing the master file descriptor over a unix socket; I think this approach is a lot more flexible and simpler.
> The first one allocates a terminal and replaces the current terminal with the new one; then the second command executes and overrides the current terminal (just created by the first command) with its own for the current shell session, and consequently the first terminal and its output/error streams are lost.
No, I'm not talking about replacing the current terminal in the parent process. I am only talking about allocating a new one to use for the container only. Allocating a new one doesn't affect the parent session. Right?
@DrDaveD so you mean allocate a terminal, not care about stdout/stderr, and not interact with it?
No, I mean attach the stdout/stderr of only the container to the new pty. A parent process will need to stay around to read the master side of the pty and send the data back and forth between its stdout/stderr, like in mypty2. I'm not sure if a new process needs to be forked specifically for that or if the existing "Singularity runtime parent" process can take on the additional role. Probably it can take the additional role.
The only way to read from the master side is for your program to do the terminal allocation; if Singularity does it, you would need to get the master file descriptor via a unix socket. So, like I said previously, it would be more flexible and simpler to do that on your side rather than asking Singularity to do terminal allocation.
Looking at mypty3 helps with thinking through more of the details. Run "mypty3 bash" as an example. mypty3 takes the unusual approach of leaving the parent's tty setting alone and setting the new pty in raw mode. That's not what we want. When we allocate a new pty, we want the parent's own tty to be put into raw mode, and when the child exits, we restore it. This is a lot like what programs do that use raw mode, such as vi. I made those modifications in a mypty4.c.
Ok, but then the problem is what happens when you run a job in the background. In that case you don't want to set raw mode on the parent's tty, until you're moved to the foreground. Then I believe you need to work with signal handlers. I will look into that but at the moment I need to do some other things.
@cclerget It's much too complicated to expect an application program to do pseudo tty allocation. It has to be done in singularity.
Ok @cclerget, I have done quite a bit of looking into this. First I made one solution mypty5.c where I attempted to make it detect when it was in the background & foreground, and to behave correctly with ttys as it moves back and forth. I got pretty close, but at least with bash in edit mode it ends up echoing an extra command when you move the process from background to foreground.
Instead, I recommend a second solution mypty6.c which only allocates a pseudo-tty if the process starts out in the foreground, and instead allocates pipes to isolate the process that is run from the parent tty. I think that's the best compromise. My intention is for singularity to use logic like this whenever --contain is used. If someone starts out with --contain in the background, it simply won't have access to a tty, which I think is a reasonable compromise. If someone really makes a case for it we could add back something like a docker -t option for this case in the future.
What do you think?
Now I have cleaned up mypty6.c and tested it with various combinations of background, foreground, and redirects. I think it's pretty complete.
After having gone through this exercise, and stepping back and thinking at a higher level, I don't think that having the pty separation and allocation of mypty6.c for --contain is a priority for science grid users, because the vast majority of our production use cases run with no tty. Mostly our reason for using --contain with a tty is development and testing, where having real separation isn't critical. Perhaps the @olifre case of condor_submit -i is a good production use case, but I'm not sure how that is implemented; I would think condor would have to allocate a pty first before starting singularity, and then it would be a different pty for each job anyway. There may be some other good production use cases in HPC environments.
The bottom line is that what's most important in this issue is the original problem that I reported, that unprivileged singularity breaks on EL7. I think the minimal solution is to take the approach I took with my 2.6.0 underlay patch.
> Mostly our reason for using --contain with a tty is development and testing, where having real separation isn't critical.
When including users developing code at smaller tiers, I would say that separation is critical especially for development and testing of freshly developed code. So yes, this is not "use for production" / Grid jobs, but it's a heavy use case for end users at sites, and the operators offering singularity to their users may insist on the best separation possible. The larger sites with huge throughput do not see much of this, since the creative end users are all piling up at smaller tiers and are being very inventive in breaking things there :wink:. Of course, for tests performed by us admins / the IT crowd, separation is less critical, I agree.
> but I'm not sure how that is implemented; I would think condor would have to allocate a pty first before starting singularity, and then it would be a different pty for each job anyway.
I don't believe it allocates any pty right now. It just executes sshd within the container, and sshd tries to take care of it. Running sshd "outside" is something I have suggested, though. An alternative using `nsenter`, and/or doing something like your approach with mypty, was on their developers' minds too, I think.
> The bottom line is that what's most important in this issue is the original problem that I reported, that unprivileged singularity breaks on EL7. I think the minimal solution is to take the approach I took with my 2.6.0 underlay patch.
I fully agree! This is definitely the highest priority here.
Version of Singularity:
3.0.1
Expected behavior
Running unprivileged with -c/--contain works
Actual behavior
Fails with fatal error
Steps to reproduce behavior
On a centos 7.x system do:
There has been extensive discussion on related issues on singularity-2.x. See especially #1186 which refers to most of the rest of the discussion and was never fully resolved. Creating /dev/pts on RHEL 7.x is apparently a privileged operation, although it isn't in later fedora kernels.
3.0.1 and the current master branch attempt to implement the same behavior as 2.6.0, but there's a flaw in the conversion to delayed mounts. The code checks the return value of system.Points.AddFS(..."devpts"...) for failure and attempts to ignore it, but that call doesn't fail because it just queues the request to be performed later. It only fails later, when the mount is actually attempted, and then the failure is treated as fatal. In my 2.6.0 underlay implementation I created a "/dev" space for --contain and immediately mounted the devices inside it, delaying only the mounting of "/dev" itself; 3.0.x instead defers the mounting of each device separately.
Perhaps the answer is to use my 2.6.0 approach in the new code, but that does not resolve the issue in #1186. What do people think about the following: when using --contain in unprivileged mode from a tty, on all operating systems, bind mount /dev/pts/ptmx from the host, allocate a new pseudo-tty, and bind mount the /dev/pts/N that is allocated? I don't see a security problem with that. This go code would probably be helpful. The docker run -t option allocates a new pseudo-tty on the host and maps it to /dev/console in the container.
Oh, and I just noticed another problem with privileged mode --contain and a tty: it does not allocate a new tty. It just makes a new /dev/pts/ptmx and keeps the original tty, but it does not map the original tty (e.g. /dev/pts/0) into the container. That's another flaw: you're supposed to be able to re-open the device under the name reported by `tty`. Also, if one shell spawns multiple singularity images from the same tty, they will share the same tty, which does not give the complete isolation desired. I think a new pseudo-tty should be allocated with --contain whether privileged or not.
I notify here the people who were involved in the #1186 discussion: @olifre @hollowec @cclerget