Unprivileged --contain fails mounting devpts on centos 7.x

DrDaveD commented 5 years ago

Version of Singularity:

3.0.1

Expected behavior

Running unprivileged with -c/--contain works

Actual behavior

Fails with fatal error

Steps to reproduce behavior

On a centos 7.x system do:

$ singularity exec -u -c sandbox-image bash
FATAL:   container creation failed: mount error: can't mount devpts filesystem to /var/singularity/mnt/session/dev/pts: invalid argument

There has been extensive discussion on related issues on singularity-2.x. See especially #1186 which refers to most of the rest of the discussion and was never fully resolved. Creating /dev/pts on RHEL 7.x is apparently a privileged operation, although it isn't in later fedora kernels.

3.0.1 and the current master branch attempts to implement the same behavior as 2.6.0 but there's a flaw in the conversion to delayed mounts. It checks the return value from system.Points.AddFS(..."devpts"...) for failure and attempts to ignore it, but that call doesn't fail because it just queues the request to be performed later. It only fails later when it actually attempts the mount, and then it is treated as a fatal error. In my 2.6.0 underlay implementation I created a "/dev" space for --contain and immediately mounted the devices inside it, and only delayed the mounting of "/dev". 3.0.x instead defers the mounting of each device separately.

Perhaps the answer is to use my 2.6.0 approach in the new code, but that does not resolve the issue in #1186. What do people think about when using --contain in unprivileged mode from a tty, on all operating systems bind mount /dev/pts/ptmx from the host, allocate a new pseudo tty, and bind mount the /dev/pts/N that is allocated? I don't see a security problem with that. This go code would probably be helpful. The docker run -t option allocates a new pseudo tty on the host and maps it to /dev/console in the container.

Oh, and I just noticed another problem with privileged mode --contain and tty: it does not allocate a new tty, it just makes a new /dev/pts/ptmx and keeps the original tty, but it does not map the original tty (e.g. /dev/pts/0) into the container. That's another flaw, you're supposed to be able to re-open the device under the name given by tty. Also if one shell spawns multiple singularity images from the same tty, they will share the same tty, which does not give the complete isolation desired. I think a new pseudo-tty should be allocated with --contain whether privileged or not privileged.

I notify here the people who were involved in the #1186 discussion: @olifre @hollowec @cclerget

cclerget commented 5 years ago

Hi @DrDaveD , for /dev/pts issue and user namespace, the problem seems to come from newinstance,ptmxmode not allowed within a user namespace. To start a discussion, a possible solution :

remove newinstance,ptmxmode when using user namespace and in this case instead of creating symlink /dev/ptmx -> /dev/pts/ptmx, just bind host /dev/ptmx into container, so terminal allocation will work from container with the counterpart to share indices with host but without exposing other allocated terminals. Thoughts ?

Regarding terminal allocation for each shell/exec/run/start command, I totally agree and noticed that during OCI engine development. That will be fixed and /dev/console bind too.

olifre commented 5 years ago

@cclerget That appears to be the only option. However, seeing that new kernels have support for that ( https://github.com/torvalds/linux/commit/e98d41370392dbc3e94c8802ce4e9eec9efdf92e#diff-6d314687c035fb9da87d65b2b9ffb698L281 ) and that RHEL is not unwilling to backport this to RHEL 7 ( https://bugzilla.redhat.com/show_bug.cgi?id=1522992 ) this should probably only be done as a fallback. If a future RHEL 7 release (and maybe RHEL 8) support mounting devpts without fakeroot, we should take advantage of that (and distros with modern kernels already have that).

For some tools like sshd, there's sadly still https://bugzilla.mindrot.org/show_bug.cgi?id=2813 , so it won't help with those until the tools are adapted. But it would be a start.

DrDaveD commented 5 years ago

I think that binding /dev/ptmx from the host might have the same effect and there's no good reason to try to recreate it or to wait for the RHEL 7 bug fix. It will take a little testing I guess.

olifre commented 5 years ago

I think that binding /dev/ptmx from the host might have the same effect

If it has, I'm for it. It's just curious that a kernel feature exists, so I wonder if we are either missing something (may become visible in testing) or there is a security issue (which I also do not see, though).

cclerget commented 5 years ago

@DrDaveD @olifre After some tests with centos 7.4, devpts mount no matter options passed, always returns invalid argument with user namespace, I tested on Ubuntu with ptmx bind mount and that doesn't work.

hollowec commented 5 years ago

Hi @cclerget, I tried mounting devpts without "newinstance,ptmxmode" as a non-root user in a user namespace on CentOS7: I'm also still not able to mount. I believe the kernel patch @olifre mentioned is necessary for this to work.

@DrDaveD, @olifre, if one bindmounts /dev/ptmx from the host, and the host devpts is not accessible as /dev/pts in the container, any ptys newly allocated by applications in the container won't be visible. For example, running sshd or xterm in the container would fail because these processes wouldn't be able open their slave (/dev/pts/X) devices that devpts creates for them (via /dev/ptmx): the /dev/pts/X device files would not exist. Bind mounting the host /dev/pts would expose all host pty devices, which is undesirable with -C. That is why a new instance of devpts is being mounted in the container.

I agree it would be best to use a new pty when running with -C.

olifre commented 5 years ago

I agree it would be best to use a new pty when running with -C.

Ok! So the only ways to go then would be to use fakeroot mode, i.e. map to uid and gid of 0 in the container, or to wait for the fix, correct? If somebody wants to support my point in the RedHat bugtracker, let me know the mail address of your bugtracker account and I can CC you, then you should be able to access it. I've bumped it with a comment just today, I can let you all know here if there are any updates on it.

cclerget commented 5 years ago

@olifre fakeroot won't help, unshare -U -r -m mount -t devpts devpts /test returns invalid argument on CentOS 7, there is nothing that can be done on Singularity side, the kernel fix is definitively required.

hollowec commented 5 years ago

@cclerget, you need to pass "newinstance" to the unshare mount above. The below works for me on CentOS 7.4: [chris@c7vm ~]$ unshare -U -r -m [root@c7vm ~]# id uid=0(root) gid=0(root) groups=0(root),65534(nfsnobody) [root@c7vm ~]# mount devpts test -tdevpts -omode=0620,newinstance,ptmxmode=0666,noexec,nosuid [root@c7vm ~]# ls test ptmx

Newer kernels got rid of the newinstance option (it is the default in newer kernels, so that parameter is ignored). Mounting a devpts without newinstance in older kernels is somewhat dangerous, as it can overide the mount options for the original devpts mount. I believe that is why mounting devpts without that option isn't permitted for in a user/mount namespace.

@olifre also confirmed that --fakeroot works in #1186. I can't seem to find the --fakeroot option in 2.6.0... I guess it was never merged into master, or was removed?

hollowec commented 5 years ago

@DrDaveD, @olifre, @cclerget,

There are two somewhat separate issues in this ticket:

devpts cannot be mounted by an unprivileged user in a user namespace on CentOS 7, and Singularity 3.x no longer ignores that failure: Allowing unprivileged mounts of devpts cannot be fixed without the kernel patch @olifre mentioned. However, one possibility to allow devtps to be mounted without this patch would be for Singularity to start out as the uid 0 user in the user/mount namespace, and change to the user uid after mounting devpts. I'm not certain if this is possible, or how difficult it would be to implement. The original 2.6.0 devtpts mount behavior should probably be replicated (ignore EINVAL) for now, until this can be implemented, or the CentOS7 kernel is patched.
You are requesting that Singularity create a new pty for the shell/process it executes in the container, when invoked with -C: The ability to mount devpts in a container when using -C was added to address #857. The goal was to allow applications that allocate their own pty devices (such as sshd in #857, in order to support "condor_submit -interactive") to function. It should be possible to use the devpts mount in the container to support your requested functionality. Having Singularity create a new pty for the process it executes would require modifying Singularity to handle the I/O to/from that pty, which may be somewhat involved. If devpts can not be mounted in CentOS 7 when user namespaces are being used, and isolated stdin/stdout/stderr pty functionality is still desired, I agree it may be possible to just allocate a pty in the host and bindmount it into the container in that specific case. Assuming Singularity opens /dev/ptmx on the host before changing mount namespaces, it should not be necessary to bind mount that device in the container, just /dev/pts/X, in order to have an isolated pty for the process Singularity executes.

DrDaveD commented 5 years ago

@hollowec thanks for the good summary of where we're at.

Just to make doubly-sure that bind-mounting /dev/pts/ptmx from the host won't work, I made a quick hack to a 2.6.0-with-underlay that created that bind mount point and then ran it with -u -c --ipc --pid -B /dev/pts/ptmx. Then I attempted to allocate a pty with mypty and it not only didn't show the device in /dev/pts, it failed with a Permission denied error. /dev/pts/ptmx became owned by the nobody user and group (65534). So that's for sure out.

I also looked at the 3.x master code for where the namespace uid/gid mapping is done and found that /proc/$$/uid_map is set up immediately after the call to unshare(CLONE_NEWUSER), in the same function, which is actually in low-level C (cmd/starter/c/starter.c). The user_namespaces man page says that uid_map may only be written to once. So singularity would have to go through quite a heroic & messy effort to support pseudo-tty allocation with unprivileged user namespaces on RHEL 7.x before RHEL backports the improvement, in order to set up this one device before mapping the uid.

This isn't a show-stopper for widespread use of unprivileged singularity on RHEL 7.x with htcondor, is it? If I understand correctly, it means that condor_ssh_to_job won't work, but I don't know how vital that is.

Assuming it is not a show-stopper, in summary what we need from singularity-3 for this issue is to change it to not die when devpts cannot be set up and to allocate one pseudo-tty from the host for the container user when singularity is run from a tty.

DrDaveD commented 5 years ago

In fact, the "heroic effort" I talked about isn't even possible. unshare -r option sets up the uid_map so that your unprivileged user id is mapped to zero, but uid_map may only be written to once so it can't be changed again. The default before uid_map is set makes every uid map to 65534. Even with that devpts newinstance created from a fake root, pseudo-tty allocation gets Permission denied.

A single pseudo-tty allocation from the host is probably not even possible as an unprivileged user after unshare has been run. Probably the allocation needs to be done before unshare.

hollowec commented 5 years ago

Hi @DrDaveD,

As you mentioned, lack of pty allocation capability is not a showstopper for most people using HTCondor. It only affects those using "condor_submit -interactive", or condor_ssh_to_job, as you mentioned.

Is ptmxmode=0666 set in your flags for the new instance mount of devpts?

hollowec commented 5 years ago

Also, /dev/ptmx and /dev/pts/ptmx on the CentOS host system are effectively be the same device, but /dev/ptmx will have different permissions. You can try bindmounting /dev/ptmx in directly instead of /dev/pts/ptmx. Most applications use /dev/ptmx (not /dev/pts/ptmx) when allocating a pty.

DrDaveD commented 5 years ago

Yes, I used the same flags as you except made the mount point /dev/pts. mypty still gave Permission denied.

I tried again just now and noticed that /dev/ptmx was still owned by 65534 even though /dev/pts/ptmx wasn't. So I did mount --bind /dev/pts/ptmx /dev/ptmx to eliminate that. mypty still gave Permission denied.

I didn't know about the different permissions of /dev/ptmx and /dev/pts/ptmx, but singularity symlinks /dev/pts/ptmx to /dev/ptmx and I had tried temporarily making /dev/pts/ptmx mode 666 when testing bind mounting to the hacked singularity. So now I also tried as you suggested to use -B /dev/ptmx with my hacked singularity-2.6.0 (with the bind mount point there instead of at /dev/pts/ptmx). This time it gave a error 2 (No such file or directory) instead of 13 (Permission denied). (Now I'm not sure if I might have gotten that error before and not noticed that the number had changed from 13). With strace I can see it opening /dev/ptmx successfully but then it fails after a call to statfs. The posix_openpt source code shows it checking to see if it is a devpts type filesystem and otherwise returning ENOENT, so that's what happened. If I change it to just open /dev/ptmx directly, then it fails on the grantpt with error 2 because it can't open the new /dev/pts/N device. So it is just as you said it would be: it's not usable inside the container because of the missing /dev/pts devices.

cclerget commented 5 years ago

Thanks @hollowec ! Ok so that work as expected with fake root within user namespace : singularity shell -c -u --fakeroot dir, I should have tried this first :/ ...

olifre commented 5 years ago

This isn't a show-stopper for widespread use of unprivileged singularity on RHEL 7.x with htcondor, is it?

It is for us, and probably other sites with many local users. My reply in the following is a bit off-topic, but since you raised the point, let me explain why. Anybody not interested, please ignore the rest of this comment ;-).

One huge advantage of containerisation is that users can choose their environment, and one large advantage of workload management systems is that they take care of scheduling jobs to a large set of heterogeneous resources. So the way to go is to combine both: Expose heterogeneous resources with containers to users, ideally granting them the very same environment to develop their code and to actually run the jobs. The only clean way to do that is to give them access to the very same environment (=containers) for both development and execution. Many users can't be (and should not be!) bothered with setting up their own container, or even choosing it. They want to choose the "HEP-ready" runtime environment matching their usecase, which will usually be SL6 with HEPOSlibs, or CentOS7 with HEPOSlibs. So what we do at our site is offer them a classad attribute in their HTCondor Job Description file, with which they can choose between those. They can then either submit an interactive job (to do development, including graphical applications) or a batch job, and get the very same environment for both usecases. With an interactive job, they can access large nodes with 128 GB of memory and 56 cores, if needed.

Furthermore, we have users from other fields (theory, applied physics, optics, condensed matter, hadron physics, numerics...) who may want to use something more modern instead of those legacy OSes. They can choose Ubuntu 18.04, which matches what we give them on their centrally managed desktop machines.

These heterogeneous usecases are very common at T2/T3 sites, University or institute clusters. For all these, there are three ways to go:

The classic way: Have bare metal / virtual "login nodes" for each environment. It's hard, though, to get them to be the very same as the container environments. They are also often misused for long-running jobs and cause headaches for site admins.
Auto-invoke Singularity when users ssh into the login nodes. I think this has been under discussion at least for LIGO users. That's still not perfect in my opinion, since you still have "login nodes".
Just leverage the same infrastructure you use for the jobs and let the workload management system handle things. That's the way we chose, but it requires interactive jobs to work.

So yes, for us it is a show-stopper. But it's not as bad as it seems, for several reasons:

The fix is in the upstream kernel, and I am pretty sure RHEL 8 will have it. We can update to that as soon as it comes out, since everything users see is inside containers. So ideally, they won't even notice if we perform the upgrade, for example next year.
I talked to the HTCondor devs at the last European workshop. They agreed what's currently done there is non-ideal. The plan is to:
- Have OCI support.This allows to use container runtimes fully supported by a larger community and even available out of the box with full support on RHEL 8 (runc). It will also help to not have separate code paths for every single implementation, as is the case now.
- Have sshd run outside of the container. While I love that there is a working singularity support layer right now in HTCondor contributed by @bbockelm , it has it's flaws due to the way HTCondor does things right now. This can only fixed by restructuring in HTCondor, so that will take some time. Ideally, sshd will run outside and talk to the inside of the container by joining the namespaces via nsenter.

So what do we do right now? We run privileged on a platform which fully supports user namespaces, to support running sshd in the containers. That's a risk we have to take to support a modern setup with many local users. A larger site (T2, T1) can probably afford to go another way, since they are running more homogeneous grid jobs, and they just push local users out to institute or University clusters, or to T3s. Smaller sites with a large variety of users and running HTCondor can profit hugely from using interactive jobs with containers.

So in short, I think this feature is critical, but there are several ways to solve it - and CentOS 8 is not too far away. I'm still anxious about the OpenSSH guys not responding anymore, but maybe HTCondor will surprise me and implement something sooner than expected.

olifre commented 5 years ago

@DrDaveD : And thanks for supporting my point in the RHEL bugtracker, that's much appreciated!

olifre commented 5 years ago

The discussion at https://bugzilla.mindrot.org/show_bug.cgi?id=2813 has got some drive again. A workaround has been found: Not having the group tty in the /etc/group file mapped into the container prevents sshd from enforcing that the pts devices are owned by the tty group. I am unsure if this workaround would also help with other tools (see e.g. https://github.com/hpc/charliecloud/issues/67 ).

hollowec commented 5 years ago

Hi @DrDaveD,

The default before uid_map is set makes every uid map to 65534. Even with that devpts newinstance created from a fake root, pseudo-tty allocation gets Permission denied.

The above is failing with a devpts newinstance due to mypty's call to granpt(). All granpt() doess is perform a chown() and chmod() of the slave pts device. devpts is effectively supposed to do this for you without needing the grantpt() call. grantpt() is attempting to chown() the pts device to the tty group, which will fail when running unprivileged (that is also why gid=tty isn't passed as a devpts mount option when running unprivileged). Some applications like xterm don't care if granpt() fails, or don't call it. xterm gives a warning under CentOS7, but still executes normally. I removed the call to grantpt() in mypty and it functions: [chris@c7vm ~]$ id uid=1000(chris) gid=1000(chris) groups=1000(chris),10(wheel) [chris@c7vm ~]$ unshare -U -r -m --ipc --pid -f [root@c7vm ~]# id uid=0(root) gid=0(root) groups=0(root),65534(nfsnobody) [root@c7vm ~]# mkdir -p ./dev/pts [root@c7vm ~]# mount -tdevpts -orw,nosuid,noexec,newinstance,ptmxmode=0666,mode=0620 devpts ./dev/pts [root@c7vm ~]# ln -sfn /dev/pts/ptmx ./dev/ptmx [root@c7vm ~]# touch ./dev/null [root@c7vm ~]# touch ./dev/tty [root@cl7vm ~]# mount -B /dev/null ./dev/null [root@c7vm ~]# mount -B /dev/tty ./dev/tty [root@c7vm ~]# mount -R ./dev /dev [root@c7vm ~]# ls /dev null ptmx pts tty [root@c7vm ~]# ls /dev/pts ptmx [root@cl7vm ~]# ./mypty total 0 crw-rw-rw- 1 root root 5, 2 Dec 3 11:50 ptmx total 0 crw--w---- 1 root root 136, 0 Dec 3 11:51 0 crw-rw-rw- 1 root root 5, 2 Dec 3 11:50 ptmx The slave side is named : /dev/pts/0 [root@c7vm ~]# echo $DISPLAY :0 [root@c7vm ~]# xterm xterm: cannot load font '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1' xterm: Cannot chown /dev/pts/0 to 0,5: Invalid argument ... but xterm still starts.

I can't test with Singularity 2.6.0 on CentOS 7, because it seems --fakeroot didn't make it into that release.

If the container doesn't have the tty group, as @olifre suggested, it's possible grantpt() won't fail. Another possibility would be to make the tty group gid in the container the same as the user's gid, but that starts getting messy.

You're correct that the uid_map file can only be written to once, so I think the kernel fix will definitely be needed to allow a non-root user in a user namespace to mount devpts on CentOS7. @olifre, feel free to add me as a CC to your RH bugzilla report (you can find my address in the contributors file).

cclerget commented 5 years ago

@DrDaveD @hollowec Ok so to resume for --contain case :

for buggy user namespace, ignore /dev/pts mount error and mount /dev/pts/X into container instead and without /dev/ptmx because that won't make sense

For missing /dev/console :

add /dev/console and bind mount with target of /proc/self/fd/0 (if it's a terminal)

@DrDaveD After thinking more about it, for terminal allocation with --contain there is few considerations , while I agree it's necessary for a correct isolation, in case of Singularity that would break some use cases and lead to issues/complexity without involving io stream handler within singularity to handle command pipe by example, or just with the use case you mentioned

Also if one shell spawns multiple singularity images from the same tty

singularity exec -c image prog1 &
singularity exec -c image prog2 &

The first one allocate a terminal and replace the current terminal with the new one, then the second command is executed and override the current terminal (just created by first command) with the new one for the current shell session, and consequently lose the first terminal and its output/error streams. For those isolation use cases it may be safer and flexible to do that from calling program rather, like that you can ask a new terminal for each exec instance and pass slave part to singularity and keep terminal control with master side from your program. So no need to involve extra complexity like passing master file descriptor over unix socket, I think this approach is a lot more flexible and simpler.

DrDaveD commented 5 years ago

The first one allocate a terminal and replace the current terminal with the new one, then the second command is executed and override the current terminal (just created by first command) with the new one for the current shell session, and consequently lose the first terminal and its output/error streams.

No, I'm not talking about replacing the current terminal in the parent process. I am only talking about allocating a new one to use for the container only. Allocating a new one doesn't affect the parent session. Right?

cclerget commented 5 years ago

@DrDaveD so you mean allocate a terminal and don't care about stdout/stderr and doesn't interact with ?

DrDaveD commented 5 years ago

No, I mean attach the stdout/stderr of only the container to the new pty. A parent process will need to stay around to read the master side of the pty and send the data back and forth between its stdout/stderr, like in mypty2. I'm not sure if a new process needs to be forked specifically for that or if the existing "Singularity runtime parent" process can take on the additional role. Probably it can take the additional role.

cclerget commented 5 years ago

The only way to read from master side is that your program do terminal allocation or if this is Singularity you will need to get master file descriptor via unix socket, so like I said previously, it will be more flexible and simpler to do that on your side rather than asking Singularity to do terminal allocation.

DrDaveD commented 5 years ago

Looking at mypty3 helps with thinking through more of the details. Run "mypty3 bash" as an example. mypty3 takes the unusual approach of leaving the parent's tty setting alone and setting the new pty in raw mode. That's not what we want. When we allocate a new pty, we want the parent's own tty to be put into raw mode, and when the child exits, we restore it. This is a lot like what programs do that use raw mode, such as vi. I made those modifications in a mypty4.c.

Ok, but then the problem is what happens when you run a job in the background. In that case you don't want to set raw mode on the parent's tty, until you're moved to the foreground. Then I believe you need to work with signal handlers. I will look into that but at the moment I need to do some other things.

DrDaveD commented 5 years ago

@cclerget It's much too complicated to expect an application program to do pseudo tty allocation. It has to be done in singularity.

DrDaveD commented 5 years ago

Ok @cclerget, I have done quite a bit of looking into this. First I made one solution mypty5.c where I attempted to make it detect when it was in the background & foreground, and to behave correctly with ttys as it moves back and forth. I got pretty close, but at least with bash in edit mode it ends up echoing an extra command when you move the process from background to foreground.

Instead, I recommend a second solution mypty6.c which only allocates a pseudo-tty if the process starts out in the foreground, and instead allocates pipes to isolate the process that is run from the parent tty. I think that's the best compromise. My intention is for singularity to use logic like this whenever --contain is used. If someone starts out with --contain in the background, it simply won't have access to a tty, which I think is a reasonable compromise. If someone really makes a case for it we could add back something like a docker -t option for this case in the future.

What do you think?

DrDaveD commented 5 years ago

Now I have cleaned up mypty6.c and tested it with various combinations of background, foreground, and redirects. I think it's pretty complete.

DrDaveD commented 5 years ago

After having gone through this exercise, and stepping back and thinking at a higher level, I don't think that having the pty separation and allocation of mypty6.c for --contain is a priority for science grid users, because the vast majority of our production use cases run with no tty. Mostly our reasons for using --contain with a tty is for development and testing where having real separation isn't critical. Perhaps the @olifre case of condor_submit -i is a good production use case, but I'm not sure how that is implemented; I would think condor would have to allocate a pty first before starting singularity, and then it would be a different pty for each job anyway. There may be some other good production use cases in HPC environments.

The bottom line is that what's most important in this issue is the original problem that I reported, that unprivileged singularity breaks on EL7. I think the minimal solution is to take the approach I took with my 2.6.0 underlay patch.

olifre commented 5 years ago

Mostly our reasons for using --contain with a tty is for development and testing where having real separation isn't critical.

When including users developing code at smaller tiers, I would say that separation is critical especially for development and testing of freshly developed code. So yes, this is not "use for production" / Grid jobs, but it's a heavy use case for end users at sites, and the operators offering singularity to their users may insist on the best separation possible. The larger sites with huge throughput do not see much of this, since the creative end users are all piling up at smaller tiers and are being very inventive in breaking things there :wink:. Of course, for tests performed by us admins / the IT crowd, separation is less critical, I agree.

but I'm not sure how that is implemented; I would think condor would have to allocate a pty first before starting singularity, and then it would be a different pty for each job anyway.

I don't believe it allocates any pty right now. It just executes sshd within the container, and sshd (tries) to take care. Running sshd "outside" is something I have suggested, though. An alternative using nsenter and / or doing something like your approach with mypty was on their developer mind, though, I think.

The bottom line is that what's most important in this issue is the original problem that I reported, that unprivileged singularity breaks on EL7. I think the minimal solution is to take the approach I took with my 2.6.0 underlay patch.

I fully agree! This is definitely the highest priority here.

apptainer / singularity