docker / for-win

Bug reports for Docker Desktop for Windows
https://www.docker.com/products/docker#/windows

Incorrectly handled clock_gettime64 syscall #8326

Open igorsnunes opened 4 years ago

igorsnunes commented 4 years ago

Expected behavior

Actual behavior

Information

Hi everyone,

While running my application on an i386 Debian image (bullseye), I constantly receive "Operation not permitted" from the clock_gettime system call. The error only happens with a newer version of libc6 (2.31-1), and once it occurs, my application eventually stops working.

After some investigation, I found that in newer versions of glibc, clock_gettime() first tries the clock_gettime64() syscall. When I run my application under strace (to trace system calls), I can see that clock_gettime64() returns EPERM, and this specific error code breaks the application. The problem is that glibc expects ENOSYS, which indicates that the kernel does not implement the syscall. In that case, libc falls back to another implementation of clock_gettime and returns the correct value; if EPERM is returned instead, libc treats it as an error.
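The decision described above can be sketched in a few lines of C. This is only a model of the behaviour as I understand it, not glibc's actual source; the enum labels and the function name `choose_time_path` are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome labels; these names are illustrative, not glibc's. */
enum time_path {
    USE_TIME64_RESULT,    /* the 64-bit syscall worked, use its value */
    FALL_BACK_TO_TIME32,  /* kernel lacks the syscall, try the old one */
    REPORT_ERROR          /* anything else is passed to the caller as a failure */
};

/* Decide what to do after attempting clock_gettime64.
 * rc is the syscall result (0 or -1); err is errno when rc == -1. */
enum time_path choose_time_path(int rc, int err)
{
    assert(rc == 0 || rc == -1);
    if (rc == 0)
        return USE_TIME64_RESULT;
    if (err == ENOSYS)
        return FALL_BACK_TO_TIME32;
    return REPORT_ERROR;  /* EPERM from seccomp ends up here */
}
```

With this model, `choose_time_path(-1, ENOSYS)` selects the fallback, while `choose_time_path(-1, EPERM)` reports an error, which matches what I see in the application.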

I can work around this issue by running the container with the --privileged flag, or by creating a seccomp profile that has the following configuration:

"defaultAction": "SCMP_ACT_TRACE"

Which means: return ENOSYS as a default behavior, instead of EPERM.

The --privileged flag bypasses seccomp and allows every syscall to be handled by the kernel (and apparently the kernel returns the correct code).

Question: why is clock_gettime64 not matched by any seccomp profile (including the default one used by the engine)? The only way I managed to make this syscall return ENOSYS using a seccomp profile was to set defaultAction to SCMP_ACT_TRACE. As far as I can see, this is not good practice; the correct default action would be SCMP_ACT_ERRNO. Below are the two approaches I tried in my seccomp profile, neither of which worked:

Explicitly allowing clock_gettime64:

{ "names": ["clock_gettime64"], "action": "SCMP_ACT_ALLOW", "args": [], "comment": "", "includes": {}, "excludes": {} }

Explicitly setting the behavior of clock_gettime64 to TRACE:

{ "names": ["clock_gettime64"], "action": "SCMP_ACT_TRACE", "args": [], "comment": "", "includes": {}, "excludes": {} }

As shown here https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1868720 , this might be a problem related to an older version of libseccomp installed on the host. Is there a way to get this information from the host Linux system used by Docker Desktop?
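To make that hypothesis concrete, here is a simplified model of how I imagine an allow-list rule can be lost when the library's syscall table predates the syscall. It is not libseccomp's real code or data structures; the tables, `resolve`, and `evaluate` are invented for illustration, and only the i386 syscall numbers (265 and 403) are real.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model, not libseccomp's real data structures. */
enum action { ACT_ALLOW, ACT_ERRNO_EPERM };

struct name_entry { const char *name; int nr; };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* i386 syscall numbers: clock_gettime is 265, clock_gettime64 is 403. */
static const struct name_entry OLD_TABLE[] = {
    { "clock_gettime", 265 },
};
static const struct name_entry NEW_TABLE[] = {
    { "clock_gettime", 265 },
    { "clock_gettime64", 403 },
};

/* Names the profile wants to allow. */
static const char *const ALLOWED[] = { "clock_gettime", "clock_gettime64" };

/* Resolve a syscall name to its number, or -1 if this table predates it. */
static int resolve(const struct name_entry *tbl, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tbl[i].name, name) == 0)
            return tbl[i].nr;
    return -1;
}

/* Evaluate syscall number nr against the allow-list. A name the table
 * cannot resolve contributes no rule, so the default action applies. */
enum action evaluate(const struct name_entry *tbl, size_t n, int nr)
{
    assert(tbl != NULL);
    for (size_t i = 0; i < COUNT(ALLOWED); i++) {
        int resolved = resolve(tbl, n, ALLOWED[i]);
        if (resolved >= 0 && resolved == nr)
            return ACT_ALLOW;
    }
    return ACT_ERRNO_EPERM;  /* models defaultAction SCMP_ACT_ERRNO */
}
```

In this model the old table yields EPERM for syscall 403 even though the profile names clock_gettime64, which would explain why neither of my explicit rules had any effect.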

Please let me know if I am missing something in my analysis.

PS: Some documentation used for this analysis:

https://gitlab.alpinelinux.org/alpine/aports/-/issues/11774
https://lwn.net/Articles/795128/
https://docs.docker.com/engine/security/seccomp/
https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1868720

Steps to reproduce the behavior

Compile the following code for 32 bits, i.e. "gcc -m32":

#include <stdio.h>
#include <time.h>
#include <fcntl.h>

int main () {
    struct timespec tp;

    if (clock_gettime(CLOCK_REALTIME, &tp) == -1) {
        perror("clock_gettime");
    }
    else {
        printf("clock_gettime success: %ld\n", tp.tv_nsec);
    }
    return 0;
}

  1. Run the binary in a Docker image with a newer version of libc in an i386 environment. For example:

docker run --entrypoint bash -v "C:\path_to_bin:/path_to_bin" -it i386/debian:bullseye

stephen-turner commented 4 years ago

Thanks for the report, @igorsnunes. I found a seemingly relevant patch in version 19.03.9 of the upstream engine: https://github.com/moby/moby/compare/v19.03.8...v19.03.9. But we have engine 19.03.12 in Desktop 2.3.0.4 so maybe there's something else we need to do in our VM. We'll take a look.

djs55 commented 4 years ago

Hi @igorsnunes -- we believe we need to upgrade the version of libseccomp bundled inside Docker Desktop. We currently link libseccomp statically into the dockerd binary. The simplest solution therefore is to bump the version inside the build environment -- which is in progress -- but unfortunately this is the build environment used to build other Linux packages and the change has knock-on effects for other architectures like armhf so it may take a while to fix.

We're also considering switching Docker Desktop to a dynamically-linked dockerd, which would allow Desktop to bump the libseccomp version in our Dockerfile without worrying about effects on armhf (for now, anyway). However, this is quite a big change to our build process too, so it will take a while.

In summary

Thanks again for your report. I was hoping there was a quick fix available but unfortunately I've failed to find one.

/lifecycle frozen

raxvan commented 3 years ago

Hello, just for information: I found this issue while searching for reasons why clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now_time); returns an incorrect time in now_time.tv_nsec. With the same code posted by @igorsnunes (except for CLOCK_PROCESS_CPUTIME_ID), I'm getting more or less random values in tv_nsec. I'm using Docker Engine with WSL2. Running the image with --privileged solves the issue.
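One way to separate a failing call from a genuinely wrong reading is to zero the struct first and keep the errno value. This is only a diagnostic sketch, not a confirmed explanation of the random values; `probe_clock`, `timespec_is_sane`, and `clock_works` are hypothetical helper names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: returns 0 on success or the errno value on failure.
 * *out is zeroed first, so a failed call cannot leave stale stack bytes
 * that look like a random tv_nsec. */
int probe_clock(clockid_t id, struct timespec *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof *out);
    if (clock_gettime(id, out) == -1)
        return errno;
    return 0;
}

/* A valid timespec has tv_nsec in the range [0, 999999999]. */
int timespec_is_sane(const struct timespec *ts)
{
    return ts->tv_nsec >= 0 && ts->tv_nsec <= 999999999L;
}

/* Returns 1 if the clock can be read and the result is in range, else 0. */
int clock_works(clockid_t id)
{
    struct timespec ts;
    return probe_clock(id, &ts) == 0 && timespec_is_sane(&ts);
}
```

Calling `clock_works(CLOCK_PROCESS_CPUTIME_ID)` inside the container with and without --privileged, and printing the value returned by `probe_clock`, should show whether EPERM is involved here as well.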

igorsnunes commented 3 years ago

Thanks @djs55 and @stephen-turner . I'll keep following your updates.

microhobby commented 3 years ago

With kernel v5.x, the issue does not occur:

image

If I'm not mistaken, these syscalls were added in kernel v5.1. So I hope that the next update of the WSL 2 kernel, which is planned to use the v5.4 LTS kernel, will solve this.
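A quick check of a kernel release string against that v5.1 threshold could look like the sketch below. The function name `kernel_has_time64` is made up, it only compares the upstream version number, and it cannot detect distribution kernels that backport the syscalls.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the release string (as printed by `uname -r`) is v5.1 or
 * newer, 0 if it is older, and -1 if it cannot be parsed. */
int kernel_has_time64(const char *release)
{
    int major = 0, minor = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return (major > 5 || (major == 5 && minor >= 1)) ? 1 : 0;
}
```

For example, an illustrative release string such as "4.19.76-linuxkit" gives 0 and "5.4.72-microsoft-standard-WSL2" gives 1; the real string can be obtained with uname(2) inside the container.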

disconnect3d commented 3 years ago

I am running into the same issue on macOS with (currently latest) Docker Desktop 3.2.2: the clock_gettime64 syscall returns EPERM.

...and this can be worked around with --security-opt seccomp=unconfined, so it's related to seccomp blocking the syscall. It seems that Docker whitelisted this syscall in their default seccomp policy a year ago, but for some reason this is not used in Docker Desktop. Why?

Anyway, showing this on the log below (container run with default flags + --cap-add=SYS_PTRACE).

root@72bbc100bb69:/# cat a.c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <time.h>
#include <fcntl.h>

int main() {
   struct timespec tp;
   syscall(SYS_clock_gettime64, 0, &tp);
}
root@72bbc100bb69:/# gcc -m32 a.c
root@72bbc100bb69:/# strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffc77fd7d50 /* 8 vars */) = 0
strace: [ Process PID=19 runs in 32 bit mode. ]
brk(NULL)                               = 0x58289000
arch_prctl(0x3001 /* ARCH_??? */, 0xffef8c28) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f60000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=21704, ...}) = 0
mmap2(NULL, 21704, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7f5a000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib32/libc.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\360\1\0004\0\0\0"..., 512) = 512
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205\327\273-\255\17\201r\321\300,\3\21\240\fF"..., 96, 468) = 96
fstat64(3, {st_mode=S_IFREG|0755, st_size=2002268, ...}) = 0
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205\327\273-\255\17\201r\321\300,\3\21\240\fF"..., 96, 468) = 96
mmap2(NULL, 2010892, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7d6f000
mmap2(0xf7d8c000, 1409024, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0xf7d8c000
mmap2(0xf7ee4000, 458752, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0xf7ee4000
mmap2(0xf7f54000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e4000) = 0xf7f54000
mmap2(0xf7f58000, 7948, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f58000
close(3)                                = 0
set_thread_area({entry_number=-1, base_addr=0xf7f61100, limit=0x0fffff, seg_32bit=1, contents=0, read_exec_only=0, limit_in_pages=1, seg_not_present=0, useable=1}) = 0 (entry_number=12)
mprotect(0xf7f54000, 8192, PROT_READ)   = 0
mprotect(0x5662b000, 4096, PROT_READ)   = 0
mprotect(0xf7f92000, 4096, PROT_READ)   = 0
munmap(0xf7f5a000, 21704)               = 0
clock_gettime64(CLOCK_REALTIME, 0xffef8c14) = -1 EPERM (Operation not permitted)
exit_group(0)                           = ?
+++ exited with 0 +++
root@72bbc100bb69:/#

With --security-opt seccomp=unconfined (which I don't recommend) it returns ENOSYS (as expected):

clock_gettime64(CLOCK_REALTIME, 0xfffad424) = -1 ENOSYS (Function not implemented)

djs55 commented 3 years ago

I believe this issue is fixed on Docker Desktop 3.3.1 with the newer runc:

Screenshot 2564-04-29 at 09 26 29