docker / for-mac

Bug reports for Docker Desktop for Mac
https://www.docker.com/products/docker#/mac

Bad performance of PHP(+xdebug) after sleep and waking up because of clock changing from TSC to HPET #3455

Closed - anton-siardziuk closed this issue 2 years ago

anton-siardziuk commented 5 years ago

Expected behavior

Test scenario should always take 30s to complete

Actual behavior

Test scenario takes 30s to complete after Docker for Mac restart, and takes 1m30s-2m to complete after sleep and wakeup.

Information

I have a PHP test scenario for my application that I can run to check how good or bad performance is. I also have a Linux laptop to compare with. This test completes in 30s in Docker on Linux. When I restart Docker for Mac, it runs in 30s on Docker for Mac as well, but after I put the MacBook to sleep and wake it up, the very same test takes 1m30s-2m, and it keeps taking that long until I restart Docker for Mac. After a restart, the test takes 30s again. I can reproduce it very reliably every time I try. The very same issue was reported a year ago here: https://github.com/EugenMayer/docker-sync/issues/524

Diagnostic logs

Docker for Mac: Version 2.0.0.0-mac81 (29211)

Steps to reproduce the behavior

  1. Run something with a fixed run time (one such workload is sketched below), note time X
  2. Put Mac to sleep, wake it up
  3. Run it again, see time 3X
  4. Restart Docker for Mac
  5. Run it again, see time X
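
For step 1, a standalone workload along these lines should also work (a rough sketch, not my exact test: the image tag and loop count are arbitrary, and microtime(true) maps to gettimeofday under the hood, which is what turns out to be slow):

time docker run --rm php:7.2-cli \
  php -r 'for ($i = 0; $i < 5000000; $i++) { microtime(true); }'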
anton-siardziuk commented 5 years ago

I started working on a way to easily reproduce this issue outside of my project. My first attempt was to run some C compilation, so I created a docker-compose.yml file like this:

version: '3.3'
services:
  benchmark:
    image: php:7.2.14-cli-stretch
    command: docker-php-ext-install json pdo pdo_mysql

And ran it like this:

time docker-compose up

docker-php-ext-install basically compiles those extensions. No luck: compile time is the same before and after sleep (~25s), so it is not the CPU that is slower. I will try to investigate further. My next idea is to run strace and compare the results from the fast and slow tests, but it might take some time to dig through the output. Any ideas on how to pin it down are welcome.

anton-siardziuk commented 5 years ago

More info on this problem. I ran a test PHP script through strace, and the difference between the good and bad performance traces is millions (literally) of gettimeofday calls when performance is bad. I also found these clock-related log entries:

2019-01-13 13:54:39.119934+0300  localhost com.docker.driver.amd64-linux[35411]: proxy << POST /v1.39/exec/a177f7c2bbbd2fc5816387e002da09680c89d4b828e59dc7627513a9e4e4b737/resize?h=41&w=179 (181.771536ms)
2019-01-13 13:56:40.584949+0300  localhost com.docker.vpnkit[35410]: TCP 192.168.65.3:53910 > 212.211.132.250:80 closing flow due to idle port disconnection
2019-01-13 13:56:40.585113+0300  localhost com.docker.vpnkit[35410]: TCP 192.168.65.3:33558 > 212.211.132.250:80 closing flow due to idle port disconnection
2019-01-13 14:01:47.719443+0300  localhost kernel[0]: PM notification timeout (pid 35411, com.docker.drive)
2019-01-13 14:01:47.719449+0300  localhost kernel[0]: PM notification timeout (pid 35411, com.docker.drive)
2019-01-13 14:06:09.957178+0300  localhost com.docker.hyperkit[35416]: [ 1616.560466] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
2019-01-13 14:06:09.964637+0300  localhost com.docker.hyperkit[35416]: [ 1616.579018] clocksource:                       'hpet' wd_now: 3bd42798 wd_last: 3ad23599 mask: ffffffff
2019-01-13 14:06:09.967614+0300  localhost com.docker.hyperkit[35416]: [ 1616.581360] clocksource:                       'tsc' cs_now: bf689879560 cs_last: b761a0f445c mask: ffffffffffffffff
2019-01-13 14:06:09.972656+0300  localhost com.docker.hyperkit[35416]: [ 1616.587694] Hangcheck: hangcheck value past margin!
2019-01-13 14:06:09.990049+0300  localhost com.docker.hyperkit[35416]: [ 1616.602045] clocksource: Switched to clocksource hpet
2019-01-13 14:06:09.997985+0300  localhost com.docker.driver.amd64-linux[35411]: Requesting time sync
2019-01-13 14:06:10.420056+0300  localhost com.docker.driver.amd64-linux[35411]: cannot read sysevent: EOF
anton-siardziuk commented 5 years ago

So after a lot of gdb, strace and googling, I think I understand what is going on. On wakeup, Linux decides that the super-fast but not-so-reliable TSC clock is no longer trustworthy and switches to the super-reliable but much slower HPET clock. The former works without any syscalls, in userspace, but HPET goes through a syscall, hence the millions of gettimeofday syscalls when performance is bad. And PHP applications in development mode ask for the current time millions of times (because of how the PHP debugger works). That's why this clock switch is so noticeable: everything is about 3 times slower after wakeup.
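
To see the difference from inside a container, a rough sketch (it assumes strace is installed in the container and ptrace is permitted, e.g. the container was started with --cap-add SYS_PTRACE):

# with the tsc clocksource, gettimeofday is served from the vDSO and the counters
# stay near zero; with hpet, every microtime() call shows up as a real syscall
strace -f -c -e trace=gettimeofday,clock_gettime \
  php -r 'for ($i = 0; $i < 100000; $i++) { microtime(true); }'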

barat commented 5 years ago

Nice catch - I have similar issues with a big Symfony (+ES) project. Until sleep everything seems to be OK; after wakeup, 10x degraded performance. I use docker-sync to eliminate bind mount issues (thinking about Mutagen though), but this happens even without any mounts. I hope someone from the dev team will take a closer look at it. Right now, Docker Desktop is really problematic software to work with: it requires workarounds, constant restarts, has time drifts...

ThHareau commented 5 years ago

I'm also having this issue, and have had it for a while now actually (almost 6 months), on Fedora.

remcoverton commented 5 years ago

We are having exactly the same issue and it's blocking us from migrating local development to a docker setup. We are on the latest stable version: 2.0.0.3 (31259), Mac 10.14.3.

For testing purposes I have created a small container that runs the example C script from https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int
main(int argc, char *argv[])
{
        struct timeval tv;
        int i = 0;
        for (; i<100; i++) {
                gettimeofday(&tv,NULL);
        }

        return 0;
}

wrapped in the following loop:

#!/bin/bash

while true; do
    strace -ce gettimeofday /gettimeofdaytest/test
    sleep 10
done

After closing the Mac and waiting a minute or two, the script starts outputting trace information indicating that the calls are now real system calls. The source of the issue seems to be the same unstable TSC clocksource.

linuxkit-025000000001:~# echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
[ 4483.670063] clocksource: Override clocksource tsc is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode

When I restart docker desktop the problem is gone again.
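
For a quicker check than strace, the VM-wide clocksource can be read from any container; a small sketch (the alpine image is just an example):

docker run --rm alpine cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# prints "tsc" while things are healthy, "hpet" once the watchdog has demoted the TSC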

tonivdv commented 5 years ago

I was going to create a new support ticket until I saw this one. I have exactly the same issue, also with a PHP project with a lot of mounted files. Only a Docker for Mac restart fixes it! Given that most Mac users put their Macs to sleep, this is very annoying.

Hope you can find a solution to this.

traed commented 5 years ago

Same here. Hoping for a quick fix!

dchassin commented 5 years ago

@m00t: kudos to you. We are experiencing the same problem with gridlabd. A complete validation run normally takes about 400 seconds; after sleep it typically takes around 2000 seconds. Your conclusion is entirely consistent with our findings. After a lot of playing around with Linux builds on different VM platforms, we decided it had to be in the Docker system itself. For example, after the same sleep interval (overnight), centos:7 in Docker is slow, but CentOS 7 in VMware is fine.

The only solution we have now is to restart the docker daemon.

moacode commented 5 years ago

I've noticed this exact behavior, although I've only just had time to investigate. Typically we run a stack of PHP, MySQL and Apache proxied through Nginx, and in my test (which is based on a profiler read-out) I'm using CraftCMS with PHP 7.2.

This is a screenshot of the profiler with Docker daemon restarted AND the container stack freshly recreated:

[screenshot: Screen Shot 2019-04-27 at 11 31 45 PM]

~1.5s to load the page (I am using Mutagen for keeping files in sync).

After putting my computer to sleep and waking it up again, the same page reload yielded the following results:

[screenshot: Screen Shot 2019-04-27 at 10 12 40 PM]

~6 seconds!?!

Finally, after restarting Docker I was getting the following output:

[screenshot: Screen Shot 2019-04-27 at 10 16 05 PM]

~3 seconds... Still not at the original 1.5 seconds.

It wasn't until I restarted the docker daemon and re-created my containers that the page load time came back down to 1.5 seconds (Mutagen was left running the entire time).

remcoverton commented 5 years ago

I have updated to the latest Docker (edge) version and the problem still persists in exactly the same manner as described above.

2.0.4.0

remcoverton commented 5 years ago

Tested on the latest docker Edge version, it still falls back to hpet.

2.0.5.0

docker-desktop:~# [ 450.623931] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 450.627869] clocksource: 'hpet' wd_now: cf3bfb7 wd_last: c111b9b mask: ffffffff
[ 450.630547] clocksource: 'tsc' cs_now: 21d88179225 cs_last: 21c9504986d mask: ffffffffffffffff
[ 450.634050] clocksource: Switched to clocksource hpet
docker-desktop:~# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

tomasinchoo commented 5 years ago

Until this issue is resolved in stable version, it is impossible to use docker on Mac.

dchassin commented 5 years ago

We’ve been able to use docker on macs as long as we don’t let the Mac sleep. If it sleeps we have to restart docker.

tomasinchoo commented 5 years ago

We’ve been able to use docker on macs as long as we don’t let the Mac sleep. If it sleeps we have to restart docker.

Yeah, I know, but that is not much of a solution, especially on a Mac, which sleeps a lot and has limited battery. I don't even use a Mac, but we have a team that does and I really can't understand why; it is so limited and weak. Unfortunately I'm the one handling the Docker images for them. :(

dchassin commented 5 years ago

Indeed. It's been a real source of frustration for most of us because we use Docker a lot on laptops. Sleep is very common even when running long jobs. It has to be fixed soon.

smaudet commented 5 years ago

Slightly different problem but I think it might be a problem with OSX itself...

Running VirtualBox for a Linux VM (so a similar situation to Docker): after OSX sleeps and then resumes, the whole computer is unusable, requiring a restart...

This is some shoddy OSX resume/restart logic, I think. Any way to downgrade OSX to a version that worked?

tonivdv commented 5 years ago

@smaudet I never had that issue with VBox. I used Docker Toolbox with NFS enabled for a very long time and I never had the issue described here. So I doubt it's related.

dchassin commented 5 years ago

Slightly different problem but I think it might be a problem with OSX itself...

If I had to guess, it's an interaction between the three systems, and not something that is exclusively the fault of any one of them. It works fine with VMware Linux on OSX, but it doesn't work with Docker Linux on OSX, where Linux and OSX are the same versions in both cases. Perhaps when OSX wakes from sleep, Linux isn't being told what happened and doesn't reset the clock stability check properly on Docker, whereas on VMware it is. Since VMware can do it but Docker can't, that suggests OSX isn't solely to blame for this problem.

smaudet commented 5 years ago

@dchassin the curious thing is there isn't an issue on the windows or linux host side of things... perhaps it is partly related to the software? I.e. VMWare catches the sleep event while Docker and VBox don't...

Could be hardware too, I've not ruled that out yet (would try to BootCamp windows or linux host and see if the problem persists).

carlos-granados commented 5 years ago

I have created an issue in the Xdebug repo asking for a way to disable this call to gettimeofday, as it is just used to timestamp stack entries and I think that 99% of the applications that use Xdebug don't need this info.

joniklas commented 5 years ago

Thanks @carlos-granados for your initiative. This could decrease the symptoms of the slow clocksource problem.

Since gettimeofday is used in many cases other than debugging, it would be great if this issue on Macs could be solved at the core - i.e. inside Docker. (There are also other low-level calls which are slow while debugging, e.g. getpid.)

derickr commented 5 years ago

Is there a different clock source that can be used that does not have this problem?

dochne commented 5 years ago

Is there a different clock source that can be used that does not have this problem?

There are three clock sources available most of the time: tsc, hpet and acpi_pm. If TSC desyncs (causing this issue), we're left with only hpet and acpi_pm.

I'll try to get around to running some benchmarks with acpi instead, but from my reading about different clock sources I suspect anything non-TSC will end up with the same issue.
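
For reference, this is roughly how I plan to switch the VM over to acpi_pm for that benchmark (a sketch; it assumes the usual privileged-container nsenter trick for getting a root shell inside the Docker Desktop VM works on your version):

# open a root shell inside the Docker Desktop VM
# (if nsenter is missing from alpine's busybox on your setup, use an image that ships util-linux)
docker run -it --rm --privileged --pid=host alpine nsenter -t 1 -m -u -n -i sh
# then, inside the VM:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource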

As a side note, here are a couple of helpful bash scripts that we've been using internally lately to at least increase visibility of the issue; they may or may not be useful to others experiencing it.

Verifying whether this is currently affecting your containers:

YELLOW="\033[0;33m"
RED="\033[31m"
RESET="\033[0m"
DOCKER_CLOCK_SOURCE=$(docker exec -ti {my_docker_container} /bin/bash -c 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource | tr -d "[:space:]"')
if [[ ${DOCKER_CLOCK_SOURCE} != "tsc" ]]; then
    echo -e "${RED}WARNING!${RESET} - ${YELLOW}Clock drift detected. Please restart Docker Desktop to avoid severe performance issues${RESET}"
fi

Automatically killing and restarting Docker for Desktop via CLI:

echo "Killing Docker"
osascript -e 'quit app "Docker"';
echo "Restarting Docker"
open -a Docker
echo "Waiting for Docker to start"
while ! docker system info > /dev/null 2>&1; do printf "."; sleep 1; done; printf "\n";
dsazup commented 5 years ago

I'm currently running into the same issue. A test suite which should run for 2 minutes max runs for 11 minutes after sleep. After restarting, it's back to 2 minutes.. 🤔

dchassin commented 5 years ago

Sometime during the last week centos:7 slowed down significantly when running the gridlabd validation tests, which use time calls a lot. I wonder if it has anything to do with the change to the clock calls.

herregroen commented 4 years ago

Also suffering from this issue.

I've created a script, inspired by @Dolondro's scripts, that runs a polling check alongside docker-compose up to detect clocksource changes and automatically restarts Docker and your containers when that happens.

CONTAINER_NAME="my_container"

trap stop_docker INT
function stop_docker {
    wait $PROCESS
    exit
}

echo "Ensuring all containers are built"
docker-compose up --no-start

echo "Booting containers"
docker-compose up &
PROCESS=$!
sleep 5

while true; do
    CLOCK_SOURCE=`docker exec -ti $CONTAINER_NAME /bin/bash -c 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource'| tr -d '[:space:]'`
    if [ "$CLOCK_SOURCE" != "tsc" ]; then
        echo "Restarting docker now to fix out-of-sync hardware clock!"
        docker ps -q | xargs -L1 docker stop
        osascript -e 'quit app "Docker"'
        open --background -a Docker
        echo "Giving docker time to start..."
        until docker info 2> /dev/null 1> /dev/null; do
            sleep 1
        done
        echo "Docker is up and running again! Booting containers!"
        docker-compose up &
        PROCESS=$!
    fi
    sleep 5
done
ralphschindler commented 4 years ago

Bump, same issue. Really hoping someone can find a solution other than restarting Docker after waking up and finding the clock set to use hpet :(

iBobik commented 4 years ago

Also with version 2.1.0.5.

Diagnostics ID: 959AB9DD-E4E2-43B6-94C6-2C4058050B79/20200215151633

stevendesu commented 4 years ago

Are you sure this is just a Docker issue? I've been having bad PHP performance issues on my Mac whether running in Docker or on my host machine. After spending a day experimenting and researching to find the cause, I stumbled upon this thread.

I haven't used gdb to profile PHP yet, but I have experimentally found that including a file (via a simple include("filename.php");) takes 0.0001 seconds normally and about 0.003 seconds (30x longer!) when PHP is going slow.

After seeing this thread, I tried out the same reproduction steps:

This could be a PHP issue? I'll be honest, I don't fully understand what tsc vs hpet means, but I think this issue (or at least a remotely related one) extends beyond Docker

ralphschindler commented 4 years ago

@stevendesu blame can be spread around: Apple, Docker, and to a much lesser degree Xdebug itself (maybe gettimeofday() isn't really necessary during debugging? That is a matter of opinion).

https://joshbutts.com/posts/patching-xdebug-docker-for-mac/

remcoverton commented 4 years ago

@stevendesu blame can be spread around: Apple, Docker, and to a much lesser degree Xdebug itself (maybe gettimeofday() isn't really necessary during debugging? That is a matter of opinion).

https://joshbutts.com/posts/patching-xdebug-docker-for-mac/

Hi Ralph,

It is not just Xdebug which is affected; to a lesser degree, New Relic, for example, is also affected. Whether it's noticeable or just measurable really depends on what type of application you are running; in our case we are running a Magento 1 shop and, well, you feel it.

dchassin commented 4 years ago

As an experiment I eliminated all the calls to ftime() in gridlabd, but it didn't make much difference. Does anyone know whether there are other system calls that are affected by this problem?

ralphschindler commented 4 years ago

This is probably a good source of information for tracking down which syscalls are affected: https://0xax.gitbooks.io/linux-insides/Timers/linux-timers-7.html

stevendesu commented 4 years ago

I personally noticed that this problem affected the performance of include / require / require_once calls, specifically when relying on the resolver (if a full path was not provided), e.g. when I did something like require "\ThirdPartyLib\Lib.php"; to pull in a Composer dependency.

My guess is that it was using fstat or some similar system call to check for file existence (iteratively checking each directory in the include path), and that this call was affected.
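
One way to check that guess (a sketch; it assumes strace is available where PHP runs, and vendor/autoload.php is just a placeholder for whatever is being required):

# summarise the syscalls made while resolving and loading the file; a clocksource
# switch to hpet mainly shows up as extra gettimeofday / clock_gettime calls,
# while the resolver work shows up in the stat / open family
strace -c -f php -r 'require "vendor/autoload.php";'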

chadmccune commented 4 years ago

I'm having what I think is the same problem on Windows 10, and it's not just the containers that are slow - the Docker Desktop UI is as well.

derickr commented 4 years ago

(At least) libc has other calls for getting timing information: clock() and clock_gettime(). Although clock() returns CPU time rather than wall time, I was wondering whether it would have the same problem as gettimeofday()?

docker-robott commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale

dochne commented 4 years ago

/lifecycle frozen

dchassin commented 4 years ago

Can anyone confirm that this issue is no longer a problem?

LucidTaZ commented 4 years ago

I still see the issue on Docker Desktop 2.4.0.0 (48506)

mishavantol commented 4 years ago

Still happening on 2.4.1.0 (48583). Need to restart docker every morning. After sleep all requests are approximately three times slower.

dmytro-demchyna commented 4 years ago

Same on Docker Desktop 2.4.0.0 (48506)

Jaro22 commented 4 years ago

Still seeing it on Docker Desktop 2.4.1.0 (48583)

scottgeary commented 4 years ago

Before:

/sys/devices/system/clocksource/clocksource0 # cat available_clocksource
tsc hpet acpi_pm
/sys/devices/system/clocksource/clocksource0 # cat current_clocksource
tsc

Sleep the Mac for about a minute or so....

/sys/devices/system/clocksource/clocksource0 # dmesg -T | tail -n 15
[Mon Oct  5 09:01:31 2020] device vethaa18b9d entered promiscuous mode
[Mon Oct  5 09:01:31 2020] IPv6: ADDRCONF(NETDEV_UP): vethaa18b9d: link is not ready
[Mon Oct  5 09:01:32 2020] IPVS: ftp: loaded support on port[0] = 21
[Mon Oct  5 09:01:32 2020] eth0: renamed from veth650fbcc
[Mon Oct  5 09:01:32 2020] IPv6: ADDRCONF(NETDEV_CHANGE): vethaa18b9d: link becomes ready
[Mon Oct  5 09:01:32 2020] br-dd52805928be: port 11(vethaa18b9d) entered blocking state
[Mon Oct  5 09:01:32 2020] br-dd52805928be: port 11(vethaa18b9d) entered forwarding state
[Mon Oct  5 09:03:37 2020] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[Mon Oct  5 09:03:37 2020] clocksource:                       'hpet' wd_now: 5f6deca1 wd_last: 5e667328 mask: ffffffff
[Mon Oct  5 09:03:37 2020] clocksource:                       'tsc' cs_now: 14a95742623e8 cs_last: 14a68725c414c mask: ffffffffffffffff
[Mon Oct  5 09:03:37 2020] tsc: Marking TSC unstable due to clocksource watchdog
[Mon Oct  5 09:03:37 2020] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[Mon Oct  5 09:03:37 2020] sched_clock: Marking unstable (656619985969, -12297681)<-(656645215991, -37522988)
[Mon Oct  5 09:03:37 2020] clocksource: Switched to clocksource hpet
[Mon Oct  5 09:03:38 2020] hrtimer: interrupt took 2680800 ns

After:

/sys/devices/system/clocksource/clocksource0 # cat available_clocksource
hpet acpi_pm
/sys/devices/system/clocksource/clocksource0 # cat current_clocksource
hpet

And then terrible performance on anything that uses system clock calls.

Forgot to add:

niksynov commented 3 years ago

Issue still occurs on:

carlos-granados commented 3 years ago

@niksynov are you using Xdebug 3? It should be fixed with that version.
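
For reference, Xdebug 3 only asks for the time when the active mode actually needs it, so limiting the mode is usually enough. A rough sketch (the ini path is the one the official PHP images use when Xdebug is enabled via docker-php-ext-enable; yours may differ):

# limit Xdebug to step debugging only; other modes may still need timestamps
echo "xdebug.mode=debug" >> /usr/local/etc/php/conf.d/docker-php-ext-xdebug.ini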

edmundcraske-bjss commented 3 years ago

At most it would be worked around in Xdebug 3? As this is more of an issue with Docker Desktop not handling sleep?

carlos-granados commented 3 years ago

Yes, it is worked around by using a different way to get the time; you are right that the issue still persists in Docker.

derickr commented 3 years ago

| Yes, it is worked around by using a different way to get the time; you are right that the issue still persists in Docker.

That's not really true. Although Xdebug does use a different way of getting the time, that is not what solves this issue. Xdebug 3 solves it by not asking for the time if it doesn't need to. In plain debug mode there is no reason to know the time, so Xdebug doesn't hit the broken Docker-on-Mac system call.