I started working on a way to easily reproduce this issue outside of my project. My first idea was to try running some C compilation, so I created a docker-compose.yml file like this:
```yaml
version: '3.3'
services:
  benchmark:
    image: php:7.2.14-cli-stretch
    command: docker-php-ext-install json pdo pdo_mysql
```
and ran it like this:
```bash
time docker-compose up
```
Here `docker-php-ext-install` basically compiles those extensions from source.
No luck: compile time is the same before and after sleep (~25s). So it is not the CPU that is working more slowly. I will investigate further. My next idea is to run strace on the fast and slow runs and compare the results, but it might take some time to dig through them. Any ideas on how to pin it down are welcome.
More info on this problem. I ran a test PHP script through strace, and the difference between the good-performance and bad-performance traces is literally millions of gettimeofday calls when performance is bad. I also found these clock-related logs:
```
2019-01-13 13:54:39.119934+0300 localhost com.docker.driver.amd64-linux[35411]: proxy << POST /v1.39/exec/a177f7c2bbbd2fc5816387e002da09680c89d4b828e59dc7627513a9e4e4b737/resize?h=41&w=179 (181.771536ms)
2019-01-13 13:56:40.584949+0300 localhost com.docker.vpnkit[35410]: TCP 192.168.65.3:53910 > 212.211.132.250:80 closing flow due to idle port disconnection
2019-01-13 13:56:40.585113+0300 localhost com.docker.vpnkit[35410]: TCP 192.168.65.3:33558 > 212.211.132.250:80 closing flow due to idle port disconnection
2019-01-13 14:01:47.719443+0300 localhost kernel[0]: PM notification timeout (pid 35411, com.docker.drive)
2019-01-13 14:01:47.719449+0300 localhost kernel[0]: PM notification timeout (pid 35411, com.docker.drive)
2019-01-13 14:06:09.957178+0300 localhost com.docker.hyperkit[35416]: [ 1616.560466] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
2019-01-13 14:06:09.964637+0300 localhost com.docker.hyperkit[35416]: [ 1616.579018] clocksource: 'hpet' wd_now: 3bd42798 wd_last: 3ad23599 mask: ffffffff
2019-01-13 14:06:09.967614+0300 localhost com.docker.hyperkit[35416]: [ 1616.581360] clocksource: 'tsc' cs_now: bf689879560 cs_last: b761a0f445c mask: ffffffffffffffff
2019-01-13 14:06:09.972656+0300 localhost com.docker.hyperkit[35416]: [ 1616.587694] Hangcheck: hangcheck value past margin!
2019-01-13 14:06:09.990049+0300 localhost com.docker.hyperkit[35416]: [ 1616.602045] clocksource: Switched to clocksource hpet
2019-01-13 14:06:09.997985+0300 localhost com.docker.driver.amd64-linux[35411]: Requesting time sync
2019-01-13 14:06:10.420056+0300 localhost com.docker.driver.amd64-linux[35411]: cannot read sysevent: EOF
```
So after a lot of gdb, strace, and googling, I think I understand what is going on. On wakeup, Linux decides that the super-fast but less reliable TSC clock is no longer up to date and switches to the super-reliable but much slower HPET clock. The first works entirely in userspace without any syscalls; HPET readings go through a real syscall, hence the millions of gettimeofday syscalls when performance is bad. PHP applications in development mode ask for the current time millions of times (because of how the PHP debugger works), which is why this clock switch is so noticeable: everything is about 3 times slower after wakeup.
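A minimal sketch of one way to see this directly (my own illustrative example, not from this report; the iteration count is an arbitrary assumption): average the cost of a single gettimeofday() call, which stays around tens of nanoseconds while the userspace fast path works and jumps by an order of magnitude or more once every call becomes a real syscall.

```c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
    const long iterations = 1000000;
    struct timeval tv;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        gettimeofday(&tv, NULL);   /* cheap with tsc, a full syscall after the hpet fallback */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per gettimeofday() call\n", elapsed_ns / iterations);
    return 0;
}
```

Compiling and running this inside a container before and after putting the Mac to sleep should make the clocksource fallback visible without needing strace.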
Nice catch - I have similar issues with a big Symfony (+ES) project. Until sleep everything seems OK; after wakeup, performance is degraded 10x. I use docker-sync to eliminate bind-mount issues (thinking about Mutagen though), but this happens even without any mounts. I hope someone from the dev team will take a closer look at it. Right now Docker Desktop is really problematic software to work with: it requires workarounds, constant restarts, and has time drifts...
I'm having this issue too, on Fedora, and have been for a while actually (almost 6 months).
We are having exactly the same issue and it's blocking us from migrating local development to a docker setup. We are on the latest stable version: 2.0.0.3 (31259), Mac 10.14.3.
For testing purposes I have created a small container that runs the example C script from https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int
main(int argc, char *argv[])
{
    struct timeval tv;
    int i = 0;

    /* 100 gettimeofday() calls; strace -c shows whether they reach the kernel */
    for (; i < 100; i++) {
        gettimeofday(&tv, NULL);
    }
    return 0;
}
```
```bash
#!/bin/bash
while true; do
    strace -ce gettimeofday /gettimeofdaytest/test
    sleep 10
done
```
After closing the Mac and waiting a minute or two, the script starts outputting trace information indicating that the calls are now real system calls. The source of the issue seems to be the same unstable TSC clocksource.
```
linuxkit-025000000001:~# echo 'tsc' > /sys/devices/system/clocksource/clocksource0/current_clocksource
[ 4483.670063] clocksource: Override clocksource tsc is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode
```
When I restart docker desktop the problem is gone again.
I was going to create a new support ticket until I saw this one. I have exactly the same issue, also with a PHP project with a lot of mounted files. Only a Docker for Mac restart fixes it! Given that most Mac users put their machines to sleep, this is very annoying.
Hope you can find a solution to this.
Same here. Hoping for a quick fix!
@m00t: kudos to you. We are experiencing the same problem with gridlabd. A complete validation run normally takes about 400 seconds; after sleep it typically takes around 2000 seconds. Your conclusion is entirely consistent with our findings. After a lot of playing around with Linux builds on different VM platforms, we decided it had to be in the Docker system itself. For example, after the same sleep interval (overnight), a centos:7 container in Docker is slow, but a VMware CentOS 7 VM is fine.
The only solution we have now is to restart the Docker daemon.
I've noticed this exact behavior, although I've only just had time to investigate. We typically run a stack of PHP, MySQL and Apache proxied through Nginx, and in my test (which is based on a profiler read-out) I'm using CraftCMS with PHP 7.2.
This is a screenshot of the profiler with Docker daemon restarted AND the container stack freshly recreated:
~1.5s to load the page (I am using Mutagen for keeping files in sync).
After putting my computer to sleep and waking it up again, the same page reload yielded the following results:
~6 seconds!?!
Finally, after restarting Docker I was getting the following output:
~3 seconds... Still not at the original 1.5 seconds.
It wasn't until I restarted the docker daemon and re-created my containers that the page load time came back down to 1.5 seconds (Mutagen was left running the entire time).
I have updated to the latest Docker (Edge) version and the problem still persists in exactly the same manner as described above.
Tested on the latest Docker Edge version, 2.0.5.0; it still falls back to hpet.
```
docker-desktop:~# [ 450.623931] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 450.627869] clocksource: 'hpet' wd_now: cf3bfb7 wd_last: c111b9b mask: ffffffff
[ 450.630547] clocksource: 'tsc' cs_now: 21d88179225 cs_last: 21c9504986d mask: ffffffffffffffff
[ 450.634050] clocksource: Switched to clocksource hpet
docker-desktop:~# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
```
Until this issue is resolved in the stable version, it is impossible to use Docker on a Mac.
We’ve been able to use docker on macs as long as we don’t let the Mac sleep. If it sleeps we have to restart docker.
> We’ve been able to use docker on macs as long as we don’t let the Mac sleep. If it sleeps we have to restart docker.
Yeah, I know, but that is not much of a solution, especially on a Mac, which sleeps a lot and has a limited battery. I don't even use a Mac, but we have a team that does and I really can't even understand why. It is so limited and weak. Unfortunately I'm the one handling Docker images for them. :(
Indeed. It’s been a real source of frustration for most of us because we use Docker a lot on laptops. Sleep is very common even when running long jobs. It has to be fixed soon.
Slightly different problem, but I think it might be a problem with OSX itself...
Running VBox for a Linux VM (so a similar situation to Docker): after OSX sleep and then resume, the whole computer is unusable, requiring a restart...
This is some shoddy OSX resume/restart logic, I think. Any way to downgrade OSX to a version which worked?
@smaudet Never had that issue with VBox. I used Docker Toolbox with NFS enabled for a very long time and never had the issue described here. So I doubt it's related.
> Slightly different problem, but I think it might be a problem with OSX itself...
If I had to guess, it's an interaction between the three systems, and not something that is exclusively the fault of any one system. It works fine on VMware Linux on OSX, but it doesn't work on Docker Linux on OSX, where Linux and OSX are the same versions for both cases. Perhaps when OSX wakes from a sleep, Linux isn't being told what happened and it doesn't reset the clock stability check properly on Docker, whereas on VMware it is. Since VMware can do it but Docker can't, that suggests OSX isn't to blame alone for this problem.
@dchassin the curious thing is there isn't an issue on the windows or linux host side of things... perhaps it is partly related to the software? I.e. VMWare catches the sleep event while Docker and VBox don't...
Could be hardware too; I've not ruled that out yet (I would try to Boot Camp a Windows or Linux host and see if the problem persists).
I have created an issue in the Xdebug repo asking for a way to disable this call to gettimeofday, as it is only used to timestamp stack entries and I think that 99% of applications that use Xdebug don't need this info.
Thanks @carlos-granados for your initiative. This could reduce the symptoms of the slow-clocksource problem.
Since gettimeofday is used in many cases other than debugging, it would be great if this issue on Macs could be solved at the core, i.e. inside Docker. (There are also other low-level calls which are slow while debugging, e.g. getpid.)
Is there a different clock source that can be used that does not have this problem?
> Is there a different clock source that can be used that does not have this problem?
There are three clock sources available most of the time: `tsc`, `hpet`, and `acpi_pm`. If TSC desyncs (causing this issue) then we're left with only `hpet` and `acpi_pm`.
I'll try to get around to running some benchmarks with `acpi_pm` instead, but from my reading about different clock sources I suspect anything non-TSC will end up with the same issue.
As a side note, here are a couple of helpful bash scripts we've been using internally lately to at least increase visibility of the issue; they may or may not be useful to others experiencing it.
Verifying whether this is currently affecting your containers:
```bash
YELLOW="\033[0;33m"
RED="\033[31m"
RESET="\033[0m"

DOCKER_CLOCK_SOURCE=$(docker exec -ti {my_docker_container} /bin/bash -c 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource | tr -d "[:space:]"')
if [[ ${DOCKER_CLOCK_SOURCE} != "tsc" ]]; then
    echo -e "${RED}WARNING!${RESET} - ${YELLOW}Clock drift detected. Please restart Docker Desktop to avoid severe performance issues${RESET}"
fi
```
Automatically killing and restarting Docker Desktop via the CLI:
```bash
echo "Killing Docker"
osascript -e 'quit app "Docker"'
echo "Restarting Docker"
open -a Docker
echo "Waiting for Docker to start"
while ! docker system info > /dev/null 2>&1; do printf "."; sleep 1; done; printf "\n"
```
I'm currently running into the same issue. A test suite that should run for 2 minutes max runs 11 minutes after sleep. After restarting, it's back to 2 minutes... 🤔
Sometime during the last week, centos:7 slowed down significantly when running gridlabd validation tests, which use time calls a lot. I wonder if it has anything to do with the change to the clock calls.
Also suffering from this issue.
I've created a script, inspired by @Dolondro's scripts, that runs a polling check alongside `docker-compose up` to detect clocksource changes and automatically restarts Docker and your containers when that happens.
```bash
CONTAINER_NAME="my_container"

trap stop_docker INT
function stop_docker {
    wait $PROCESS
    exit
}

echo "Ensuring all containers are built"
docker-compose up --no-start

echo "Booting containers"
docker-compose up &
PROCESS=$!
sleep 5

while true; do
    CLOCK_SOURCE=$(docker exec -ti $CONTAINER_NAME /bin/bash -c 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource' | tr -d '[:space:]')
    if [ "$CLOCK_SOURCE" != "tsc" ]; then
        echo "Restarting docker now to fix out-of-sync hardware clock!"
        docker ps -q | xargs -L1 docker stop
        osascript -e 'quit app "Docker"'
        open --background -a Docker
        echo "Giving docker time to start..."
        until docker info 2> /dev/null 1> /dev/null; do
            sleep 1
        done
        echo "Docker is up and running again! Booting containers!"
        docker-compose up &
        PROCESS=$!
    fi
    sleep 5
done
```
Bump, same issue, really hoping someone can find a solution that is not to restart Docker after waking up and seeing the clock is then set to use `hpet`. :(
Also with version 2.1.0.5.
Diagnostics ID: 959AB9DD-E4E2-43B6-94C6-2C4058050B79/20200215151633
Are you sure this is just a Docker issue? I've been having bad PHP performance issues on my Mac, whether running in Docker or running on my host machine. After spending a day experimenting and researching to try to find the cause of it, I stumbled upon this thread.
I haven't used gdb to profile PHP yet, but I have experimentally found that including a file (via a simple `include("filename.php");`) takes 0.0001 seconds normally and about 0.003 seconds (30x longer!) when PHP is going slow.
After seeing this thread, I tried out the same reproduction steps:
This could be a PHP issue? I'll be honest, I don't fully understand what `tsc` vs `hpet` means, but I think this issue (or at least a remotely related one) extends beyond Docker.
@stevendesu blame can be spread around: Apple, Docker, and to a much lesser degree Xdebug itself (maybe gettimeofday() isn't really necessary during debugging? That is a matter of opinion).
> @stevendesu blame can be spread around: Apple, Docker, and to a much lesser degree Xdebug itself (maybe gettimeofday() isn't really necessary during debugging? That is a matter of opinion).
Hi Ralph,
It is not just Xdebug that is affected; New Relic, for example, is also affected to a lesser degree. Whether it is noticeable or just measurable really depends on what type of application one is running; in our case we are running a Magento 1 shop and, well, you feel it.
As an experiment I eliminated all the calls to `ftime()` in gridlabd, but it didn't make much difference. Does anyone know whether there are other system calls that are affected by this problem?
This is probably a good source of information for tracking down which syscalls are affected: https://0xax.gitbooks.io/linux-insides/Timers/linux-timers-7.html
I personally noticed that this problem affected the performance of `include` / `require` / `require_once` calls, specifically when relying on the resolver (when a full path was not provided), e.g. when I did something like `require "\ThirdPartyLib\Lib.php";` to pull in a Composer dependency.
My guess is that it was using `fstat` or some similar system call to check for file existence (iteratively checking each directory in the include path) and that call was affected.
I'm having what I think is the same problem on Windows 10, and it's not just containers that are slow; the Docker Desktop UI is as well.
libc has (at least) two other calls for getting timing information: clock() and clock_gettime(). Although clock() returns CPU time rather than wall time, I was wondering whether they would have the same problem as gettimeofday()?
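For what it's worth, here is a hedged sketch (my own example, not from this thread) that compares the three: on x86-64 Linux, both gettimeofday() and clock_gettime(CLOCK_MONOTONIC) normally use the same userspace (vDSO) fast path, so I would expect them to degrade together once the kernel falls back to a clocksource it cannot read from userspace, whereas clock() reports CPU time and typically goes through a syscall either way.

```c
/* Compare the average cost of gettimeofday(), clock_gettime(CLOCK_MONOTONIC)
 * and clock(). Absolute numbers are machine-dependent; the interesting part
 * is which calls jump in cost after the clocksource falls back from tsc. */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define ITERATIONS 1000000L

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    struct timeval tv;
    struct timespec ts;
    double t0;

    t0 = now_ns();
    for (long i = 0; i < ITERATIONS; i++)
        gettimeofday(&tv, NULL);
    printf("gettimeofday:             %.1f ns/call\n", (now_ns() - t0) / ITERATIONS);

    t0 = now_ns();
    for (long i = 0; i < ITERATIONS; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    printf("clock_gettime(MONOTONIC): %.1f ns/call\n", (now_ns() - t0) / ITERATIONS);

    t0 = now_ns();
    for (long i = 0; i < ITERATIONS; i++)
        clock();
    printf("clock():                  %.1f ns/call\n", (now_ns() - t0) / ITERATIONS);

    return 0;
}
```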
Issues go stale after 90 days of inactivity.
Mark the issue as fresh with a /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.
Prevent issues from auto-closing with a /lifecycle frozen comment.
If this issue is safe to close now, please do so.
Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale
/lifecycle frozen
Can anyone confirm that this issue is no longer a problem?
I still see the issue on Docker Desktop 2.4.0.0 (48506)
Still happening on 2.4.1.0 (48583). Need to restart docker every morning. After sleep all requests are approximately three times slower.
Same on Docker Desktop 2.4.0.0 (48506)
Still seeing it on Docker Desktop 2.4.1.0 (48583)
Before:
```
/sys/devices/system/clocksource/clocksource0 # cat available_clocksource
tsc hpet acpi_pm
/sys/devices/system/clocksource/clocksource0 # cat current_clocksource
tsc
```
Sleep the Mac for about a minute or so...
```
/sys/devices/system/clocksource/clocksource0 # dmesg -T | tail -n 15
[Mon Oct 5 09:01:31 2020] device vethaa18b9d entered promiscuous mode
[Mon Oct 5 09:01:31 2020] IPv6: ADDRCONF(NETDEV_UP): vethaa18b9d: link is not ready
[Mon Oct 5 09:01:32 2020] IPVS: ftp: loaded support on port[0] = 21
[Mon Oct 5 09:01:32 2020] eth0: renamed from veth650fbcc
[Mon Oct 5 09:01:32 2020] IPv6: ADDRCONF(NETDEV_CHANGE): vethaa18b9d: link becomes ready
[Mon Oct 5 09:01:32 2020] br-dd52805928be: port 11(vethaa18b9d) entered blocking state
[Mon Oct 5 09:01:32 2020] br-dd52805928be: port 11(vethaa18b9d) entered forwarding state
[Mon Oct 5 09:03:37 2020] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[Mon Oct 5 09:03:37 2020] clocksource: 'hpet' wd_now: 5f6deca1 wd_last: 5e667328 mask: ffffffff
[Mon Oct 5 09:03:37 2020] clocksource: 'tsc' cs_now: 14a95742623e8 cs_last: 14a68725c414c mask: ffffffffffffffff
[Mon Oct 5 09:03:37 2020] tsc: Marking TSC unstable due to clocksource watchdog
[Mon Oct 5 09:03:37 2020] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[Mon Oct 5 09:03:37 2020] sched_clock: Marking unstable (656619985969, -12297681)<-(656645215991, -37522988)
[Mon Oct 5 09:03:37 2020] clocksource: Switched to clocksource hpet
[Mon Oct 5 09:03:38 2020] hrtimer: interrupt took 2680800 ns
```
After:
```
/sys/devices/system/clocksource/clocksource0 # cat available_clocksource
hpet acpi_pm
/sys/devices/system/clocksource/clocksource0 # cat current_clocksource
hpet
```
And then terrible performance on anything that uses system clock calls.
Forgot to add:
Issue still occurs on:
@niksynov are you using Xdebug 3? It should be fixed with that version
At most it would be worked around in Xdebug 3? As this is more of an issue with Docker Desktop not handling sleep?
Yes, it is worked around by using a different way to get the time; you are right that the issue still persists in Docker.
> Yes, it is worked around by using a different way to get the time; you are right that the issue still persists in Docker.
That's not really true. Although Xdebug does use a different way of getting the time, that does not solve this issue. Xdebug 3 solves it by not asking for the time if it doesn't need to. In just `debug` mode there is no reason to know the time, so Xdebug doesn't make the broken docker-on-mac system call.
Expected behavior
Test scenario should always take 30s to complete
Actual behavior
Test scenario takes 30s to complete after Docker for Mac restart, and takes 1m30s-2m to complete after sleep and wakeup.
Information
I have a PHP test scenario for my application that I can run to check how good or bad performance is. I also have a Linux laptop to compare with. This test in Docker on Linux completes in 30s. When I restart Docker for Mac, it runs in 30s on Docker for Mac as well, but after I put the MacBook to sleep and wake it up, the very same test takes 1m30s-2m, and it stays that slow until I restart Docker for Mac. After a restart, the test takes 30s again. I can reproduce this very reliably every time I try. The very same issue was reported a year ago here: https://github.com/EugenMayer/docker-sync/issues/524
Diagnostic logs
Steps to reproduce the behavior