Closed shenshouer closed 8 years ago
This might be related to #1838.
Ping @dongluochen
@shenshouer can you try with master? dockerswarm/swarm:master
@shenshouer Thanks for reporting the issue! Do you see a lot of threads stuck at syscall? PR #1843 might address the issue. If you have the swarm repo you can build and test it, or it'll be released soon.
goroutine 32255 [syscall, 26 minutes]:
syscall.Syscall(0x0, 0x10e6, 0xc83a34c000, 0x1000, 0x42993d, 0xc822102a80, 0x6424e8)
/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.read(0x10e6, 0xc83a34c000, 0x1000, 0x1000, 0x72, 0x0, 0x0)
/usr/local/go/src/syscall/zsyscall_linux_amd64.go:783 +0x5f
syscall.Read(0x10e6, 0xc83a34c000, 0x1000, 0x1000, 0x88ecd2dfc509e201, 0x0, 0x0)
/usr/local/go/src/syscall/syscall_unix.go:160 +0x4d
net.(*netFD).Read(0xc83968f960, 0xc83a34c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/fd_unix.go:228 +0x18b
net.(*conn).Read(0xc82fdac528, 0xc83a34c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:172 +0xe4
net/http.noteEOFReader.Read(0x7f282a4ffc88, 0xc82fdac528, 0xc839559bd8, 0xc83a34c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(*noteEOFReader).Read(0xc8224cf060, 0xc83a34c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
<autogenerated>:126 +0xd0
bufio.(*Reader).fill(0xc83a2dac00)
/usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).Peek(0xc83a2dac00, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(*persistConn).readLoop(0xc839559b80)
/usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:685 +0xc78
@vieux @dongluochen The swarm binary I used was built from the master branch of the source code.
Everything is OK when I use the docker command to run containers, e.g.:
for i in {1..9000};do docker -H :4000 run -d --name=web_$i nginx;done
So is the problem docker-compose?
@shenshouer Would you mind sharing your compose file?
@dongluochen version of compose I used :
docker-compose version: 1.5.1
docker-py version: 1.5.0
CPython version: 2.7.9
OpenSSL version: OpenSSL 1.0.1e 11 Feb 2013
@shenshouer We released [swarm 1.1.3-rc1](https://github.com/docker/swarm/releases/tag/v1.1.3-rc1) today. It'd be great if you could run a test to see if it fixes this issue.
OK, I will test it in a moment.
my testing environment:
OS:
worker-xxx-120 yooadmin # cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=962.0.0
VERSION_ID=962.0.0
BUILD_ID=2016-02-18-0540
PRETTY_NAME="CoreOS 962.0.0 (Coeur Rouge)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
docker: Docker version 1.10.1, build 88b8b3a
docker-compose: docker-compose version: 1.5.1
swarm: swarm version 1.1.3 (da37fb5)
swarm-manager.service:
[Unit]
Description=swarm
After=docker.service consul-server.service
Requires=docker.service consul-server.service
[Service]
Environment=GOMAXPROCS=40
TimeoutStartSec=0
Restart=always
LimitNOFILE=1048576
LimitNPROC=1048576
# ExecStartPre=-/usr/bin/docker rm swarm
ExecStart=/opt/bin/swarm --experimental --log-level debug manage \
-H :4000 \
--replication \
--advertise xxx.xxx:4000 \
--replication-ttl 20s \
--engine-refresh-min-interval 2s \
--engine-refresh-max-interval 10s \
--heartbeat 10s \
consul://127.0.0.1:8500
# ExecStop=/usr/bin/docker kill swarm
[Install]
WantedBy=multi-user.target
swarm-node.service:
[Unit]
Description=swarm
After=docker.service consul-server.service
Requires=docker.service consul-server.service
[Service]
Environment=GOMAXPROCS=40
TimeoutStartSec=0
Restart=always
LimitNOFILE=1048576
LimitNPROC=1048576
# ExecStartPre=-/usr/bin/docker rm swarm
ExecStart=/opt/bin/swarm --experimental --log-level debug join \
--advertise xxx.xxx:2375 \
--ttl 20s \
--heartbeat 10s \
consul://127.0.0.1:8500
# ExecStop=/usr/bin/docker kill swarm
[Install]
WantedBy=multi-user.target
docker-compose.yml:
web:
image: nginx
restart: always
labels:
com.docker.swarm.reschedule-policies: "[\"on-node-failure\"]"
Run the containers with docker-compose:
DOCKER_HOST=127.0.0.1:4000 /opt/bin/docker-compose up --force-recreate -d
Scale the service to 3000 containers:
DOCKER_HOST=127.0.0.1:4000 /opt/bin/docker-compose scale web=3000
and I got these errors:
ERROR: for 54 Container created but refresh didn't report it back
ERROR: for 58 Container created but refresh didn't report it back
ERROR: for 152 ('Connection aborted.', error(24, 'Too many open files'))
ERROR: Couldn't connect to Docker daemon at http://127.0.0.1:4000 - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.
# get the running containers after error.
worker-xxx-120 yooadmin # DOCKER_HOST=127.0.0.1:4000 /opt/bin/docker-compose ps |wc -l
1390
worker-xxx-120 yooadmin # docker -H :4000 ps |wc -l
1285
In order to get 3000 running containers, I need to execute this command repeatedly many times. Then I stop the running containers:
DOCKER_HOST=127.0.0.1:4000 /opt/bin/docker-compose stop
ERROR: for yooadmin_web_408 ('Connection aborted.', error(24, 'Too many open files'))
ERROR: Couldn't connect to Docker daemon at http://127.0.0.1:4000 - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.
Luckily, the following command executed without error:
DOCKER_HOST=127.0.0.1:4000 /opt/bin/docker-compose rm -f
The same happens with docker-compose version 1.6.2, build 4d72027.
I have the same issue.
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion
runtime stack:
runtime.throw(0xfb32a0, 0x11)
/usr/local/go/src/runtime/panic.go:527 +0x90
runtime.checkmcount()
/usr/local/go/src/runtime/proc1.go:93 +0x94
runtime.mcommoninit(0xc842684000)
/usr/local/go/src/runtime/proc1.go:113 +0xf2
runtime.allocm(0xc820016000, 0x0, 0x40dc74)
/usr/local/go/src/runtime/proc1.go:868 +0x9c
runtime.newm(0x0, 0xc820016000)
/usr/local/go/src/runtime/proc1.go:1099 +0x2f
runtime.startm(0xc820016000, 0xc820000400)
/usr/local/go/src/runtime/proc1.go:1186 +0xef
runtime.handoffp(0xc820016000)
/usr/local/go/src/runtime/proc1.go:1209 +0x2c7
runtime.retake(0x1d84d93ba3f77, 0x0)
/usr/local/go/src/runtime/proc1.go:3126 +0x1e0
runtime.sysmon()
/usr/local/go/src/runtime/proc1.go:3053 +0x20f
runtime.mstart1()
/usr/local/go/src/runtime/proc1.go:715 +0xe8
runtime.mstart()
/usr/local/go/src/runtime/proc1.go:685 +0x72
goroutine 1 [chan receive, 1550 minutes]:
github.com/docker/swarm/api.(*Server).ListenAndServe(0xc82011a1b0, 0x0, 0x0)
/go/src/github.com/docker/swarm/api/server.go:117 +0x383
github.com/docker/swarm/cli.manage(0xc8200f87e0)
/go/src/github.com/docker/swarm/cli/manage.go:332 +0x1c3a
github.com/codegangsta/cli.Command.Run(0xefe130, 0x6, 0xefe0d0, 0x1, 0x0, 0x0, 0x0, 0xf96100, 0x17, 0x0, ...)
/go/src/github.com/docker/swarm/Godeps/_workspace/src/github.com/codegangsta/cli/command.go:137 +0x107e
github.com/codegangsta/cli.(*App).Run(0xc8200f85a0, 0xc820060000, 0x5, 0x5, 0x0, 0x0)
/go/src/github.com/docker/swarm/Godeps/_workspace/src/github.com/codegangsta/cli/app.go:176 +0xfff
github.com/docker/swarm/cli.Run()
/go/src/github.com/docker/swarm/cli/cli.go:65 +0x562
main.main()
/go/src/github.com/docker/swarm/main.go:13 +0x14
goroutine 21 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x13c8ba0)
/go/src/github.com/docker/swarm/Godeps/_workspace/src/github.com/golang/glog/glog.go:882 +0x67
created by github.com/golang/glog.init.1
/go/src/github.com/docker/swarm/Godeps/_workspace/src/github.com/golang/glog/glog.go:410 +0x297
@shenshouer Thanks for providing the related info!
@anarcher What are the top 2 frequent threads in your dump? Can you describe your set up and what workload gets to the crash?
@shenshouer I did not get a swarm crash from the 10000-thread limit by running your application. I found an issue where docker-compose may flood swarm. (There are several minor differences from your setup that I may look into. I'm not running compose from the same host, but from an external one so that I can monitor the components separately.) docker-compose may interrupt its own operation because of this.
When docker-compose tries to scale the web service to 3000 instances, it generates 3000 network connections to DOCKER_HOST simultaneously. This doesn't kill swarm, but the cluster (node and/or swarm manager) is not able to handle that many simultaneous requests, and some requests fail. I think that's the major reason you cannot scale to 3000. If you do it sequentially against the swarm manager, it shouldn't be a problem.
core@ip-172-19-231-118 ~/crash $ ../docker-compose scale web=3000
Creating and starting 3000 ...
Creating and starting 3001 ...
...
core@ip-172-19-231-118 ~ $ for i in {1..100}; do netstat -an | grep -i 3375 | wc -l; sleep 2; done
0
1
2
2
2
2
2
552
1306
1904
2420
2441
2442
2442
2442
2442
2441
2441
2441
2441
2441
2447
2474
2477
2483
2520
2843
2899
2899
2899
2899
2899
2899
2899
2899
2848
2637
2092
1870
1544
1469
1422
1335
1296
1264
1224
1209
1190
1170
1143
1088
1043
996
939
854
524
336
336
336
336
125
100
99
99
99
98
98
98
98
79
20
19
20
1
1
@shenshouer @anarcher So far I think the problem is a request spike causing resource starvation on the swarm manager and docker nodes. In my test with Swarm 1.1.3-rc1 I did not hit the 10000-thread limit because docker-compose stopped first. I think docker-compose should have flow control so it does not flood swarm. And swarm should implement self-protection to handle high load, and protect the docker engine when it's close to resource starvation.
Here are swarm logs showing the issue. Swarm manager suffers from too many open concurrent connections.
ERRO[2234] HTTP error: Post http://dong-wily-2:2375/v1.15/containers/create?name=crash_web_5761: dial tcp 172.19.209.201:2375: socket: too many open files. Are you trying to connect to a TLS-enabled daemon without TLS? status=500
2016/02/27 07:55:15 http: Accept error: accept tcp [::]:3375: accept4: too many open files; retrying in 20ms
Docker daemons experience memory or file-handle starvation:
ERRO[2236] HTTP error: 500 Internal Server Error: Cannot start container e587188081c1be349e1ead35a4dc5315bcfbdd8c67d3ef52f07102a7084a410c: [9] System error: fork/exec /proc/self/exe: cannot allocate memory
status=500
...
ERRO[2238] HTTP error: 500 Internal Server Error: devmapper: Error saving transaction metadata: devmapper: Error creating metadata file: open /var/lib/docker/devicemapper/metadata/.tmp676640282: too many open files
status=500
Hi, @dongluochen
I'm sorry, I have no logs about it now. I was using Interlock
(v0.3.3) with interlock's stat plugin. I suspect the plugin makes many requests, which is why this issue appeared.
After shutting down the interlock container, the problem does not appear.
Thanks, anarcher
@dongluochen Using the docker run
command works fine, but there are some errors with docker-compose. I will test again next week and will post the logs.
@shenshouer Feel free to re-open if you think it still has problem. Your test results are welcome.
When I run
docker-compose scale -t 30 web=1000
it reports runtime: program exceeds 10000-thread limit very frequently.
version of swarm:
swarm version 1.1.2 (f947993)
OS:
docker