docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit
Apache License 2.0
5.76k stars 1.08k forks

swarm on mesos can't allocate container #1853

Closed keyboardfann closed 6 years ago

keyboardfann commented 8 years ago

Dear All,

I followed the swarm-on-mesos README guide: I ran the Docker Swarm manager and it registered with Mesos, but when I try to allocate a container on Mesos, it shows the following error.

docker: Error response from daemon: container failed to start after 5s.

Has anyone else met this problem? Could you give me some feedback or a solution? Thank you for taking the time to look into it.

Env:

OS: CentOS 7.1
Kernel: 4.4.2-1.el7.elrepo.x86_64
Mesos version: 0.27 mesosphere rpm
docker version: 1.10
docker swarm version: 1.1

Testing host:

kmaster.com  10.1.20.90  mesos master ,mesos slave ,docker, docker swarm manager
kslave.com 10.1.20.91  mesos slave,docker 

Run docker swarm

docker run -d -p 4000:2375 -p 3375:3375 \
swarm manage \
    -c mesos-experimental \
    --cluster-opt mesos.address=0.0.0.0 \
    --cluster-opt mesos.port=3375 \
    10.1.20.90:5050

Mesos Framework web

Active Frameworks
ID  Host    User    Name    Active Tasks    CPUs    Mem Disk    Max Share   Registered  Re-Registered
…be54-2ae8a2667ad0-0009   d8878dd94db3    root    swarm   0   2   1.9 GB  6.7 GB  100%    27 minutes ago  -

Docker info

[root@kmaster ~]# docker -H :4000  info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.1.2
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Offers: 0
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.4.2-1.el7.elrepo.x86_64
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: d8878dd94db3

Allocate docker container on swarm on mesos

[root@kmaster ~]# docker -H :4000  run --cpu-shares 2 -m 256m hello-world
docker: Error response from daemon: container failed to start after 5s.
See 'docker run --help'.

Swarm manager log:

[root@kmaster ~]# docker logs d8878dd94db3
time="2016-02-22T07:31:24Z" level=warning msg="WARNING: the mesos driver is currently experimental, use at your own risks" 
ERROR: logging before flag.Parse: I0222 07:31:24.024666   1 scheduler.go:323] Initializing mesos scheduler driver
ERROR: logging before flag.Parse: I0222 07:31:24.024728   1 scheduler.go:792] Starting the scheduler driver...
ERROR: logging before flag.Parse: I0222 07:31:24.024772   1 http_transporter.go:407] listening on 0.0.0.0 port 3375
ERROR: logging before flag.Parse: I0222 07:31:24.024783   1 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@172.17.0.2:3375
time="2016-02-22T07:31:24Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp 
ERROR: logging before flag.Parse: I0222 07:31:24.047018   1 scheduler.go:374] New master master@10.1.20.90:5050 detected
ERROR: logging before flag.Parse: I0222 07:31:24.047046   1 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
ERROR: logging before flag.Parse: I0222 07:31:24.052278   1 scheduler.go:535] Framework registered with ID=cc5df891-2d6b-45ec-be54-2ae8a2667ad0-0009
time="2016-02-22T07:31:34Z" level=error msg="Get http://kslave.com:2375/v1.15/info: dial tcp 182.92.102.7:2375: i/o timeout. Are you trying to connect to a TLS-enabled daemon without TLS?" 
time="2016-02-22T07:31:44Z" level=error msg="Get http://kmaster.com:2375/v1.15/info: dial tcp 141.8.225.124:2375: i/o timeout. Are you trying to connect to a TLS-enabled daemon without TLS?" 
time="2016-02-22T07:32:20Z" level=error msg="HTTP error: container failed to start after 5s" status=500 
time="2016-02-22T07:36:56Z" level=error msg="HTTP error: container failed to start after 5s" status=500 
time="2016-02-22T08:02:55Z" level=error msg="HTTP error: container failed to start after 5s" status=500 

Allocating a container locally works fine

[root@kmaster ~]# docker  run --cpu-shares 2 -m 256m hello-world

Hello from Docker.
This message shows that your installation appears to be working correctly.

Spritekin commented 8 years ago

5s is not enough time to fetch the image, so the task fails. Increase your container timeout: in Swarm, use the environment variable SWARM_MESOS_TASK_TIMEOUT=180s or the parameter "--cluster-opt mesos.tasktimeout=180s".

Also, you might need to change your mesos slave configuration to allow a longer executor registration time: --executor_registration_timeout=3mins

Good luck!

keyboardfann commented 8 years ago

Dear @Spritekin ,

Thank you for the feedback. I increased the timeout, but it still fails.

Start the mesos slave with a 3-minute registration timeout

mesos-slave --master=zk://kmaster.com:2181/mesos --executor_registration_timeout=3mins --containerizers=docker,mesos --launcher=posix </dev/null >/dev/null &

Add a 180s task timeout when starting the swarm manager

docker run -d -p 4000:2375 -p 3375:3375 \
swarm manage \
    -c mesos-experimental \
    --cluster-opt mesos.address=0.0.0.0 \
    --cluster-opt mesos.port=3375 \
    --cluster-opt mesos.tasktimeout=180s \
    10.1.20.90:5050

Allocating a container still fails, now after 3 minutes

[root@kmaster bin]# docker -H :4000   run --cpu-shares 2 -m 256m hello-world

docker: Error response from daemon: container failed to start after 3m0s.

docker swarm log

[root@kmaster ~]# docker logs 144d7d5d289c
time="2016-02-23T04:53:42Z" level=warning msg="WARNING: the mesos driver is currently experimental, use at your own risks" 
ERROR: logging before flag.Parse: W0223 04:53:42.403937       1 scheduler.go:177] Failed to obtain username: user: Current not implemented on linux/amd64
ERROR: logging before flag.Parse: I0223 04:53:42.404075       1 scheduler.go:323] Initializing mesos scheduler driver
ERROR: logging before flag.Parse: I0223 04:53:42.404127       1 scheduler.go:792] Starting the scheduler driver...
ERROR: logging before flag.Parse: I0223 04:53:42.404178       1 http_transporter.go:407] listening on 0.0.0.0 port 3375
ERROR: logging before flag.Parse: I0223 04:53:42.404190       1 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@172.17.0.2:3375
time="2016-02-23T04:53:42Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp 
ERROR: logging before flag.Parse: I0223 04:53:42.425921       1 scheduler.go:374] New master master@10.1.20.90:5050 detected
ERROR: logging before flag.Parse: I0223 04:53:42.425981       1 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
ERROR: logging before flag.Parse: I0223 04:53:42.433117       1 scheduler.go:535] Framework registered with ID=2a01bb10-ddf2-4581-ae3d-2c0782cc72c2-0001
time="2016-02-23T04:53:52Z" level=error msg="Get http://kmaster.com:2375/v1.15/info: dial tcp 141.8.225.124:2375: i/o timeout. Are you trying to connect to a TLS-enabled daemon without TLS?" 
howardshaw commented 8 years ago

@keyboardfann I'm hitting the same problem as you. Did you manage to fix it?

keyboardfann commented 8 years ago

@howardshaw No, I'm still stuck on the error. My suspicion is that the Docker Swarm manager uses public DNS resolution: I set up kmaster.com as 10.1.20.90 (a private IP), but the Swarm manager tries to fetch info from 141.8.225.124 (the public DNS record for the domain, per whois).

howardshaw commented 8 years ago

I tested in a Vagrant box (bento/centos-7.1) environment

swarm version 1.1.2 (f947993)
mesos-0.28.0+git20160221.1e65c7f-0.2.220.centos701406.x86_64
kernel:4.4.2-1.el7.elrepo.x86_64
docker 1.10.2
API version:  1.22

I can see the swarm framework active on the Mesos web UI

Active Frameworks

ID  Host    User    Name    Active Tasks    CPUs    Mem Disk    Max Share   Registered  Re-Registered
…be48-74136c171391-0000   c3d292773e49    root    swarm   0   1   200 MB  33.4 GB 100%    9 minutes ago   -

docker swarm manage command

mesos slave
--executor_registration_timeout=3mins

docker run -d -p 4375:2375 -p 3375:3375 dockerswarm/swarm --debug manage -c mesos-experimental --cluster-opt mesos.address=0.0.0.0 --cluster-opt mesos.port=3375 --cluster-opt mesos.user=root --cluster-opt mesos.tasktimeout=180s zk://192.168.33.10:2181/mesos 

docker -H tcp://192.168.33.10:4375 run -idt --cpu-shares 2 busybox

docker swarm logs

[vagrant@master ~]$ docker logs c3d
time="2016-02-23T10:52:20Z" level=debug msg="Initializing discovery service" name=zk uri="192.168.33.10:2181/mesos" 
time="2016-02-23T10:52:20Z" level=info msg="Initializing discovery without TLS" 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing strategy" name=spread 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing filter" name=health 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing filter" name=port 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing filter" name=dependency 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing filter" name=affinity 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing filter" name=constraint 
time="2016-02-23T10:52:20Z" level=warning msg="WARNING: the mesos driver is currently experimental, use at your own risks" 
time="2016-02-23T10:52:20Z" level=debug msg="Initializing cluster" name=mesos 
ERROR: logging before flag.Parse: I0223 10:52:20.991635       1 scheduler.go:323] Initializing mesos scheduler driver
ERROR: logging before flag.Parse: I0223 10:52:20.991691       1 scheduler.go:792] Starting the scheduler driver...
ERROR: logging before flag.Parse: I0223 10:52:20.991779       1 http_transporter.go:407] listening on 0.0.0.0 port 3375
ERROR: logging before flag.Parse: I0223 10:52:20.991790       1 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@172.17.0.2:3375
time="2016-02-23T10:52:20Z" level=debug msg="Mesos driver started, status DRIVER_RUNNING" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/images/create" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/start" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/exec" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/networks/{networkid:.*}/connect" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/images/{name:.*}/tag" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/pause" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/rename" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/exec/{execid:.*}/start" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/images/load" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/stop" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/copy" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/kill" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/resize" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/attach" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/images/{name:.*}/push" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/create" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/unpause" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/volumes/create" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/auth" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/commit" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/build" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/restart" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/wait" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/networks/create" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/containers/{name:.*}/update" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/exec/{execid:.*}/resize" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=POST route="/networks/{networkid:.*}/disconnect" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=PUT route="/containers/{name:.*}/archive" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=DELETE route="/containers/{name:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=DELETE route="/images/{name:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=DELETE route="/networks/{networkid:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=DELETE route="/volumes/{name:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=HEAD route="/containers/{name:.*}/archive" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/info" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/json" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/{name:.*}/get" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/networks" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/json" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/archive" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/changes" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/json" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/_ping" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/version" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/{name:.*}/history" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/ps" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/logs" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/stats" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/exec/{execid:.*}/json" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/networks/{networkid:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/volumes/{volumename:.*}" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/viz" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/get" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/export" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/attach/ws" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/volumes" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/events" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/search" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/images/{name:.*}/json" 
time="2016-02-23T10:52:20Z" level=debug msg="Registering HTTP route" method=GET route="/containers/{name:.*}/top" 
time="2016-02-23T10:52:20Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp 
2016/02/23 10:52:20 Connected to 192.168.33.10:2181
2016/02/23 10:52:20 Connected to 192.168.33.10:2181
2016/02/23 10:52:21 Authenticated: id=95435117501612051, timeout=40000
2016/02/23 10:52:21 Authenticated: id=95435117501612050, timeout=10000
ERROR: logging before flag.Parse: I0223 10:52:21.010677       1 scheduler.go:374] New master master@192.168.33.10:5050 detected
ERROR: logging before flag.Parse: I0223 10:52:21.010693       1 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
ERROR: logging before flag.Parse: I0223 10:52:21.013451       1 scheduler.go:535] Framework registered with ID=0cf2ed3b-79be-4318-be48-74136c171391-0000
time="2016-02-23T10:52:21Z" level=debug msg="Framework registered" frameworkId=0cf2ed3b-79be-4318-be48-74136c171391-0000 name=mesos 
time="2016-02-23T10:52:21Z" level=debug msg="Offers received" name=mesos offers=1 
time="2016-02-23T10:52:21Z" level=error msg="lookup node on 10.0.2.3:53: server misbehaving" 
time="2016-02-23T10:52:39Z" level=debug msg="HTTP request received" method=GET uri="/v1.22/info" 
time="2016-02-23T10:52:47Z" level=debug msg="HTTP request received" method=POST uri="/v1.22/containers/create" 
time="2016-02-23T10:55:47Z" level=error msg="HTTP error: container failed to start after 3m0s" status=500 

mesos log

Log file created at: 2016/02/23 10:51:35
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0223 10:51:35.558143 11238 logging.cpp:188] INFO level logging started!
I0223 10:51:35.558596 11238 main.cpp:237] Build: 2016-02-21 12:15:59 by root
I0223 10:51:35.558611 11238 main.cpp:239] Version: 0.28.0
I0223 10:51:35.558615 11238 main.cpp:246] Git SHA: 1e65c7fd01ce111b6d1a146ec5d82a862fae255f
I0223 10:51:35.558647 11238 main.cpp:260] Using 'HierarchicalDRF' allocator
I0223 10:51:35.566476 11238 leveldb.cpp:174] Opened db in 7.680268ms
I0223 10:51:35.569155 11238 leveldb.cpp:181] Compacted db in 2.637937ms
I0223 10:51:35.569191 11238 leveldb.cpp:196] Created db iterator in 8012ns
I0223 10:51:35.569205 11238 leveldb.cpp:202] Seeked to beginning of db in 5880ns
I0223 10:51:35.569255 11238 leveldb.cpp:271] Iterated through 3 keys in the db in 44100ns
I0223 10:51:35.569288 11238 replica.cpp:779] Replica recovered with log positions 57 -> 58 with 0 holes and 0 unlearned
I0223 10:51:35.570318 11254 log.cpp:236] Attempting to join replica to ZooKeeper group
I0223 10:51:35.570667 11256 recover.cpp:447] Starting replica recovery
I0223 10:51:35.570722 11256 recover.cpp:473] Replica is in VOTING status
I0223 10:51:35.570746 11256 recover.cpp:462] Recover process terminated
I0223 10:51:35.572504 11238 main.cpp:471] Starting Mesos master
I0223 10:51:35.573479 11238 master.cpp:376] Master 0cf2ed3b-79be-4318-be48-74136c171391 (192.168.33.10) started on 192.168.33.10:5050
I0223 10:51:35.573493 11238 master.cpp:378] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_http="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="192.168.33.10" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://192.168.33.10:2181/mesos" --zk_session_timeout="10secs"
I0223 10:51:35.573616 11238 master.cpp:425] Master allowing unauthenticated frameworks to register
I0223 10:51:35.573621 11238 master.cpp:430] Master allowing unauthenticated slaves to register
I0223 10:51:35.573627 11238 master.cpp:468] Using default 'crammd5' authenticator
W0223 10:51:35.573633 11238 authenticator.cpp:511] No credentials provided, authentication requests will be refused
I0223 10:51:35.590158 11238 authenticator.cpp:518] Initializing server SASL
I0223 10:51:35.602835 11259 master.cpp:1651] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0223 10:51:35.602859 11259 contender.cpp:147] Joining the ZK group
I0223 10:51:35.603021 11259 group.cpp:349] Group process (group(1)@192.168.33.10:5050) connected to ZooKeeper
I0223 10:51:35.603040 11259 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0223 10:51:35.603046 11259 group.cpp:427] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0223 10:51:35.603179 11256 group.cpp:349] Group process (group(4)@192.168.33.10:5050) connected to ZooKeeper
I0223 10:51:35.603193 11256 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0223 10:51:35.603196 11256 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0223 10:51:35.603302 11258 group.cpp:349] Group process (group(3)@192.168.33.10:5050) connected to ZooKeeper
I0223 10:51:35.603313 11258 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0223 10:51:35.603327 11258 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0223 10:51:35.603516 11257 group.cpp:349] Group process (group(2)@192.168.33.10:5050) connected to ZooKeeper
I0223 10:51:35.603528 11257 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0223 10:51:35.603533 11257 group.cpp:427] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0223 10:51:35.611888 11253 detector.cpp:152] Detected a new leader: (id='10')
I0223 10:51:35.612138 11253 contender.cpp:263] New candidate (id='11') has entered the contest for leadership
I0223 10:51:35.612426 11253 network.hpp:413] ZooKeeper group memberships changed
I0223 10:51:35.620558 11256 group.cpp:700] Trying to get '/mesos/json.info_0000000010' in ZooKeeper
I0223 10:51:35.620836 11259 group.cpp:700] Trying to get '/mesos/log_replicas/0000000009' in ZooKeeper
I0223 10:51:35.639495 11256 detector.cpp:479] A new leading master (UPID=master@192.168.33.10:5050) is detected
I0223 10:51:35.639561 11256 master.cpp:1712] The newly elected leader is master@192.168.33.10:5050 with id 71d3473c-e85f-4804-89b6-077f89801eab
I0223 10:51:35.639747 11259 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.33.10:5050 }
I0223 10:51:35.639787 11259 network.hpp:413] ZooKeeper group memberships changed
I0223 10:51:35.639823 11259 group.cpp:700] Trying to get '/mesos/log_replicas/0000000009' in ZooKeeper
I0223 10:51:35.650660 11259 group.cpp:700] Trying to get '/mesos/log_replicas/0000000010' in ZooKeeper
I0223 10:51:35.651504 11259 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.33.10:5050 }
I0223 10:51:37.656626 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43343 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:51:38.790415 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43341 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:51:44.009006 11259 detector.cpp:152] Detected a new leader: (id='11')
I0223 10:51:44.009107 11259 group.cpp:700] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
I0223 10:51:44.011651 11255 network.hpp:413] ZooKeeper group memberships changed
I0223 10:51:44.011737 11255 group.cpp:700] Trying to get '/mesos/log_replicas/0000000010' in ZooKeeper
I0223 10:51:44.012043 11259 detector.cpp:479] A new leading master (UPID=master@192.168.33.10:5050) is detected
I0223 10:51:44.012259 11259 master.cpp:1712] The newly elected leader is master@192.168.33.10:5050 with id 0cf2ed3b-79be-4318-be48-74136c171391
I0223 10:51:44.012284 11259 master.cpp:1725] Elected as the leading master!
I0223 10:51:44.012290 11259 master.cpp:1470] Recovering from registrar
I0223 10:51:44.012435 11259 registrar.cpp:307] Recovering registrar
I0223 10:51:44.012609 11259 log.cpp:659] Attempting to start the writer
I0223 10:51:44.012795 11259 replica.cpp:493] Replica received implicit promise request from (17)@192.168.33.10:5050 with proposal 11
I0223 10:51:44.015032 11257 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.33.10:5050 }
I0223 10:51:44.015135 11259 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 2.321452ms
I0223 10:51:44.015153 11259 replica.cpp:342] Persisted promised to 11
I0223 10:51:44.015261 11259 coordinator.cpp:238] Coordinator attempting to fill missing positions
I0223 10:51:44.015302 11259 log.cpp:675] Writer started with ending position 58
I0223 10:51:44.015452 11259 leveldb.cpp:436] Reading position from leveldb took 26928ns
I0223 10:51:44.015491 11259 leveldb.cpp:436] Reading position from leveldb took 6736ns
I0223 10:51:44.015828 11259 registrar.cpp:340] Successfully fetched the registry (454B) in 3.364096ms
I0223 10:51:44.015874 11259 registrar.cpp:439] Applied 1 operations in 10404ns; attempting to update the 'registry'
I0223 10:51:44.016747 11259 log.cpp:683] Attempting to append 493 bytes to the log
I0223 10:51:44.016770 11259 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 59
I0223 10:51:44.016886 11259 replica.cpp:537] Replica received write request for position 59 from (18)@192.168.33.10:5050
I0223 10:51:44.018743 11259 leveldb.cpp:341] Persisting action (512 bytes) to leveldb took 1.832708ms
I0223 10:51:44.018771 11259 replica.cpp:712] Persisted action at 59
I0223 10:51:44.018911 11259 replica.cpp:691] Replica received learned notice for position 59 from @0.0.0.0:0
I0223 10:51:44.019932 11259 leveldb.cpp:341] Persisting action (514 bytes) to leveldb took 1.003669ms
I0223 10:51:44.019966 11259 replica.cpp:712] Persisted action at 59
I0223 10:51:44.019973 11259 replica.cpp:697] Replica learned APPEND action at position 59
I0223 10:51:44.020283 11259 registrar.cpp:484] Successfully updated the 'registry' in 4.391168ms
I0223 10:51:44.020323 11259 registrar.cpp:370] Successfully recovered registrar
I0223 10:51:44.020424 11259 log.cpp:702] Attempting to truncate the log to 59
I0223 10:51:44.020500 11259 master.cpp:1522] Recovered 2 slaves from the Registry (454B) ; allowing 10mins for slaves to re-register
I0223 10:51:44.020531 11259 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 60
I0223 10:51:44.020619 11259 replica.cpp:537] Replica received write request for position 60 from (19)@192.168.33.10:5050
I0223 10:51:44.022091 11259 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 1.453589ms
I0223 10:51:44.022119 11259 replica.cpp:712] Persisted action at 60
I0223 10:51:44.022230 11259 replica.cpp:691] Replica received learned notice for position 60 from @0.0.0.0:0
I0223 10:51:44.023128 11259 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 876077ns
I0223 10:51:44.023187 11259 leveldb.cpp:399] Deleting ~2 keys from leveldb took 26015ns
I0223 10:51:44.023202 11259 replica.cpp:712] Persisted action at 60
I0223 10:51:44.023212 11259 replica.cpp:697] Replica learned TRUNCATE action at position 60
I0223 10:51:44.163519 11259 master.cpp:4470] Re-registering slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 at slave(1)@192.168.33.11:5051 (node)
I0223 10:51:44.163646 11259 registrar.cpp:439] Applied 1 operations in 8989ns; attempting to update the 'registry'
I0223 10:51:44.163787 11259 log.cpp:683] Attempting to append 493 bytes to the log
I0223 10:51:44.163807 11259 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 61
I0223 10:51:44.163887 11259 replica.cpp:537] Replica received write request for position 61 from (21)@192.168.33.10:5050
I0223 10:51:44.165007 11259 leveldb.cpp:341] Persisting action (512 bytes) to leveldb took 1.104445ms
I0223 10:51:44.165035 11259 replica.cpp:712] Persisted action at 61
I0223 10:51:44.165144 11259 replica.cpp:691] Replica received learned notice for position 61 from @0.0.0.0:0
I0223 10:51:44.165913 11259 leveldb.cpp:341] Persisting action (514 bytes) to leveldb took 752241ns
I0223 10:51:44.165933 11259 replica.cpp:712] Persisted action at 61
I0223 10:51:44.165940 11259 replica.cpp:697] Replica learned APPEND action at position 61
I0223 10:51:44.166191 11259 registrar.cpp:484] Successfully updated the 'registry' in 2.52416ms
I0223 10:51:44.166263 11259 log.cpp:702] Attempting to truncate the log to 61
I0223 10:51:44.166793 11259 master.cpp:4552] Re-registered slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 at slave(1)@192.168.33.11:5051 (node) with cpus(*):1; mem(*):200; disk(*):34248; ports(*):[31000-32000]
I0223 10:51:44.166841 11259 master.cpp:4587] Sending updated checkpointed resources  to slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 at slave(1)@192.168.33.11:5051 (node)
I0223 10:51:44.166874 11259 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 62
I0223 10:51:44.166960 11259 hierarchical.cpp:473] Added slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 (node) with cpus(*):1; mem(*):200; disk(*):34248; ports(*):[31000-32000] (allocated: )
I0223 10:51:44.167068 11259 replica.cpp:537] Replica received write request for position 62 from (22)@192.168.33.10:5050
I0223 10:51:44.169590 11257 master.cpp:4649] Received update of slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 at slave(1)@192.168.33.11:5051 (node) with total oversubscribed resources 
I0223 10:51:44.169760 11257 hierarchical.cpp:531] Slave 8891626a-1f11-46ba-b3aa-1616df93cbf7-S0 (node) updated with oversubscribed resources  (total: cpus(*):1; mem(*):200; disk(*):34248; ports(*):[31000-32000], allocated: )
I0223 10:51:44.169867 11259 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 2.785932ms
I0223 10:51:44.169880 11259 replica.cpp:712] Persisted action at 62
I0223 10:51:44.169978 11259 replica.cpp:691] Replica received learned notice for position 62 from @0.0.0.0:0
I0223 10:51:44.171080 11259 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 1.086275ms
I0223 10:51:44.171118 11259 leveldb.cpp:399] Deleting ~2 keys from leveldb took 14211ns
I0223 10:51:44.171126 11259 replica.cpp:712] Persisted action at 62
I0223 10:51:44.171131 11259 replica.cpp:697] Replica learned TRUNCATE action at position 62
I0223 10:51:48.851790 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43342 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:51:58.857028 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43342 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:52:08.863847 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43341 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:52:18.874194 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43341 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:52:21.011353 11259 master.cpp:2280] Received SUBSCRIBE call for framework 'swarm' at scheduler(1)@172.17.0.2:3375
I0223 10:52:21.011420 11259 master.cpp:2351] Subscribing framework swarm with checkpointing disabled and capabilities [  ]
I0223 10:52:21.011765 11259 hierarchical.cpp:265] Added framework 0cf2ed3b-79be-4318-be48-74136c171391-0000
I0223 10:52:21.011920 11259 master.cpp:5355] Sending 1 offers to framework 0cf2ed3b-79be-4318-be48-74136c171391-0000 (swarm) at scheduler(1)@172.17.0.2:3375
I0223 10:52:28.880982 11259 http.cpp:501] HTTP GET for /master/state.json from 192.168.33.1:43342 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'
I0223 10:52:29.856844 11259 http.cpp:501] HTTP GET for /master/state.json from 
Spritekin commented 8 years ago

I prefer using env variables for the configuration. Here is my Marathon JSON file; maybe it helps with your own setup:

{
  "id": "swarmtest",
  "cpus": 2,
  "mem": 1024,
  "ports": [2376,3375],
  "requirePorts": true,
  "args": [
      "manage",
      "--strategy","binpack",
      "--filter=health",
      "-c","mesos-experimental",
      "--cluster-opt","mesos.address=0.0.0.0",
      "--cluster-opt","mesos.port=3375",
      "--cluster-opt","mesos.tasktimeout=180s",
      "--cluster-opt","mesos.offertimeout=15s",
      "zk://mesos-zk.aws.cmcrc.com:2181/mesos"
      ],
  "instances": 1,
  "acceptedResourceRoles": [ "*" ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "swarm:1.1.2",
      "network": "HOST",
      "forcePullImage": true
    }
  },
  "env": {
      "HOST_IP": "0.0.0.0",
      "SWARM_HOST": ":2376",
      "SWARM_MESOS_USER": "root",
      "SWARM_MESOS_TASK_TIMEOUT": "180s",
      "SWARM_MESOS_OFFER_TIMEOUT": "15s"
  }
}
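As a sanity check before POSTing an app definition like the one above to Marathon, a small sketch (hypothetical helper, using a trimmed copy of the JSON) can verify the ports are internally consistent — the `mesos.port` passed via `--cluster-opt` and the port in `SWARM_HOST` should both appear in `"ports"` so Marathon reserves them on the host:

```python
import json

# Trimmed copy of the app definition above, embedded so the check is self-contained.
app = json.loads("""
{
  "id": "swarmtest",
  "ports": [2376, 3375],
  "requirePorts": true,
  "args": ["manage", "-c", "mesos-experimental",
           "--cluster-opt", "mesos.address=0.0.0.0",
           "--cluster-opt", "mesos.port=3375",
           "zk://mesos-zk.aws.cmcrc.com:2181/mesos"],
  "env": {"SWARM_HOST": ":2376", "SWARM_MESOS_USER": "root"}
}
""")

# The Mesos scheduler port from --cluster-opt must be one of the requested ports.
mesos_port = int(next(a.split("=")[1] for a in app["args"]
                      if a.startswith("mesos.port=")))
assert mesos_port in app["ports"]

# SWARM_HOST should listen on one of the requested host ports as well.
swarm_port = int(app["env"]["SWARM_HOST"].lstrip(":"))
assert swarm_port in app["ports"]
print("app definition is internally consistent")
```

If either assertion fails, Marathon would schedule the app but Swarm's scheduler or API endpoint would be unreachable from outside the host.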
howardshaw commented 8 years ago

Thanks @Spritekin, I followed your advice and can get resources from Mesos now.

[vagrant@node ~]$ docker -H :4375 info
Containers: 2
 Running: 0
 Paused: 0
 Stopped: 2
Images: 9
Server Version: swarm/1.1.2
Role: primary
Strategy: binpack
Filters: health
Offers: 2
  Offer: d1e832d1-5bc8-4598-b432-c64eedfa419c-O210
   └ cpus: 1
   └ mem: 228 MiB
   └ disk: 33.45 GiB
   └ ports: 31000-32000
  Offer: d1e832d1-5bc8-4598-b432-c64eedfa419c-O211
   └ ports: 31000-31304, 31307-32000
   └ mem: 72 MiB
   └ disk: 33.45 GiB
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.4.2-1.el7.elrepo.x86_64
Operating System: linux
Architecture: amd64
CPUs: 1
Total Memory: 300 MiB
Name: node
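Note that the totals Swarm reports are just the sum of the outstanding offers. A quick check against the figures in the output above (second offer lists no cpus):

```python
# Resource figures copied from the `docker info` output above.
offers = [
    {"cpus": 1, "mem_mib": 228},
    {"cpus": 0, "mem_mib": 72},  # second offer advertises no cpus
]

total_cpus = sum(o["cpus"] for o in offers)
total_mem = sum(o["mem_mib"] for o in offers)
print(total_cpus, total_mem)  # 1 300
```

This matches the `CPUs: 1` and `Total Memory: 300 MiB` lines, confirming the header totals are derived from the current offers rather than the cluster's full capacity.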

But when I try to run containers, I hit the "Abnormal executor termination" problem, even though I already set SWARM_MESOS_USER when running swarm manage.

The Marathon JSON file used to create the Swarm app is below:

{
  "id": "swarmtest",
  "cpus": 1,
  "mem": 128,
  "ports": [4375,3375],
  "args": [
      "manage",
      "--strategy","binpack",
      "--filter=health",
      "-c","mesos-experimental",
      "--cluster-opt","mesos.address=0.0.0.0",
      "--cluster-opt","mesos.port=3375",
      "--cluster-opt","mesos.tasktimeout=180s",
      "--cluster-opt","mesos.offertimeout=15s",
      "zk://192.168.33.10:2181/mesos"
      ],
  "instances": 1,
  "acceptedResourceRoles": [ "*" ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "dockerswarm/swarm",
      "network": "HOST",
      "forcePullImage": false
    }
  },
  "env": {
      "HOST_IP": "0.0.0.0",
      "SWARM_HOST": ":4375",
      "SWARM_MESOS_USER": "root",
      "SWARM_MESOS_TASK_TIMEOUT": "180s",
      "SWARM_MESOS_OFFER_TIMEOUT": "15s"
  }
}

Error message:

[vagrant@node ~]$ docker -H :4375 run -idt --cpu-shares 1 busybox
docker: Error response from daemon: Abnormal executor termination : please verify your SWARM_MESOS_USER is correctly set.
See 'docker run --help'.
Spritekin commented 8 years ago

Hmmmmm... that one is interesting. I imagine your Docker container setup is fine, otherwise you wouldn't have been able to run Swarm from Marathon at all. Have you tried running everything as root instead of the vagrant user, even though vagrant may already be a superuser? (Or maybe try "vagrant" as the user.)

I would think it's an OS/Vagrant thing. I use CoreOS, so I can't give you any more advice. Maybe one of the Swarm maintainers can help you further. Sorry.

howardshaw commented 8 years ago

Thanks @Spritekin. If I launch Swarm via the Docker CLI instead of Marathon, as below, it works.

docker run -d -p 4375:2375 -p 3375:3375 -e SWARM_MESOS_USER=root \
dockerswarm/swarm \
--debug manage -c mesos-experimental \
--cluster-opt mesos.address=0.0.0.0 \
--cluster-opt mesos.port=3375 \
--cluster-opt mesos.user=root \
--cluster-opt mesos.tasktimeout=180s \
zk://192.168.33.10:2181/mesos 

Logs below:

[root@master ~]# docker -H :4375 run -idt --cpu-shares 1 busybox
9681fbecd33c6b7e84dda5c81d89060f39610104bf35c20c01385a3bedb0d620
[root@master ~]# docker -H :4375 ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                      PORTS               NAMES
9681fbecd33c        busybox             "sh"                     19 seconds ago      Exited (0) 18 seconds ago                       node/mesos-693e00c2-d625-418f-b152-fc601eee3add-S0.2558fde1-a9e6-41aa-9260-f67fcc6fd77c
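The container name Swarm-on-Mesos reports (the NAMES column above) appears to encode the node, the Mesos agent ID, and a task UUID. A hypothetical helper to split it, assuming the format is `<node>/mesos-<agentID>.<uuid>` as in the output above:

```python
def parse_mesos_name(name: str):
    """Split a Swarm-on-Mesos container name into (node, agent_id, task_uuid).

    Assumes the "<node>/mesos-<agentID>.<uuid>" layout seen in `docker ps`;
    rsplit on the last "." because the agent ID itself contains dashes but
    no dots.
    """
    node, rest = name.split("/", 1)
    rest = rest[len("mesos-"):]
    agent_id, task_uuid = rest.rsplit(".", 1)
    return node, agent_id, task_uuid

name = ("node/mesos-693e00c2-d625-418f-b152-fc601eee3add-S0."
        "2558fde1-a9e6-41aa-9260-f67fcc6fd77c")
print(parse_mesos_name(name))
```

This is only a convenience for reading logs and `docker ps` output; Swarm itself generates these names.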
Spritekin commented 8 years ago

I used to run Marathon from a container as well... try this (change the ZK addresses to yours):

docker run --name marathon -e LIBPROCESS_PORT=9090 --net=host \
  mesosphere/marathon:v0.15.3 \
  --master zk://mesos-zk.aws.cmcrc.com:2181/mesos \
  --zk zk://mesos-zk.aws.cmcrc.com:2181/marathon \
  --checkpoint --task_launch_timeout 300000

And look at that... I forgot I had also added a task launch timeout to Marathon! Maybe that's your solution!

howardshaw commented 8 years ago

OK, thanks @Spritekin. I tried it but still got the same error, so running Swarm from the Docker CLI is the only way that works for me.

I hope Swarm-on-Mesos support becomes production-ready soon.

zhangxu19830126 commented 8 years ago

I also get this error. @howardshaw, did you fix the problem?

nishanttotla commented 6 years ago

Closing due to https://github.com/docker/swarm/pull/2853