d2iq-archive / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0

mesos-dns is not working on Ubuntu 16.04.1 LTS #482

Closed groyee closed 7 years ago

groyee commented 8 years ago

After struggling for 3 days with the DC/OS installation, I made it to the dashboard, but nothing really works. It shows that I have no nodes, some DC/OS services are in a red state, and I have no public DNS on the DC/OS master.

I followed this guide:

https://dcos.io/docs/1.8/administration/installing/custom/advanced/

I used the following file to create my cluster:


bootstrap_url: http://192.168.0.11:8080
cluster_name: 'dcos-coralogix-test'
exhibitor_storage_backend: static
ip_detect_filename: /genconf/ip-detect
master_discovery: static
log_directory: /home/docker-user/genconf/logs
master_list:
resolvers:
use_proxy: 'false'

Logs from the master:

journalctl -u dcos-mesos-dns -b

Oct 15 19:09:32 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:32 resolver.go:162: Warning: Error generating records: No more masters to try; keeping old DNS state
Oct 15 19:09:32 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:32 generator.go:254: Get http://127.0.1.1:5050/master/state.json: dial tcp 127.0.1.1:5050: getsockopt: connection refused
Oct 15 19:09:32 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:32 generator.go:210: Failed to fetch state.json from leader. Error: Get http://127.0.1.1:5050/master/state.json: dial tcp 127.0.1.1:5050: getsockop
Oct 15 19:09:32 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:32 generator.go:214: Falling back to remaining masters: [192.168.0.24:5050 192.168.0.22:5050]
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 generator.go:254: Get http://dcosmaster2:5050/master/state.json: dial tcp: lookup dcosmaster2 on 198.51.100.3:53: read udp 198.51.100.3:60780->19
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 generator.go:227: Failed to fetch state.json - trying next one. Error: Get http://dcosmaster2:5050/master/state.json: dial tcp: lookup dcosmaste
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 generator.go:254: Get http://192.168.0.22:5050/master/state.json: dial tcp 192.168.0.22:5050: getsockopt: connection refused
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 generator.go:227: Failed to fetch state.json - trying next one. Error: Get http://192.168.0.22:5050/master/state.json: dial tcp 192.168.0.22:505
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 generator.go:173: Failed to fetch state.json. Error: No more masters eligible for state.json query
Oct 15 19:09:41 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:09:41 resolver.go:162: Warning: Error generating records: No more masters eligible for state.json query; keeping old DNS state
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:254: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: getsockopt: connection refused
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:210: Failed to fetch state.json from leader. Error: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: get
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:214: Falling back to remaining masters: [192.168.0.12:5050]
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:254: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: getsockopt: connection refused
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:227: Failed to fetch state.json - trying next one. Error: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:505
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 generator.go:173: Failed to fetch state.json. Error: No more masters eligible for state.json query
Oct 15 19:10:51 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:51 resolver.go:162: Warning: Error generating records: No more masters eligible for state.json query; keeping old DNS state
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:254: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: getsockopt: connection refused
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:210: Failed to fetch state.json from leader. Error: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: get
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:173: Failed to fetch state.json. Error: No more masters to try
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 resolver.go:162: Warning: Error generating records: No more masters to try; keeping old DNS state
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:254: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: getsockopt: connection refused
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:210: Failed to fetch state.json from leader. Error: Get http://192.168.0.24:5050/master/state.json: dial tcp 192.168.0.24:5050: get
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 generator.go:173: Failed to fetch state.json. Error: No more masters to try
Oct 15 19:10:52 dcosmaster1 mesos-dns[9384]: ERROR: 2016/10/15 19:10:52 resolver.go:162: Warning: Error generating records: No more masters to try; keeping old DNS state
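As a rough triage aid, the refused endpoints in the mesos-dns log above can be tallied with a short script. This is only a sketch and not part of mesos-dns: the function name `refused_endpoints` and the regular expression are hypothetical, based solely on the error text as it appears in this journal output.

```python
import re
from collections import Counter

# Matches the error text as printed in the journal output above, e.g.
# "dial tcp 192.168.0.24:5050: getsockopt: connection refused".
# Assumption: other mesos-dns versions may word this error differently.
REFUSED = re.compile(r"dial tcp (\S+?): getsockopt: connection refused")


def refused_endpoints(log_text: str) -> Counter:
    """Count 'connection refused' errors per host:port in the log text."""
    return Counter(REFUSED.findall(log_text))


if __name__ == "__main__":
    # Two entries copied from the log above.
    sample = (
        "generator.go:254: Get http://127.0.1.1:5050/master/state.json: "
        "dial tcp 127.0.1.1:5050: getsockopt: connection refused\n"
        "generator.go:254: Get http://192.168.0.22:5050/master/state.json: "
        "dial tcp 192.168.0.22:5050: getsockopt: connection refused\n"
    )
    for endpoint, count in refused_endpoints(sample).most_common():
        print(endpoint, count)
```

Feeding it the full `journalctl -u dcos-mesos-dns -b` output shows which master addresses mesos-dns tried and could not reach on port 5050. Truncated entries and DNS lookup failures are not counted.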

tail -f /var/log

Oct 15 19:18:38 dcosmaster1 docker[938]: time="2016-10-15T19:18:38.237797700Z" level=info msg="2016/10/15 19:18:38 [INFO] memberlist: Suspect ws1 has failed, no acks received\n"
Oct 15 19:18:39 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49922
Oct 15 19:18:39 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49922
Oct 15 19:18:39 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49922 (no session established for client)
Oct 15 19:18:39 dcosmaster1 java[9532]: [2016-10-15 19:18:39,735] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-6)
Oct 15 19:18:40 dcosmaster1 systemd[1]: dcos-navstar.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:40 dcosmaster1 systemd[1]: dcos-minuteman.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:40 dcosmaster1 systemd[1]: Stopped Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:41 dcosmaster1 systemd[1]: Starting Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service...
Oct 15 19:18:41 dcosmaster1 systemd[1]: Stopped Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:41 dcosmaster1 check-time[11580]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:41 dcosmaster1 check-time[11580]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:41 dcosmaster1 check-time[11580]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:41 dcosmaster1 systemd[1]: Starting Navstar: A distributed systems & network overlay orchestration engine...
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-minuteman.service: Control process exited, code=exited status=1
Oct 15 19:18:41 dcosmaster1 systemd[1]: Failed to start Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-minuteman.service: Unit entered failed state.
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-minuteman.service: Failed with result 'exit-code'.
Oct 15 19:18:41 dcosmaster1 check-time[11582]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:41 dcosmaster1 check-time[11582]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:41 dcosmaster1 check-time[11582]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-navstar.service: Control process exited, code=exited status=1
Oct 15 19:18:41 dcosmaster1 systemd[1]: Failed to start Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-navstar.service: Unit entered failed state.
Oct 15 19:18:41 dcosmaster1 systemd[1]: dcos-navstar.service: Failed with result 'exit-code'.
Oct 15 19:18:41 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49928
Oct 15 19:18:41 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49928
Oct 15 19:18:41 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49928 (no session established for client)
Oct 15 19:18:41 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49930
Oct 15 19:18:41 dcosmaster1 java[1887]: Processing srvr command from /127.0.0.1:49930
Oct 15 19:18:41 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49930 (no session established for client)
Oct 15 19:18:41 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49932
Oct 15 19:18:41 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49932
Oct 15 19:18:41 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49932 (no session established for client)
Oct 15 19:18:41 dcosmaster1 docker[938]: time="2016-10-15T19:18:41.413913900Z" level=info msg="2016/10/15 19:18:41 [INFO] memberlist: Marking ws1 as failed, suspect timeout reached\n"
Oct 15 19:18:41 dcosmaster1 docker[938]: time="2016-10-15T19:18:41.414000200Z" level=info msg="2016/10/15 19:18:41 [INFO] serf: EventMemberFailed: ws1 192.168.0.14\n"
Oct 15 19:18:44 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49936
Oct 15 19:18:44 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49936
Oct 15 19:18:44 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49936 (no session established for client)
Oct 15 19:18:45 dcosmaster1 docker[938]: time="2016-10-15T19:18:45.108555000Z" level=info msg="2016/10/15 19:18:45 [INFO] serf: EventMemberJoin: ws1 192.168.0.14\n"
Oct 15 19:18:45 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49940
Oct 15 19:18:45 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49940
Oct 15 19:18:45 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49940 (no session established for client)
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-minuteman.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-navstar.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:46 dcosmaster1 systemd[1]: Stopped Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:46 dcosmaster1 systemd[1]: Starting Navstar: A distributed systems & network overlay orchestration engine...
Oct 15 19:18:46 dcosmaster1 check-time[11595]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:46 dcosmaster1 check-time[11595]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:46 dcosmaster1 systemd[1]: Stopped Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:46 dcosmaster1 systemd[1]: Starting Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service...
Oct 15 19:18:46 dcosmaster1 check-time[11595]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-navstar.service: Control process exited, code=exited status=1
Oct 15 19:18:46 dcosmaster1 systemd[1]: Failed to start Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-navstar.service: Unit entered failed state.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-navstar.service: Failed with result 'exit-code'.
Oct 15 19:18:46 dcosmaster1 check-time[11597]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:46 dcosmaster1 check-time[11597]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:46 dcosmaster1 check-time[11597]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-minuteman.service: Control process exited, code=exited status=1
Oct 15 19:18:46 dcosmaster1 systemd[1]: Failed to start Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-minuteman.service: Unit entered failed state.
Oct 15 19:18:46 dcosmaster1 systemd[1]: dcos-minuteman.service: Failed with result 'exit-code'.
Oct 15 19:18:47 dcosmaster1 java[9532]: [2016-10-15 19:18:47,496] INFO 192.168.0.12 - - [15/Oct/2016:19:18:47 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simple 0.1.0 (Lua)" (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$2351f84e:qtp1621202291-35)
Oct 15 19:18:49 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49946
Oct 15 19:18:49 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49946
Oct 15 19:18:49 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49946 (no session established for client)
Oct 15 19:18:50 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49948
Oct 15 19:18:50 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49948
Oct 15 19:18:50 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49948 (no session established for client)
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-navstar.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-minuteman.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:51 dcosmaster1 systemd[1]: Stopped Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:51 dcosmaster1 check-time[11609]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:51 dcosmaster1 check-time[11609]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:51 dcosmaster1 systemd[1]: Starting Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service...
Oct 15 19:18:51 dcosmaster1 systemd[1]: Stopped Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:51 dcosmaster1 systemd[1]: Starting Navstar: A distributed systems & network overlay orchestration engine...
Oct 15 19:18:51 dcosmaster1 check-time[11611]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:51 dcosmaster1 check-time[11611]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:51 dcosmaster1 check-time[11611]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:51 dcosmaster1 check-time[11609]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-navstar.service: Control process exited, code=exited status=1
Oct 15 19:18:51 dcosmaster1 systemd[1]: Failed to start Navstar: A distributed systems & network overlay orchestration engine.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-navstar.service: Unit entered failed state.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-navstar.service: Failed with result 'exit-code'.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-minuteman.service: Control process exited, code=exited status=1
Oct 15 19:18:51 dcosmaster1 systemd[1]: Failed to start Layer 4 Load Balancer: DC/OS Layer 4 Load Balancing Service.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-minuteman.service: Unit entered failed state.
Oct 15 19:18:51 dcosmaster1 systemd[1]: dcos-minuteman.service: Failed with result 'exit-code'.
Oct 15 19:18:52 dcosmaster1 systemd[1]: dcos-metronome.service: Service hold-off time over, scheduling restart.
Oct 15 19:18:52 dcosmaster1 systemd[1]: Stopped Jobs Service: DC/OS Metronome.
Oct 15 19:18:52 dcosmaster1 systemd[1]: Starting Jobs Service: DC/OS Metronome...
Oct 15 19:18:52 dcosmaster1 check-time[11617]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:52 dcosmaster1 check-time[11617]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:52 dcosmaster1 check-time[11617]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:52 dcosmaster1 systemd[1]: dcos-metronome.service: Control process exited, code=exited status=1
Oct 15 19:18:52 dcosmaster1 systemd[1]: Failed to start Jobs Service: DC/OS Metronome.
Oct 15 19:18:52 dcosmaster1 systemd[1]: dcos-metronome.service: Unit entered failed state.
Oct 15 19:18:52 dcosmaster1 systemd[1]: dcos-metronome.service: Failed with result 'exit-code'.
Oct 15 19:18:53 dcosmaster1 java[1887]: Accepted socket connection from /127.0.0.1:49956
Oct 15 19:18:53 dcosmaster1 java[1887]: Processing ruok command from /127.0.0.1:49956
Oct 15 19:18:53 dcosmaster1 java[1887]: Closed socket connection for client /127.0.0.1:49956 (no session established for client)
Oct 15 19:18:54 dcosmaster1 docker[938]: time="2016-10-15T19:18:54.236328700Z" level=error msg="2016/10/15 19:18:54 [ERR] memberlist: Failed TCP fallback ping: read tcp 192.168.0.24:55384->192.168.0.14:7946: i/o timeout\n"

journalctl -u dcos-mesos-master -b

Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.498433 9836 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 4.9819ms
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.498457 9836 replica.cpp:712] Persisted action at 4
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.499660 9831 replica.cpp:691] Replica received learned notice for position 4 from @0.0.0.0:0
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.504839 9831 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 5.1449ms
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.504926 9831 leveldb.cpp:399] Deleting ~2 keys from leveldb took 32300ns
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.504945 9831 replica.cpp:712] Persisted action at 4
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.504956 9831 replica.cpp:697] Replica learned TRUNCATE action at position 4
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.789049 9831 manager.cpp:556] Insert following iptables rule for overlay dcos: ipset add -exist overlay 9.0.0.0/8 nomatch && iptables -t nat -C POSTROUTIN
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.889747 9831 manager.cpp:328] Sending agent registered message to overlay-master@192.168.0.12:5050
Oct 15 19:11:37 dcosmaster1 mesos-master[9819]: I1015 19:11:37.891134 9831 manager.cpp:340] Received agent registered acknowledgment from overlay-master@192.168.0.12:5050
Oct 15 19:11:49 dcosmaster1 mesos-master[9819]: I1015 19:11:49.949265 9836 replica.cpp:537] Replica received write request for position 5 from (568)@192.168.0.12:5050
Oct 15 19:11:49 dcosmaster1 mesos-master[9819]: I1015 19:11:49.954483 9836 leveldb.cpp:341] Persisting action (508 bytes) to leveldb took 5.1821ms
Oct 15 19:11:49 dcosmaster1 mesos-master[9819]: I1015 19:11:49.954514 9836 replica.cpp:712] Persisted action at 5
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.014528 9832 replica.cpp:691] Replica received learned notice for position 5 from @0.0.0.0:0
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.019475 9832 leveldb.cpp:341] Persisting action (510 bytes) to leveldb took 4.9177ms
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.019526 9832 replica.cpp:712] Persisted action at 5
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.019538 9832 replica.cpp:697] Replica learned APPEND action at position 5
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.022761 9833 replica.cpp:537] Replica received write request for position 6 from (573)@192.168.0.12:5050
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.029242 9833 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 6165us
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.029261 9833 replica.cpp:712] Persisted action at 6
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.030751 9836 replica.cpp:691] Replica received learned notice for position 6 from @0.0.0.0:0
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.035810 9836 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 5.0391ms
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.035877 9836 leveldb.cpp:399] Deleting ~2 keys from leveldb took 28600ns
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.035892 9836 replica.cpp:712] Persisted action at 6
Oct 15 19:11:50 dcosmaster1 mesos-master[9819]: I1015 19:11:50.035900 9836 replica.cpp:697] Replica learned TRUNCATE action at position 6
Oct 15 19:11:57 dcosmaster1 mesos-master[9819]: I1015 19:11:57.259073 9832 manager.cpp:409] Overlay agent is already in REGISTERED state

journalctl -u dcos-marathon -b

Oct 15 19:35:31 dcosmaster1 java[9532]: [2016-10-15 19:35:31,488] INFO 192.168.0.12 - - [15/Oct/2016:19:35:31 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:35:41 dcosmaster1 java[9532]: [2016-10-15 19:35:41,095] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-39)
Oct 15 19:35:41 dcosmaster1 java[9532]: [2016-10-15 19:35:41,621] INFO 192.168.0.22 - - [15/Oct/2016:19:35:41 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:35:47 dcosmaster1 java[9532]: [2016-10-15 19:35:47,020] INFO 127.0.0.1 - - [15/Oct/2016:19:35:47 +0000] "GET //127.0.0.1/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simp
Oct 15 19:35:56 dcosmaster1 java[9532]: [2016-10-15 19:35:56,115] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-34)
Oct 15 19:36:02 dcosmaster1 java[9532]: [2016-10-15 19:36:02,007] INFO 192.168.0.12 - - [15/Oct/2016:19:36:02 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:36:11 dcosmaster1 java[9532]: [2016-10-15 19:36:11,135] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-39)
Oct 15 19:36:11 dcosmaster1 java[9532]: [2016-10-15 19:36:11,620] INFO 192.168.0.22 - - [15/Oct/2016:19:36:11 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:36:17 dcosmaster1 java[9532]: [2016-10-15 19:36:17,527] INFO 127.0.0.1 - - [15/Oct/2016:19:36:17 +0000] "GET //127.0.0.1/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simp
Oct 15 19:36:26 dcosmaster1 java[9532]: [2016-10-15 19:36:26,155] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-39)
Oct 15 19:36:32 dcosmaster1 java[9532]: [2016-10-15 19:36:32,501] INFO 192.168.0.12 - - [15/Oct/2016:19:36:32 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:36:41 dcosmaster1 java[9532]: [2016-10-15 19:36:41,175] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-43)
Oct 15 19:36:42 dcosmaster1 java[9532]: [2016-10-15 19:36:42,599] INFO 192.168.0.22 - - [15/Oct/2016:19:36:42 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:36:48 dcosmaster1 java[9532]: [2016-10-15 19:36:48,021] INFO 127.0.0.1 - - [15/Oct/2016:19:36:48 +0000] "GET //127.0.0.1/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simp
Oct 15 19:36:56 dcosmaster1 java[9532]: [2016-10-15 19:36:56,195] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-36)
Oct 15 19:37:03 dcosmaster1 java[9532]: [2016-10-15 19:37:03,006] INFO 192.168.0.12 - - [15/Oct/2016:19:37:03 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:37:11 dcosmaster1 java[9532]: [2016-10-15 19:37:11,215] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-35)
Oct 15 19:37:13 dcosmaster1 java[9532]: [2016-10-15 19:37:13,613] INFO 192.168.0.22 - - [15/Oct/2016:19:37:13 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:37:18 dcosmaster1 java[9532]: [2016-10-15 19:37:18,282] INFO 127.0.0.1 - - [15/Oct/2016:19:37:18 +0000] "GET //127.0.0.1/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simp
Oct 15 19:37:26 dcosmaster1 java[9532]: [2016-10-15 19:37:26,235] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-36)
Oct 15 19:37:33 dcosmaster1 java[9532]: [2016-10-15 19:37:33,493] INFO 192.168.0.12 - - [15/Oct/2016:19:37:33 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:37:41 dcosmaster1 java[9532]: [2016-10-15 19:37:41,254] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-13)
Oct 15 19:37:43 dcosmaster1 java[9532]: [2016-10-15 19:37:43,690] INFO 192.168.0.22 - - [15/Oct/2016:19:37:43 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest
Oct 15 19:37:48 dcosmaster1 java[9532]: [2016-10-15 19:37:48,782] INFO 127.0.0.1 - - [15/Oct/2016:19:37:48 +0000] "GET //127.0.0.1/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Resty/HTTP-Simp
Oct 15 19:37:56 dcosmaster1 java[9532]: [2016-10-15 19:37:56,275] INFO detected skipped heartbeat (mesosphere.marathon.core.heartbeat.Heartbeat$:marathon-akka.actor.default-dispatcher-44)
Oct 15 19:38:04 dcosmaster1 java[9532]: [2016-10-15 19:38:04,003] INFO 192.168.0.12 - - [15/Oct/2016:19:38:04 +0000] "GET //192.168.0.24:8080/v2/apps?embed=apps.tasks&label=DCOS_SERVICE_NAME HTTP/1.1" 200 11 "-" "Rest

ChiefAlexander commented 8 years ago

Not sure if you have made any progress on this. I think the important log messages that you are seeing are:

Oct 15 19:18:51 dcosmaster1 check-time[11609]: Checking whether time is synchronized using the kernel adjtimex API.
Oct 15 19:18:51 dcosmaster1 check-time[11609]: Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Oct 15 19:18:51 dcosmaster1 check-time[11611]: Time is not synchronized / marked as bad by the kernel.
Oct 15 19:18:51 dcosmaster1 check-time[11609]: Time is not synchronized / marked as bad by the kernel.

Have you synchronized the time on your node (with ntpd, chrony, systemd-timesyncd, etc.)?
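One quick way to check this on a systemd host is to read the synchronization line from `timedatectl status`. The sketch below is an illustration only, not the DC/OS `check-time` tool: it assumes `timedatectl` is installed, that the output contains a line labelled either `NTP synchronized` or `System clock synchronized` (the wording differs between systemd releases), and the function name `clock_is_synchronized` is hypothetical.

```python
import subprocess

# Labels printed by `timedatectl status`. Assumption: the wording differs
# between systemd releases, so both spellings are accepted.
SYNC_LABELS = ("NTP synchronized", "System clock synchronized")


def clock_is_synchronized(status_text: str) -> bool:
    """Return True if timedatectl output reports a synchronized clock."""
    for line in status_text.splitlines():
        # Split on the first colon only, so time values stay intact.
        label, _, value = line.partition(":")
        if label.strip() in SYNC_LABELS:
            return value.strip() == "yes"
    return False  # label missing: treat the clock as not synchronized


if __name__ == "__main__":
    # check=False: a failing timedatectl yields empty output, reported below
    # as "NOT synchronized" rather than raising.
    out = subprocess.run(
        ["timedatectl", "status"], capture_output=True, text=True, check=False
    ).stdout
    print("synchronized" if clock_is_synchronized(out) else "NOT synchronized")
```

If this reports "NOT synchronized" on a master, the `check-time` failures in the log would be expected to persist until one of the time services mentioned above is running and has synced.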

tobilg commented 7 years ago

DC/OS installation is not officially supported on Ubuntu. Apart from that, the NTP problems that @ChiefAlexander pointed out are probably the reason for the failure.

jdef commented 7 years ago

Appears to be unrelated to mesos-dns; closing out. Please re-open if you disagree.