gred7 closed this issue 8 years ago
20 minutes have passed - still activating
@gred7 Anything interesting in deisctl journal store-gateway?
If the logs show Waiting for ceph gateway on 8888/tcp... forever, I've found that a deisctl restart store-daemon usually does the trick. I'd love to know the underlying reason for that, but I've yet to find it.
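The start-post check behind that message is essentially a curl loop against localhost:8888. A bounded version of that probe (the function name, defaults, and retry cap here are my own, not from the unit file) makes it easy to tell whether the gateway ever comes up after the restart:

```shell
# Bounded readiness probe for the store-gateway. The real unit retries
# forever; this sketch gives up after $tries attempts so you can branch.
wait_for_gateway() {
  host=${1:-localhost}; port=${2:-8888}; tries=${3:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Plain curl: connection refused is non-zero, any HTTP response is "up".
    if curl -s -o /dev/null "http://$host:$port/"; then
      echo "gateway up on $host:$port"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "gateway still down on $host:$port after $tries attempts" >&2
  return 1
}
```

Running wait_for_gateway with no arguments mirrors the unit's own localhost check.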
this is what i have from ceph -s:
root@b7eb0356c687:/app# ceph -s
2015-10-21 13:49:36.924866 7fd86c115700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858000cd0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd858004f70).fault
2015-10-21 13:49:45.926118 7fd86c216700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858007000 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd85800b2a0).fault
2015-10-21 13:49:51.926730 7fd86c317700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858007000 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd85800b2a0).fault
^CError connecting to cluster: InterruptedOrTimeoutError
deisctl journal store-gateway
-- Logs begin at Wed 2015-10-21 04:48:10 UTC. --
Oct 21 13:53:31 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:31 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:32 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:32 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:33 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:33 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:34 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:34 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:35 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:35 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:36 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:36 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
@gred7 Are all your store-monitors up and running correctly? Until there are enough of them up to form quorum, you'll see output like what you pasted, since ceph isn't healthy enough to respond.
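For context on the quorum point: Ceph monitors answer nothing until a strict majority of them agree. The arithmetic for the 3-host cluster in this thread can be sketched with a tiny helper (my own, not a ceph command):

```shell
# Minimum number of monitors needed for quorum: a strict majority.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

quorum_needed 3   # 3-monitor cluster: 2 must be up, so one down is survivable
```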
@iancoffey could you please provide step-by-step instructions? Everything I have tried so far has led nowhere. This is a FRESH installation, and it does not start.
@gred7 Sure thing. First, checking deisctl list and seeing the status of the platform would be helpful, especially making sure the deis-store-* components are all active. If the monitor components are active, then checking each of them in the list with a docker logs deis-store-monitor would be a good next step.
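That deisctl list check can be scripted. A small filter (the function name and the assumed five-column layout UNIT/MACHINE/LOAD/ACTIVE/SUB are my assumptions, based on the output pasted later in this thread) prints only the store units that are not fully up:

```shell
# Print deis-store-* units that are not "active running".
# Expects deisctl list columns: UNIT MACHINE LOAD ACTIVE SUB.
unhealthy_store_units() {
  awk '$1 ~ /^deis-store-/ && !($4 == "active" && $5 == "running") {
    print $1, $2, $4, $5
  }'
}
# Usage: deisctl list | unhealthy_store_units
```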
Also, I noticed from your initial comment that you did an uninstall, install and start - but did you stop the old platform before the install and start? Skipping that could possibly have caused some weirdness.
@iancoffey, it was stopped before i started a reinstall... it is what i have now:

[greg@localhost aws]$ deisctl list
UNIT                            MACHINE                         LOAD    ACTIVE          SUB
deis-builder.service            de26925d.../10.21.2.223         loaded  inactive        dead
deis-controller.service         fc92b7c1.../10.21.2.224         loaded  inactive        dead
deis-database.service           de26925d.../10.21.2.223         loaded  inactive        dead
deis-logger.service             de26925d.../10.21.2.223         loaded  inactive        dead
deis-logspout.service           123f6e0f.../10.21.1.96          loaded  inactive        dead
deis-logspout.service           de26925d.../10.21.2.223         loaded  inactive        dead
deis-logspout.service           fc92b7c1.../10.21.2.224         loaded  inactive        dead
deis-publisher.service          123f6e0f.../10.21.1.96          loaded  inactive        dead
deis-publisher.service          de26925d.../10.21.2.223         loaded  inactive        dead
deis-publisher.service          fc92b7c1.../10.21.2.224         loaded  inactive        dead
deis-registry@1.service         123f6e0f.../10.21.1.96          loaded  inactive        dead
deis-router@1.service           123f6e0f.../10.21.1.96          loaded  inactive        dead
deis-router@2.service           de26925d.../10.21.2.223         loaded  inactive        dead
deis-router@3.service           fc92b7c1.../10.21.2.224         loaded  inactive        dead
deis-store-admin.service        123f6e0f.../10.21.1.96          loaded  active          running
deis-store-admin.service        de26925d.../10.21.2.223         loaded  active          running
deis-store-admin.service        fc92b7c1.../10.21.2.224         loaded  active          running
deis-store-daemon.service       123f6e0f.../10.21.1.96          loaded  active          running
deis-store-daemon.service       de26925d.../10.21.2.223         loaded  active          running
deis-store-daemon.service       fc92b7c1.../10.21.2.224         loaded  active          running
deis-store-gateway@1.service    de26925d.../10.21.2.223         loaded  activating      start-post
deis-store-metadata.service     123f6e0f.../10.21.1.96          loaded  active          running
deis-store-metadata.service     de26925d.../10.21.2.223         loaded  active          running
deis-store-metadata.service     fc92b7c1.../10.21.2.224         loaded  active          running
deis-store-monitor.service      123f6e0f.../10.21.1.96          loaded  activating      auto-restart
deis-store-monitor.service      de26925d.../10.21.2.223         loaded  active          running
deis-store-monitor.service      fc92b7c1.../10.21.2.224         loaded  active          running
deis-store-volume.service       123f6e0f.../10.21.1.96          loaded  activating      start-pre
deis-store-volume.service       de26925d.../10.21.2.223         loaded  activating      start-pre
deis-store-volume.service       fc92b7c1.../10.21.2.224         loaded  activating      start-pre
@gred7 Looking at the deisctl list output, I think the issue is that this unit isn't started properly:
deis-store-monitor.service 123f6e0f.../10.21.1.96 loaded activating auto-restart
If you are able to get the docker logs from that instance of deis-store-monitor, we might be able to see why it's not starting correctly. If you can get that store-monitor running, the ceph cluster would probably become healthy and the rest of the platform could then start.
Previously your issue was probably that the monitor on your 10.21.1.96 server was not started correctly. Now that the platform's been reinstalled, that may have changed, so going back through the process is necessary. Check whether ceph -s is still not functioning. If it still returns only errors, try another deisctl list, and if another of the store-monitor units is listed as not being "loaded active running", you should shell into that instance and do a docker logs deis-store-monitor to see what's up.
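One wrinkle when re-checking: ceph -s blocks indefinitely while there is no quorum (in the paste above it only dies on Ctrl-C). Wrapping it in a timeout turns that hang into a quick yes/no; the wrapper name and the 10-second default are my choices, not part of ceph:

```shell
# "Is ceph answering?" without hanging the shell: ceph -s blocks while
# the monitors lack quorum, so bound it with coreutils timeout.
ceph_healthy() { timeout "${1:-10}" ceph -s >/dev/null 2>&1; }

# Usage: ceph_healthy 5 && echo "cluster reachable" || echo "still down"
```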
core@ckan1 ~ $ fleetctl list-units
UNIT                            MACHINE                         ACTIVE          SUB
deis-builder.service            fc92b7c1.../10.21.2.224         inactive        dead
deis-controller.service         123f6e0f.../10.21.1.96          inactive        dead
deis-database.service           de26925d.../10.21.2.223         inactive        dead
deis-logger.service             de26925d.../10.21.2.223         inactive        dead
deis-logspout.service           123f6e0f.../10.21.1.96          inactive        dead
deis-logspout.service           de26925d.../10.21.2.223         inactive        dead
deis-logspout.service           fc92b7c1.../10.21.2.224         inactive        dead
deis-publisher.service          123f6e0f.../10.21.1.96          inactive        dead
deis-publisher.service          de26925d.../10.21.2.223         inactive        dead
deis-publisher.service          fc92b7c1.../10.21.2.224         inactive        dead
deis-registry@1.service         fc92b7c1.../10.21.2.224         inactive        dead
deis-router@1.service           de26925d.../10.21.2.223         inactive        dead
deis-router@2.service           123f6e0f.../10.21.1.96          inactive        dead
deis-router@3.service           fc92b7c1.../10.21.2.224         inactive        dead
deis-store-daemon.service       123f6e0f.../10.21.1.96          active          running
deis-store-daemon.service       de26925d.../10.21.2.223         active          running
deis-store-daemon.service       fc92b7c1.../10.21.2.224         active          running
deis-store-gateway@1.service    123f6e0f.../10.21.1.96          activating      start-post
deis-store-metadata.service     123f6e0f.../10.21.1.96          active          running
deis-store-metadata.service     de26925d.../10.21.2.223         active          running
deis-store-metadata.service     fc92b7c1.../10.21.2.224         active          running
deis-store-monitor.service      123f6e0f.../10.21.1.96          activating      auto-restart
deis-store-monitor.service      de26925d.../10.21.2.223         active          running
deis-store-monitor.service      fc92b7c1.../10.21.2.224         active          running
deis-store-volume.service       123f6e0f.../10.21.1.96          inactive        dead
deis-store-volume.service       de26925d.../10.21.2.223         inactive        dead
deis-store-volume.service       fc92b7c1.../10.21.2.224         inactive        dead
The monitor is still not starting correctly on .96:
~ # docker logs deis-store-monitor
~ #
(outputs nothing)
When shelled into that .96 host, you might get something valuable from journalctl -u deis-store-monitor.service
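When the journal is as noisy as the dump below, it helps to grep for the few lines that matter. This filter (the function name and the patterns are mine, chosen to match the failure markers that appear in this thread) trims a journalctl dump down to the crash evidence:

```shell
# Keep only the lines that explain why the unit is flapping:
# assertion failures, non-zero exits, and fatal signals.
crash_lines() {
  grep -E 'FAILED assert|status=[0-9]+/FAILURE|Caught signal'
}
# Usage, on the .96 host:
#   journalctl -u deis-store-monitor.service --no-pager -n 500 | crash_lines
```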
Oct 21 16:29:39 ckan1 sh[3334]: -2/-2 (syslog threshold)
Oct 21 16:29:39 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:39 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:39 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:44 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:44 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:44 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:50 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:50 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:50 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:55 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:55 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:55 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.210288 7f297a0ce8c0 0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-mon, pid 1
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231578 7f297a0ce8c0 0 mon.ckan1 does not exist in monmap, will attempt to join an existing cluster
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231784 7f297a0ce8c0 0 using public_addr 10.21.1.96:6789/0 -> 10.21.1.96:6789/0
Oct 21 16:29:56 ckan1 sh[3948]: starting mon.ckan1 rank -1 at 10.21.1.96:6789/0 mon_data /var/lib/ceph/mon/ceph-ckan1 fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231837 7f297a0ce8c0 0 starting mon.ckan1 rank -1 at 10.21.1.96:6789/0 mon_data /var/lib/ceph/mon/ceph-ckan1 fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.232266 7f297a0ce8c0 1 mon.ckan1@-1(probing) e0 preinit fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.232982 7f297a0ce8c0 1 mon.ckan1@-1(probing) e0 initial_members ip-10-21-1-96.ec2.internal, filtering seed monmap
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.234528 7f297a0ce8c0 0 mon.ckan1@-1(probing) e0 my rank is now 0 (was -1)
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.234874 7f297a0ce8c0 1 mon.ckan1@0(probing) e0 win_standalone_election
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: In function 'void Monitor::win_standalone_election()' thread 7f297a0ce8c0 time 2015-10-21 16:29:56.237343
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: 1796: FAILED assert(rank == 0)
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7df47b]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 7: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.238262 7f297a0ce8c0 -1 mon/Monitor.cc: In function 'void Monitor::win_standalone_election()' thread 7f297a0ce8c0 time 2015-10-21 16:29:56.237343
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: 1796: FAILED assert(rank == 0)
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7df47b]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 7: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 21 16:29:56 ckan1 sh[3948]: --- begin dump of recent events ---
Oct 21 16:29:56 ckan1 sh[3948]: -48> 2015-10-21 16:29:56.207960 7f297a0ce8c0 5 asok(0x4d24000) register_command perfcounters_dump hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -47> 2015-10-21 16:29:56.208014 7f297a0ce8c0 5 asok(0x4d24000) register_command 1 hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -46> 2015-10-21 16:29:56.208024 7f297a0ce8c0 5 asok(0x4d24000) register_command perf dump hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -45> 2015-10-21 16:29:56.208032 7f297a0ce8c0 5 asok(0x4d24000) register_command perfcounters_schema hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -44> 2015-10-21 16:29:56.208041 7f297a0ce8c0 5 asok(0x4d24000) register_command 2 hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -43> 2015-10-21 16:29:56.208046 7f297a0ce8c0 5 asok(0x4d24000) register_command perf schema hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -42> 2015-10-21 16:29:56.208054 7f297a0ce8c0 5 asok(0x4d24000) register_command perf reset hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -41> 2015-10-21 16:29:56.208059 7f297a0ce8c0 5 asok(0x4d24000) register_command config show hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -40> 2015-10-21 16:29:56.208067 7f297a0ce8c0 5 asok(0x4d24000) register_command config set hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -39> 2015-10-21 16:29:56.208072 7f297a0ce8c0 5 asok(0x4d24000) register_command config get hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -38> 2015-10-21 16:29:56.208080 7f297a0ce8c0 5 asok(0x4d24000) register_command config diff hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -37> 2015-10-21 16:29:56.208084 7f297a0ce8c0 5 asok(0x4d24000) register_command log flush hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -36> 2015-10-21 16:29:56.208092 7f297a0ce8c0 5 asok(0x4d24000) register_command log dump hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -35> 2015-10-21 16:29:56.208103 7f297a0ce8c0 5 asok(0x4d24000) register_command log reopen hook 0x4cac050
Oct 21 16:29:56 ckan1 sh[3948]: -34> 2015-10-21 16:29:56.210288 7f297a0ce8c0 0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-mon, pid 1
Oct 21 16:29:56 ckan1 sh[3948]: -33> 2015-10-21 16:29:56.212847 7f297a0ce8c0 5 asok(0x4d24000) init /var/run/ceph/ceph-mon.ckan1.asok
Oct 21 16:29:56 ckan1 sh[3948]: -32> 2015-10-21 16:29:56.212864 7f297a0ce8c0 5 asok(0x4d24000) bind_and_listen /var/run/ceph/ceph-mon.ckan1.asok
Oct 21 16:29:56 ckan1 sh[3948]: -31> 2015-10-21 16:29:56.212929 7f297a0ce8c0 5 asok(0x4d24000) register_command 0 hook 0x4ca80b8
Oct 21 16:29:56 ckan1 sh[3948]: -30> 2015-10-21 16:29:56.212942 7f297a0ce8c0 5 asok(0x4d24000) register_command version hook 0x4ca80b8
Oct 21 16:29:56 ckan1 sh[3948]: -29> 2015-10-21 16:29:56.212947 7f297a0ce8c0 5 asok(0x4d24000) register_command git_version hook 0x4ca80b8
Oct 21 16:29:56 ckan1 sh[3948]: -28> 2015-10-21 16:29:56.212957 7f297a0ce8c0 5 asok(0x4d24000) register_command help hook 0x4cac0b0
Oct 21 16:29:56 ckan1 sh[3948]: -27> 2015-10-21 16:29:56.212963 7f297a0ce8c0 5 asok(0x4d24000) register_command get_command_descriptions hook 0x4cac150
Oct 21 16:29:56 ckan1 sh[3948]: -26> 2015-10-21 16:29:56.213027 7f2975447700 5 asok(0x4d24000) entry start
Oct 21 16:29:56 ckan1 sh[3948]: -25> 2015-10-21 16:29:56.231578 7f297a0ce8c0 0 mon.ckan1 does not exist in monmap, will attempt to join an existing cluster
Oct 21 16:29:56 ckan1 sh[3948]: -24> 2015-10-21 16:29:56.231784 7f297a0ce8c0 0 using public_addr 10.21.1.96:6789/0 -> 10.21.1.96:6789/0
Oct 21 16:29:56 ckan1 sh[3948]: -23> 2015-10-21 16:29:56.231837 7f297a0ce8c0 0 starting mon.ckan1 rank -1 at 10.21.1.96:6789/0 mon_data /var/lib/ceph/mon/ceph-ckan1 fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: -22> 2015-10-21 16:29:56.231993 7f297a0ce8c0 1 -- 10.21.1.96:6789/0 learned my addr 10.21.1.96:6789/0
Oct 21 16:29:56 ckan1 sh[3948]: -21> 2015-10-21 16:29:56.232004 7f297a0ce8c0 1 accepter.accepter.bind my_inst.addr is 10.21.1.96:6789/0 need_addr=0
Oct 21 16:29:56 ckan1 sh[3948]: -20> 2015-10-21 16:29:56.232160 7f297a0ce8c0 5 adding auth protocol: cephx
Oct 21 16:29:56 ckan1 sh[3948]: -19> 2015-10-21 16:29:56.232166 7f297a0ce8c0 5 adding auth protocol: cephx
Oct 21 16:29:56 ckan1 sh[3948]: -18> 2015-10-21 16:29:56.232182 7f297a0ce8c0 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info)
Oct 21 16:29:56 ckan1 sh[3948]: -17> 2015-10-21 16:29:56.232187 7f297a0ce8c0 10 log_channel(audit) update_config to_monitors: true to_syslog: false syslog_facility: local0 prio: info)
Oct 21 16:29:56 ckan1 sh[3948]: -16> 2015-10-21 16:29:56.232266 7f297a0ce8c0 1 mon.ckan1@-1(probing) e0 preinit fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: -15> 2015-10-21 16:29:56.232982 7f297a0ce8c0 1 mon.ckan1@-1(probing) e0 initial_members ip-10-21-1-96.ec2.internal, filtering seed monmap
Oct 21 16:29:56 ckan1 sh[3948]: -14> 2015-10-21 16:29:56.232990 7f297a0ce8c0 1 keeping ip-10-21-1-96.ec2.internal 10.21.1.96:6789/0
Oct 21 16:29:56 ckan1 sh[3948]: -13> 2015-10-21 16:29:56.233665 7f297a0ce8c0 2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-ckan1/keyring
Oct 21 16:29:56 ckan1 sh[3948]: -12> 2015-10-21 16:29:56.233710 7f297a0ce8c0 5 asok(0x4d24000) register_command mon_status hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -11> 2015-10-21 16:29:56.233735 7f297a0ce8c0 5 asok(0x4d24000) register_command quorum_status hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -10> 2015-10-21 16:29:56.233755 7f297a0ce8c0 5 asok(0x4d24000) register_command sync_force hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -9> 2015-10-21 16:29:56.233776 7f297a0ce8c0 5 asok(0x4d24000) register_command add_bootstrap_peer_hint hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -8> 2015-10-21 16:29:56.233797 7f297a0ce8c0 5 asok(0x4d24000) register_command quorum enter hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -7> 2015-10-21 16:29:56.233819 7f297a0ce8c0 5 asok(0x4d24000) register_command quorum exit hook 0x4cac1a0
Oct 21 16:29:56 ckan1 sh[3948]: -6> 2015-10-21 16:29:56.233846 7f297a0ce8c0 1 -- 10.21.1.96:6789/0 messenger.start
Oct 21 16:29:56 ckan1 sh[3948]: -5> 2015-10-21 16:29:56.233974 7f297a0ce8c0 2 mon.ckan1@-1(probing) e0 init
Oct 21 16:29:56 ckan1 sh[3948]: -4> 2015-10-21 16:29:56.234449 7f297a0ce8c0 1 accepter.accepter.start
Oct 21 16:29:56 ckan1 sh[3948]: -3> 2015-10-21 16:29:56.234528 7f297a0ce8c0 0 mon.ckan1@-1(probing) e0 my rank is now 0 (was -1)
Oct 21 16:29:56 ckan1 sh[3948]: -2> 2015-10-21 16:29:56.234832 7f297a0ce8c0 1 -- 10.21.1.96:6789/0 mark_down_all
Oct 21 16:29:56 ckan1 sh[3948]: -1> 2015-10-21 16:29:56.234874 7f297a0ce8c0 1 mon.ckan1@0(probing) e0 win_standalone_election
Oct 21 16:29:56 ckan1 sh[3948]: 0> 2015-10-21 16:29:56.238262 7f297a0ce8c0 -1 mon/Monitor.cc: In function 'void Monitor::win_standalone_election()' thread 7f297a0ce8c0 time 2015-10-21 16:29:56.237343
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: 1796: FAILED assert(rank == 0)
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7df47b]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 7: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 21 16:29:56 ckan1 sh[3948]: --- logging levels ---
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 none
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 lockdep
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 context
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 crush
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_balancer
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_locker
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_log
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_log_expire
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_migrator
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 buffer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 timer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 filer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 striper
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 objecter
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rados
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rbd
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rbd_replay
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 journaler
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 objectcacher
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 client
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 osd
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 optracker
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 objclass
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 filestore
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 keyvaluestore
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 journal
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 ms
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mon
Oct 21 16:29:56 ckan1 sh[3948]: 0/10 monc
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 paxos
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 tp
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 auth
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 crypto
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 finisher
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 heartbeatmap
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 perfcounter
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 rgw
Oct 21 16:29:56 ckan1 sh[3948]: 1/10 civetweb
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 javaclient
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 asok
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 throttle
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 0 refs
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 xio
Oct 21 16:29:56 ckan1 sh[3948]: -2/-2 (syslog threshold)
Oct 21 16:29:56 ckan1 sh[3948]: 99/99 (stderr threshold)
Oct 21 16:29:56 ckan1 sh[3948]: max_recent 10000
Oct 21 16:29:56 ckan1 sh[3948]: max_new 1000
Oct 21 16:29:56 ckan1 sh[3948]: log_file
Oct 21 16:29:56 ckan1 sh[3948]: --- end dump of recent events ---
Oct 21 16:29:56 ckan1 sh[3948]: terminate called after throwing an instance of 'ceph::FailedAssertion'
Oct 21 16:29:56 ckan1 sh[3948]: ** Caught signal (Aborted) **
Oct 21 16:29:56 ckan1 sh[3948]: in thread 7f297a0ce8c0
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: /usr/bin/ceph-mon() [0x9a98aa]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (()+0x10340) [0x7f29791cb340]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (gsignal()+0x39) [0x7f297766acc9]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (abort()+0x148) [0x7f297766e0d8]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f2977f75535]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (()+0x5e6d6) [0x7f2977f736d6]
Oct 21 16:29:56 ckan1 sh[3948]: 7: (()+0x5e703) [0x7f2977f73703]
Oct 21 16:29:56 ckan1 sh[3948]: 8: (()+0x5e922) [0x7f2977f73922]
Oct 21 16:29:56 ckan1 sh[3948]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7df668]
Oct 21 16:29:56 ckan1 sh[3948]: 10: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 11: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 12: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 13: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 14: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 15: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.241876 7f297a0ce8c0 -1 ** Caught signal (Aborted) **
Oct 21 16:29:56 ckan1 sh[3948]: in thread 7f297a0ce8c0
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: /usr/bin/ceph-mon() [0x9a98aa]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (()+0x10340) [0x7f29791cb340]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (gsignal()+0x39) [0x7f297766acc9]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (abort()+0x148) [0x7f297766e0d8]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f2977f75535]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (()+0x5e6d6) [0x7f2977f736d6]
Oct 21 16:29:56 ckan1 sh[3948]: 7: (()+0x5e703) [0x7f2977f73703]
Oct 21 16:29:56 ckan1 sh[3948]: 8: (()+0x5e922) [0x7f2977f73922]
Oct 21 16:29:56 ckan1 sh[3948]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7df668]
Oct 21 16:29:56 ckan1 sh[3948]: 10: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 11: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 12: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 13: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 14: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 15: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 21 16:29:56 ckan1 sh[3948]: --- begin dump of recent events ---
Oct 21 16:29:56 ckan1 sh[3948]: 0> 2015-10-21 16:29:56.241876 7f297a0ce8c0 -1 ** Caught signal (Aborted) **
Oct 21 16:29:56 ckan1 sh[3948]: in thread 7f297a0ce8c0
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: /usr/bin/ceph-mon() [0x9a98aa]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (()+0x10340) [0x7f29791cb340]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (gsignal()+0x39) [0x7f297766acc9]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (abort()+0x148) [0x7f297766e0d8]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f2977f75535]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (()+0x5e6d6) [0x7f2977f736d6]
Oct 21 16:29:56 ckan1 sh[3948]: 7: (()+0x5e703) [0x7f2977f73703]
Oct 21 16:29:56 ckan1 sh[3948]: 8: (()+0x5e922) [0x7f2977f73922]
Oct 21 16:29:56 ckan1 sh[3948]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7df668]
Oct 21 16:29:56 ckan1 sh[3948]: 10: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 11: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 12: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 13: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 14: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 15: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
Oct 21 16:29:56 ckan1 sh[3948]: --- logging levels ---
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 none
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 lockdep
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 context
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 crush
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_balancer
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_locker
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_log
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_log_expire
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mds_migrator
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 buffer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 timer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 filer
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 striper
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 1 objecter
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rados
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rbd
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 rbd_replay
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 journaler
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 objectcacher
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 client
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 osd
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 optracker
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 objclass
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 filestore
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 keyvaluestore
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 3 journal
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 ms
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 mon
Oct 21 16:29:56 ckan1 sh[3948]: 0/10 monc
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 paxos
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 5 tp
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 auth
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 crypto
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 finisher
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 heartbeatmap
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 perfcounter
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 rgw
Oct 21 16:29:56 ckan1 sh[3948]: 1/10 civetweb
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 javaclient
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 asok
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 1 throttle
Oct 21 16:29:56 ckan1 sh[3948]: 0/ 0 refs
Oct 21 16:29:56 ckan1 sh[3948]: 1/ 5 xio
Oct 21 16:29:56 ckan1 sh[3948]: -2/-2 (syslog threshold)
Oct 21 16:29:56 ckan1 sh[3948]: 99/99 (stderr threshold)
Oct 21 16:29:56 ckan1 sh[3948]: max_recent 10000
Oct 21 16:29:56 ckan1 sh[3948]: max_new 1000
Oct 21 16:29:56 ckan1 sh[3948]: log_file
Oct 21 16:29:56 ckan1 sh[3948]: --- end dump of recent events ---
Oct 21 16:29:56 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:56 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:56 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:56 ckan1 sh[3948]: reraise_fatal: default handler for signal 6 didn't terminate the process?
From the logs, I was able to find http://tracker.ceph.com/issues/3005. From the looks of it, this bug occurs when someone removes a monitor while ceph is still booting. Unfortunately this is something that's out of our control and likely wasn't caused by the default setup (read: we haven't hit this bug before in our test infrastructure). I don't think there's a better remedy than killing everything and starting from scratch again.
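A sketch of that from-scratch sequence (the deisctl commands are the documented ones; the store-data wipe is an extra step implied by the linked bug, and the /var/lib/deis/store path is taken from the Deis v1 docs, so confirm it on your cluster before deleting anything):

```shell
# Destructive: tears the platform down and reinstalls it. Guarded so that
# running it on a machine without deisctl is a harmless no-op.
reinstall_platform() {
  if ! command -v deisctl >/dev/null 2>&1; then
    echo "deisctl not found; run this from the admin workstation" >&2
    return 1
  fi
  deisctl stop platform
  deisctl uninstall platform
  # On EVERY cluster host, also clear the store data so the stale monmap
  # from the crashed monitor cannot survive the reinstall (verify path!):
  #   sudo rm -rf /var/lib/deis/store
  deisctl install platform
  deisctl start platform
}
```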
ok, look... i just did:
- deisctl stop platform
- deisctl uninstall platform
- ssh to one of the machines in cluster
- fleetctl list-units
- fleetctl destroy for every unit present
- docker ps
- docker kill for every container still present
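The manual sweep above can be condensed into two pipelines. The unit_names helper is my own, and it assumes the unit name is the first whitespace-separated column of fleetctl list-units output:

```shell
# Unique unit names from fleetctl list-units output (first column).
unit_names() { awk 'NF { print $1 }' | sort -u; }

# Destructive cleanup, run on a cluster host:
#   fleetctl list-units --no-legend | unit_names | xargs -r -n1 fleetctl destroy
#   docker ps -q | xargs -r docker kill
```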
Normally just running deisctl stop platform && deisctl uninstall platform should do all of that for you.
@bacongobbler
it is not a normal case, but I can also do it the way you prefer:
[greg@localhost] ~$ deisctl stop platform && deisctl uninstall platform
● ▴ ■
■ ● ▴ Stopping Deis...
▴ ■ ●
Router mesh...
deis-router@3.service: inactive/dead
deis-router@2.service: inactive/dead
deis-router@1.service: inactive/dead
Data plane...
deis-publisher.service: inactive/dead
Control plane...
deis-builder.service: inactive/dead
deis-registry@1.service: inactive/dead
deis-controller.service: inactive/dead
deis-database.service: inactive/dead
Logging subsystem...
deis-logspout.service: inactive/dead
deis-logger.service: inactive/dead
Storage subsystem...
The service 'deis-store-gateway@1.service' failed while stopping.
deis-store-volume.service: inactive/dead
deis-store-metadata.service: inactive/dead
deis-store-daemon.service: inactive/dead
deis-store-monitor.service: inactive/dead
Done.
Please run deisctl start platform
to restart Deis.
● ▴ ■
■ ● ▴ Uninstalling Deis...
▴ ■ ●
Router mesh...
deis-router@3.service: destroyed
deis-router@2.service: destroyed
deis-router@1.service: destroyed
Data plane...
deis-publisher.service: destroyed
Control plane...
deis-registry@1.service: destroyed
deis-controller.service: destroyed
deis-database.service: destroyed
deis-builder.service: destroyed
Logging subsystem...
deis-logspout.service: destroyed
deis-logger.service: destroyed
Storage subsystem...
deis-store-volume.service: destroyed
deis-store-gateway@1.service: destroyed
deis-store-metadata.service: destroyed
deis-store-daemon.service: destroyed
deis-store-monitor.service: destroyed
Done.
[greg@localhost aws]$ deisctl install platform && deisctl start platform
● ▴ ■
■ ● ▴ Installing Deis...
▴ ■ ●
Storage subsystem...
deis-store-daemon.service: loaded
deis-store-volume.service: loaded
deis-store-metadata.service: loaded
deis-store-monitor.service: loaded
deis-store-gateway@1.service: loaded
Logging subsystem...
deis-logger.service: loaded
deis-logspout.service: loaded
Control plane...
deis-registry@1.service: loaded
deis-builder.service: loaded
deis-database.service: loaded
deis-controller.service: loaded
Data plane...
deis-publisher.service: loaded
Router mesh...
deis-router@2.service: loaded
deis-router@3.service: loaded
deis-router@1.service: loaded
Done.
Please run deisctl start platform
to boot up Deis.
● ▴ ■
■ ● ▴ Starting Deis...
▴ ■ ●
Storage subsystem...
deis-store-monitor.service: active/running
deis-store-daemon.service: active/running
deis-store-metadata.service: active/running
deis-store-gateway@1.service: activating/start-post
(hanging)
Any ideas?
@gred7 The errors from your Ceph daemons make me think that somehow, either the IPs of the hosts changed or the data in Ceph was removed. Ceph is now confused and is panicking on startup.
I'd recommend destroying the hosts and starting over if that is at all possible, and installing the latest version following our documentation.
@carmstrong but then, what if it happens again after some time? How is data written in Ceph? Maybe I could restore it or see what changed?
but then, what if it happens again after some time?
Something had to have happened here to cause this - we've never seen this before, and it doesn't occur during normal operation. Did you restart hosts? Remove hosts from the cluster? Did their IPs change? Were the data containers accidentally removed?
Did you restart hosts? - no. Remove hosts from the cluster? - no. Did their IPs change? - no. Were the data containers accidentally removed? - maybe; how can I check for sure? Maybe I could restore them?
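To the "how can I check for sure?" question: one rough way is to look for the deis-store containers on each node. This is a sketch, not an official procedure; `check_store_data` is a hypothetical helper, the `deis-store` name prefix matches the units shown earlier in this thread, and `docker` is assumed to be on PATH.

```shell
#!/bin/sh
# List any deis-store containers (running or stopped) on this node.
# If the store containers holding Ceph's state are gone, the monitors
# have lost their maps and the cluster cannot recover on its own.

# check_store_data: read `docker ps -a` style output on stdin and
# print any deis-store container names found, de-duplicated
check_store_data() {
    grep -o 'deis-store[a-z@.-]*' | sort -u
}

if command -v docker >/dev/null 2>&1; then
    docker ps -a | check_store_data
fi
```

Empty output on a node that should be running the storage subsystem would suggest the containers were indeed removed.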
OK, I am closing this issue. It shows that we need a Ceph-less setup anyway.
I have the same trouble. The first time it was a failure in deis-gateway@1; then I fixed that somehow and got a failure in deis-builder. We have to handle these errors better, because I've found more than 10 unresolved issues with the same symptoms.
@DenisIzmaylov we're currently building Deis v2 on top of Kubernetes without Ceph. Most (if not all) of these issues have gone away by using a more sophisticated scheduler with service discovery built in.
If you're up for testing v2, we have a temporary location for v2 docs available at http://docs-v2.readthedocs.org/en/latest/. :)
Wow, thank you for the extra-fast response. When do you plan to release v2? Generally I'm afraid to install betas, or even the first major releases. I prefer to install one of the first minor releases (e.g. 2.1, 1.3, etc.) if I don't have time for research and contributing.
I'm not aware of what our target is for a stable release. ping @carmstrong
Btw, I have firewall configuration from Deis: https://raw.githubusercontent.com/deis/deis/master/contrib/util/custom-firewall.sh
Ceph has Network Reference: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
Also Ceph has Preflight Checklist.
Is all of that OK here? Because:
wwwprod@www1 ~ $ nse deis-store-monitor
root@www1:/# ceph -s
2016-03-04 18:40:22.365990 7f6c3c132700 0 -- :/1000089 >> 10.91.119.xxx:6789/0 pipe(0x7f6c38064010 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c3805c560).fault
2016-03-04 18:40:22.365990 7f6c3c132700 0 -- :/1000089 >> 10.91.119.xxx:6789/0 pipe(0x7f6c38064010 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c3805c560).fault
2016-03-04 18:40:25.366045 7f6c34ded700 0 -- :/1000089 >> 10.91.119.yyy:6789/0 pipe(0x7f6c2c000c00 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c2c004ef0).fault
2016-03-04 18:40:25.366045 7f6c34ded700 0 -- :/1000089 >> 10.91.119.yyy:6789/0 pipe(0x7f6c2c000c00 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c2c004ef0).fault
The firewall allows internal communication across the nodes in your fleet cluster, and blocks all incoming connections except on ports 22,2222,80 and 443. See https://github.com/deis/deis/blob/master/contrib/util/custom-firewall.sh#L37-L51.
The document you linked just recommends a way to set up the network topology between nodes; in that case, yes, the firewall "respects" that configuration, since all communication between nodes is allowed.
I asked this question because when I try to connect between nodes I get a connection refused error:
wwwprod@www1 ~ $ curl 10.91.119.xxx:6789
curl: (7) Failed to connect to 10.91.119.xxx port 6789: Connection refused
wwwprod@www1 ~ $ curl 10.91.119.yyy:6789
curl: (7) Failed to connect to 10.91.119.yyy port 6789: Connection refused
wwwprod@www1 ~ $ curl 10.91.119.zzz:6789
curl: (7) Failed to connect to 10.91.119.zzz port 6789: Connection refused
@DenisIzmaylov can you please open a separate issue? Make sure you include what cloud provider you provisioned on, what version, reproduction steps to setup your cluster, etc. It's likely that your fleet cluster did not start properly and this line was not able to determine the IP addresses in your cluster, which would block all communication across the cluster. It's best to start tackling this issue separately so we can better determine your issue. Thanks!
Yes, sure. Thank you for your fast and detailed answers.
I have uninstalled Deis, destroyed all Docker containers, removed all Docker images on each node:
deisctl stop platform
deisctl uninstall platform
fleetctl stop deis-store-admin.service
docker rm $(docker ps -a -q)
docker rmi $(docker images | grep "deis" | awk '{print $3}')
And now I will reboot each node and try to install Deis again. If I get errors, I will create a new issue.
UPDATED: nothing. :\
UPDATED2: I have found http://docs.deis.io/en/latest/managing_deis/recovering-ceph-quorum/#recovering-ceph-quorum and going to follow this instruction.
Hello. Somehow I screwed up my test Deis installation (deis-controller.service was restarting too fast), so I've decided to reinstall it.
I did:
- deisctl uninstall platform - went OK
- deisctl install platform - went OK
- deisctl start platform - stuck at deis-store-gateway@1.service: activating/start-post, and there's no way it goes further.

I remember that in the beginning it took some 5-10 minutes to start the whole platform. So the question is: how long should/could it take, and is there something I can do to check where it is stuck?
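To see where the start-post hook is stuck, the gateway's journal is the first place to look, as mentioned earlier in this thread: the hook loops printing "Waiting for ceph gateway on 8888/tcp..." until the gateway answers. A sketch of filtering for that loop (`find_wait_lines` is a hypothetical helper; `deisctl` is assumed to be on PATH, and note that `deisctl journal` may follow the log, so you may need Ctrl-C):

```shell
#!/bin/sh
# Filter the store-gateway journal down to the start-post wait loop.

# find_wait_lines: keep only the gateway wait-loop messages
find_wait_lines() {
    grep 'Waiting for ceph gateway'
}

if command -v deisctl >/dev/null 2>&1; then
    deisctl journal store-gateway | find_wait_lines | tail -n 5
fi
```

If those lines repeat forever, the gateway cannot reach Ceph; in that case checking `ceph -s` from inside a store container (as shown earlier in this thread) is the next step.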