deis / deis

Deis v1, the CoreOS and Docker PaaS: Your PaaS. Your Rules.
https://deis.com/docs/
MIT License

How long should deis-store-gateway@1.service stay in activating/start-post? #4646

Closed gred7 closed 8 years ago

gred7 commented 9 years ago

Hello. Somehow I broke my test Deis installation (deis-controller.service was restarting too fast), so I decided to reinstall it.

I did:

  1. deisctl uninstall platform - went OK
  2. deisctl install platform - went OK
  3. deisctl start platform - stuck at deis-store-gateway@1.service: activating/start-post, with no sign of it going further.

I remember that in the beginning it took some 5-10 minutes to start the whole platform. So the question is: how long should/could it take, and is there something I can do to check where it is stuck?

gred7 commented 9 years ago

20 minutes have passed - still activating.

gred7 commented 9 years ago

still activating

iancoffey commented 9 years ago

@gred7 Anything interesting in deisctl journal store-gateway?

If the logs show Waiting for ceph gateway on 8888/tcp... forever, I've found that a deisctl restart store-daemon usually does the trick. I'd love to know the underlying reason for that, but I've yet to find it.
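For reference, the check-and-kick sequence described above boils down to something like this (a sketch, not an official procedure; both commands appear verbatim in this thread):

# Tail the gateway's journal to see whether it is stuck waiting on 8888/tcp
deisctl journal store-gateway

# If it loops on "Waiting for ceph gateway on 8888/tcp...", restart the store daemon
deisctl restart store-daemon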

gred7 commented 9 years ago

This is what I have from ceph -s:

root@b7eb0356c687:/app# ceph -s
2015-10-21 13:49:36.924866 7fd86c115700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858000cd0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd858004f70).fault
2015-10-21 13:49:45.926118 7fd86c216700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858007000 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd85800b2a0).fault
2015-10-21 13:49:51.926730 7fd86c317700 0 -- 10.21.1.96:0/1000324 >> 10.21.1.96:6789/0 pipe(0x7fd858007000 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7fd85800b2a0).fault
^CError connecting to cluster: InterruptedOrTimeoutError

gred7 commented 9 years ago

deisctl journal store-gateway
-- Logs begin at Wed 2015-10-21 04:48:10 UTC. --
Oct 21 13:53:31 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:31 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:32 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:32 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:33 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:33 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:34 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:34 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:35 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:35 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused
Oct 21 13:53:36 ckan2 sh[32677]: Waiting for ceph gateway on 8888/tcp...
Oct 21 13:53:36 ckan2 sh[32677]: curl: (7) Failed connect to localhost:8888; Connection refused

iancoffey commented 9 years ago

@gred7 Are all your store-monitors up and running correctly? Until there are enough of them up to form quorum, you'll see output like what you pasted, since Ceph isn't healthy enough to respond.
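One way to check each monitor individually, without needing quorum, is to query its local admin socket (a sketch; it assumes the stock deis-store-monitor container and the default socket path, /var/run/ceph/ceph-mon.<hostname>.asok, which matches the path that shows up in the journal output later in this thread):

# Run on each CoreOS host; this talks only to the local monitor's admin
# socket, so it works even when the cluster has no quorum
docker exec deis-store-monitor \
  ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status

The mon_status output shows the monitor's rank, its state (probing, electing, leader, peon) and the monmap it believes in, which makes it easier to spot a node that never joined.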

gred7 commented 9 years ago

@iancoffey Could you please give me step-by-step instructions? Everything I have tried so far has not led to anything successful. It is a FRESH installation, and it does not start.

iancoffey commented 9 years ago

@gred7 Sure thing. First, just checking deisctl list and seeing the status of the platform would be helpful, especially making sure the deis-store-* components are all active.

If the monitor components are active, then checking each of them in the list with a docker logs deis-store-monitor would be a good next step.

Also, I noticed from your initial comment that you did an uninstall, install and start - but did you stop the old platform before the install and start? Skipping that could possibly have led to some weirdness.
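Put together, the checks above look roughly like this (a sketch; the unit and container names are the Deis v1 defaults):

# From your workstation: confirm all deis-store-* units are active
deisctl list | grep deis-store

# On any host whose monitor is not "active running": read the container logs
docker logs deis-store-monitor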

gred7 commented 9 years ago

@iancoffey, it was stopped before I started the reinstall... This is what I have now:

[greg@localhost aws]$ deisctl list
UNIT                          MACHINE                    LOAD    ACTIVE      SUB
deis-builder.service          de26925d.../10.21.2.223    loaded  inactive    dead
deis-controller.service       fc92b7c1.../10.21.2.224    loaded  inactive    dead
deis-database.service         de26925d.../10.21.2.223    loaded  inactive    dead
deis-logger.service           de26925d.../10.21.2.223    loaded  inactive    dead
deis-logspout.service         123f6e0f.../10.21.1.96     loaded  inactive    dead
deis-logspout.service         de26925d.../10.21.2.223    loaded  inactive    dead
deis-logspout.service         fc92b7c1.../10.21.2.224    loaded  inactive    dead
deis-publisher.service        123f6e0f.../10.21.1.96     loaded  inactive    dead
deis-publisher.service        de26925d.../10.21.2.223    loaded  inactive    dead
deis-publisher.service        fc92b7c1.../10.21.2.224    loaded  inactive    dead
deis-registry@1.service       123f6e0f.../10.21.1.96     loaded  inactive    dead
deis-router@1.service         123f6e0f.../10.21.1.96     loaded  inactive    dead
deis-router@2.service         de26925d.../10.21.2.223    loaded  inactive    dead
deis-router@3.service         fc92b7c1.../10.21.2.224    loaded  inactive    dead
deis-store-admin.service      123f6e0f.../10.21.1.96     loaded  active      running
deis-store-admin.service      de26925d.../10.21.2.223    loaded  active      running
deis-store-admin.service      fc92b7c1.../10.21.2.224    loaded  active      running
deis-store-daemon.service     123f6e0f.../10.21.1.96     loaded  active      running
deis-store-daemon.service     de26925d.../10.21.2.223    loaded  active      running
deis-store-daemon.service     fc92b7c1.../10.21.2.224    loaded  active      running
deis-store-gateway@1.service  de26925d.../10.21.2.223    loaded  activating  start-post
deis-store-metadata.service   123f6e0f.../10.21.1.96     loaded  active      running
deis-store-metadata.service   de26925d.../10.21.2.223    loaded  active      running
deis-store-metadata.service   fc92b7c1.../10.21.2.224    loaded  active      running
deis-store-monitor.service    123f6e0f.../10.21.1.96     loaded  activating  auto-restart
deis-store-monitor.service    de26925d.../10.21.2.223    loaded  active      running
deis-store-monitor.service    fc92b7c1.../10.21.2.224    loaded  active      running
deis-store-volume.service     123f6e0f.../10.21.1.96     loaded  activating  start-pre
deis-store-volume.service     de26925d.../10.21.2.223    loaded  activating  start-pre
deis-store-volume.service     fc92b7c1.../10.21.2.224    loaded  activating  start-pre

iancoffey commented 9 years ago

@gred7 Looking at the deisctl list output, I think the issue is that this unit isn't started properly:

deis-store-monitor.service 123f6e0f.../10.21.1.96 loaded activating auto-restart

If you are able to get the docker logs from that instance of deis-store-monitor, we might be able to see why it's not starting correctly. If you can get that store-monitor running, the Ceph cluster would probably become healthy and then the rest of the platform could start.

gred7 commented 9 years ago

OK, look... I just did:

  1. deisctl stop platform
  2. deisctl uninstall platform
  3. ssh to one of the machines in the cluster
  4. fleetctl list-units
  5. fleetctl destroy for every unit present
  6. docker ps
  7. docker kill for every container still present
  8. back to my workstation
  9. deisctl install platform
  10. deisctl start platform

● ▴ ■ ■ ● ▴ Starting Deis... ▴ ■ ●
Storage subsystem... deis-store-monitor.service: active/running
deis-store-daemon.service: active/running
deis-store-metadata.service: active/running
deis-store-gateway@1.service: activating/start-post

What should I do? How can I look into the logs?

iancoffey commented 9 years ago

Previously, your issue was probably that the monitor on your 10.21.1.96 server was not started correctly. Now that the platform's been reinstalled, that may have changed, so going back through the process is necessary. Check whether ceph -s is still not functioning. If it still returns only errors, try another deisctl list, and if another of the store-monitor units is listed as not being "loaded active running", you should shell into that instance and do a docker logs deis-store-monitor to see what's up.
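If you would rather not hop between hosts by hand, a small loop over the node IPs from the deisctl list output above runs the same check everywhere (a sketch; it assumes your SSH key is loaded for the core user, as in a standard CoreOS setup):

for ip in 10.21.1.96 10.21.2.223 10.21.2.224; do
  echo "== $ip =="
  ssh core@"$ip" docker logs --tail=50 deis-store-monitor
done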

gred7 commented 9 years ago

core@ckan1 ~ $ fleetctl list-units
UNIT                          MACHINE                    ACTIVE      SUB
deis-builder.service          fc92b7c1.../10.21.2.224    inactive    dead
deis-controller.service       123f6e0f.../10.21.1.96     inactive    dead
deis-database.service         de26925d.../10.21.2.223    inactive    dead
deis-logger.service           de26925d.../10.21.2.223    inactive    dead
deis-logspout.service         123f6e0f.../10.21.1.96     inactive    dead
deis-logspout.service         de26925d.../10.21.2.223    inactive    dead
deis-logspout.service         fc92b7c1.../10.21.2.224    inactive    dead
deis-publisher.service        123f6e0f.../10.21.1.96     inactive    dead
deis-publisher.service        de26925d.../10.21.2.223    inactive    dead
deis-publisher.service        fc92b7c1.../10.21.2.224    inactive    dead
deis-registry@1.service       fc92b7c1.../10.21.2.224    inactive    dead
deis-router@1.service         de26925d.../10.21.2.223    inactive    dead
deis-router@2.service         123f6e0f.../10.21.1.96     inactive    dead
deis-router@3.service         fc92b7c1.../10.21.2.224    inactive    dead
deis-store-daemon.service     123f6e0f.../10.21.1.96     active      running
deis-store-daemon.service     de26925d.../10.21.2.223    active      running
deis-store-daemon.service     fc92b7c1.../10.21.2.224    active      running
deis-store-gateway@1.service  123f6e0f.../10.21.1.96     activating  start-post
deis-store-metadata.service   123f6e0f.../10.21.1.96     active      running
deis-store-metadata.service   de26925d.../10.21.2.223    active      running
deis-store-metadata.service   fc92b7c1.../10.21.2.224    active      running
deis-store-monitor.service    123f6e0f.../10.21.1.96     activating  auto-restart
deis-store-monitor.service    de26925d.../10.21.2.223    active      running
deis-store-monitor.service    fc92b7c1.../10.21.2.224    active      running
deis-store-volume.service     123f6e0f.../10.21.1.96     inactive    dead
deis-store-volume.service     de26925d.../10.21.2.223    inactive    dead
deis-store-volume.service     fc92b7c1.../10.21.2.224    inactive    dead

The monitor is still not starting correctly on .96, and docker logs deis-store-monitor on that host outputs nothing:

~ # docker logs deis-store-monitor
~ #

iancoffey commented 9 years ago

When shelled into that .96 host, you might get something valuable from journalctl -u deis-store-monitor.service
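For example (a sketch of standard journalctl usage; both flags are stock systemd options):

# Show the last 200 lines for the unit, then keep following new output
journalctl -u deis-store-monitor.service -n 200 --no-pager
journalctl -u deis-store-monitor.service -f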

gred7 commented 9 years ago

It always looks like this - do you see anything useful for debugging?

Oct 21 16:29:39 ckan1 sh[3334]: -2/-2 (syslog threshold)
Oct 21 16:29:39 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:39 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:39 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:44 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:44 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:44 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:44 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:50 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:50 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:50 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:50 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:55 ckan1 systemd[1]: deis-store-monitor.service holdoff time over, scheduling restart.
Oct 21 16:29:55 ckan1 systemd[1]: Starting deis-store-monitor...
Oct 21 16:29:55 ckan1 systemd[1]: Started deis-store-monitor.
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.210288 7f297a0ce8c0  0 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-mon, pid 1
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231578 7f297a0ce8c0  0 mon.ckan1 does not exist in monmap, will attempt to join an existing cluster
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231784 7f297a0ce8c0  0 using public_addr 10.21.1.96:6789/0 -> 10.21.1.96:6789/0
Oct 21 16:29:56 ckan1 sh[3948]: starting mon.ckan1 rank -1 at 10.21.1.96:6789/0 mon_data /var/lib/ceph/mon/ceph-ckan1 fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.231837 7f297a0ce8c0  0 starting mon.ckan1 rank -1 at 10.21.1.96:6789/0 mon_data /var/lib/ceph/mon/ceph-ckan1 fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.232266 7f297a0ce8c0  1 mon.ckan1@-1(probing) e0 preinit fsid 38660b10-3da5-4a99-b4cf-dd8f4011c848
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.232982 7f297a0ce8c0  1 mon.ckan1@-1(probing) e0 initial_members ip-10-21-1-96.ec2.internal, filtering seed monmap
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.234528 7f297a0ce8c0  0 mon.ckan1@-1(probing) e0 my rank is now 0 (was -1)
Oct 21 16:29:56 ckan1 sh[3948]: 2015-10-21 16:29:56.234874 7f297a0ce8c0  1 mon.ckan1@0(probing) e0 win_standalone_election
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: In function 'void Monitor::win_standalone_election()' thread 7f297a0ce8c0 time 2015-10-21 16:29:56.237343
Oct 21 16:29:56 ckan1 sh[3948]: mon/Monitor.cc: 1796: FAILED assert(rank == 0)
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7df47b]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 7: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
[... the same FAILED assert(rank == 0) and backtrace are logged a second time, followed by a dump of recent events and a logging-levels table ...]
Oct 21 16:29:56 ckan1 sh[3948]: terminate called after throwing an instance of 'ceph::FailedAssertion'
Oct 21 16:29:56 ckan1 sh[3948]: *** Caught signal (Aborted) **
Oct 21 16:29:56 ckan1 sh[3948]: in thread 7f297a0ce8c0
Oct 21 16:29:56 ckan1 sh[3948]: ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
Oct 21 16:29:56 ckan1 sh[3948]: 1: /usr/bin/ceph-mon() [0x9a98aa]
Oct 21 16:29:56 ckan1 sh[3948]: 2: (()+0x10340) [0x7f29791cb340]
Oct 21 16:29:56 ckan1 sh[3948]: 3: (gsignal()+0x39) [0x7f297766acc9]
Oct 21 16:29:56 ckan1 sh[3948]: 4: (abort()+0x148) [0x7f297766e0d8]
Oct 21 16:29:56 ckan1 sh[3948]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f2977f75535]
Oct 21 16:29:56 ckan1 sh[3948]: 6: (()+0x5e6d6) [0x7f2977f736d6]
Oct 21 16:29:56 ckan1 sh[3948]: 7: (()+0x5e703) [0x7f2977f73703]
Oct 21 16:29:56 ckan1 sh[3948]: 8: (()+0x5e922) [0x7f2977f73922]
Oct 21 16:29:56 ckan1 sh[3948]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7df668]
Oct 21 16:29:56 ckan1 sh[3948]: 10: (Monitor::win_standalone_election()+0x218) [0x5c38d8]
Oct 21 16:29:56 ckan1 sh[3948]: 11: (Monitor::bootstrap()+0x9bb) [0x5c42eb]
Oct 21 16:29:56 ckan1 sh[3948]: 12: (Monitor::init()+0xd5) [0x5c4645]
Oct 21 16:29:56 ckan1 sh[3948]: 13: (main()+0x2470) [0x5769c0]
Oct 21 16:29:56 ckan1 sh[3948]: 14: (__libc_start_main()+0xf5) [0x7f2977655ec5]
Oct 21 16:29:56 ckan1 sh[3948]: 15: /usr/bin/ceph-mon() [0x5984f7]
Oct 21 16:29:56 ckan1 sh[3948]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
[... the signal handler logs the same abort and backtrace again, with its own dump of recent events and a second logging-levels table ...]
Oct 21 16:29:56 ckan1 systemd[1]: deis-store-monitor.service: main process exited, code=exited, status=1/FAILURE
Oct 21 16:29:56 ckan1 systemd[1]: Unit deis-store-monitor.service entered failed state.
Oct 21 16:29:56 ckan1 systemd[1]: deis-store-monitor.service failed.
Oct 21 16:29:56 ckan1 sh[3948]: reraise_fatal: default handler for signal 6 didn't terminate the process?

bacongobbler commented 9 years ago

From the logs, I was able to find http://tracker.ceph.com/issues/3005. From the looks of it, this bug occurs when someone removes a monitor while Ceph is still booting. Unfortunately this is something that's out of our control and likely wasn't caused by the default setup (read: we haven't hit this bug before in our test infrastructure). I don't think there's a better remedy than killing everything and starting from scratch again.

OK, look... I just did:

  1. deisctl stop platform
  2. deisctl uninstall platform
  3. ssh to one of the machines in the cluster
  4. fleetctl list-units
  5. fleetctl destroy for every unit present
  6. docker ps
  7. docker kill for every container still present

Normally, just running deisctl stop platform && deisctl uninstall platform should do all of that for you.
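If you want to verify that stop/uninstall really did clean everything up before reinstalling, a quick check on each CoreOS host might look like this (a sketch; the grep patterns are just illustrative):

# No deis-* units should remain scheduled in fleet
fleetctl list-units | grep deis- || echo "no deis units left"

# No deis containers should still be running under Docker
docker ps | grep deis || echo "no deis containers left"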

gred7 commented 9 years ago

@bacongobbler It is not a normal case, but I can also do it the way you prefer:

[greg@localhost] ~$ deisctl stop platform && deisctl uninstall platform
● ▴ ■ ■ ● ▴ Stopping Deis... ▴ ■ ●
Router mesh... deis-router@3.service: inactive/dead
deis-router@2.service: inactive/dead
deis-router@1.service: inactive/dead
Data plane... deis-publisher.service: inactive/dead
Control plane... deis-builder.service: inactive/dead
deis-registry@1.service: inactive/dead
deis-controller.service: inactive/dead
deis-database.service: inactive/dead
Logging subsystem... deis-logspout.service: inactive/dead
deis-logger.service: inactive/dead
Storage subsystem... The service 'deis-store-gateway@1.service' failed while stopping.
deis-store-volume.service: inactive/dead
deis-store-metadata.service: inactive/dead
deis-store-daemon.service: inactive/dead
deis-store-monitor.service: inactive/dead
Done.

Please run deisctl start platform to restart Deis.

● ▴ ■ ■ ● ▴ Uninstalling Deis... ▴ ■ ●
Router mesh... deis-router@3.service: destroyed
deis-router@2.service: destroyed
deis-router@1.service: destroyed
Data plane... deis-publisher.service: destroyed
Control plane... deis-registry@1.service: destroyed
deis-controller.service: destroyed
deis-database.service: destroyed
deis-builder.service: destroyed
Logging subsystem... deis-logspout.service: destroyed
deis-logger.service: destroyed
Storage subsystem... deis-store-volume.service: destroyed
deis-store-gateway@1.service: destroyed
deis-store-metadata.service: destroyed
deis-store-daemon.service: destroyed
deis-store-monitor.service: destroyed
Done.

[greg@localhost aws]$ deisctl install platform && deisctl start platform
● ▴ ■ ■ ● ▴ Installing Deis... ▴ ■ ●
Storage subsystem... deis-store-daemon.service: loaded
deis-store-volume.service: loaded
deis-store-metadata.service: loaded
deis-store-monitor.service: loaded
deis-store-gateway@1.service: loaded
Logging subsystem... deis-logger.service: loaded
deis-logspout.service: loaded
Control plane... deis-registry@1.service: loaded
deis-builder.service: loaded
deis-database.service: loaded
deis-controller.service: loaded
Data plane... deis-publisher.service: loaded
Router mesh... deis-router@2.service: loaded
deis-router@3.service: loaded
deis-router@1.service: loaded
Done.

Please run deisctl start platform to boot up Deis.

● ▴ ■ ■ ● ▴ Starting Deis... ▴ ■ ●
Storage subsystem... deis-store-monitor.service: active/running
deis-store-daemon.service: active/running
deis-store-metadata.service: active/running
deis-store-gateway@1.service: activating/start-post
(hanging)

gred7 commented 9 years ago

anything?

carmstrong commented 9 years ago

@gred7 The errors from your Ceph daemons make me think that somehow, either the IPs of the hosts changed or the data in Ceph was removed. Ceph is now confused and is panicking on startup.

I'd recommend destroying the hosts and starting over if that is at all possible, and installing the latest version following our documentation.

gred7 commented 9 years ago

@carmstrong But then, what if it happens again after some time? How is the data written in Ceph? Maybe I could restore it, or see what changed?

carmstrong commented 9 years ago

But then, what if it happens again after some time?

Something had to have happened here to cause this - we've never seen this before, and it doesn't occur during normal operation. Did you restart hosts? Remove hosts from the cluster? Did their IPs change? Were the data containers accidentally removed?

gred7 commented 9 years ago

Did you restart hosts? - No. Remove hosts from the cluster? - No. Did their IPs change? - No. Were the data containers accidentally removed? - Maybe; how can I check for sure? Maybe I could restore them?
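One way to check (a sketch; the exact data-container layout depends on your Deis release, so this just lists every deis container, running or stopped, and then inspects the monitor's mounts):

# On each CoreOS host: list all deis containers, including stopped ones
docker ps -a | grep deis

# Inspect the monitor container to see where the /var/lib/ceph/mon data
# (the mon_data path from the journal output) is actually mounted from
docker inspect deis-store-monitor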

gred7 commented 8 years ago

OK, I am closing this issue. It shows that we need a Ceph-less setup anyway.

DenisIzmaylov commented 8 years ago

I have the same troubles. The first time it was a failure for deis-gateway@1; then I fixed that somehow and got a failure for deis-builder. We have to handle those errors better, because I've found more than 10 unresolved issues with the same symptoms.

bacongobbler commented 8 years ago

@DenisIzmaylov We're currently building Deis v2 on top of Kubernetes without Ceph. Most (if not all) of these issues have gone away by using a more sophisticated scheduler with service discovery built in.

If you're up for testing v2, we have a temporary location for v2 docs available at http://docs-v2.readthedocs.org/en/latest/. :)

DenisIzmaylov commented 8 years ago

Wow, thank you for the extra-fast response. When do you plan to release v2? Generally I'm afraid to install betas, and even more so first major releases. I prefer to install one of the first minor releases (e.g. 2.1, 1.3, etc.) when I don't have time for research and contributing.

bacongobbler commented 8 years ago

I'm not aware of what our target is for a stable release. ping @carmstrong

DenisIzmaylov commented 8 years ago

Btw, I use the firewall configuration from Deis: https://raw.githubusercontent.com/deis/deis/master/contrib/util/custom-firewall.sh

Ceph has a Network Configuration Reference: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/

Ceph also has a Preflight Checklist.

Is everything OK here? Because:

wwwprod@www1 ~ $ nse deis-store-monitor
root@www1:/# ceph -s
2016-03-04 18:40:22.365990 7f6c3c132700  0 -- :/1000089 >> 10.91.119.xxx:6789/0 pipe(0x7f6c38064010 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c3805c560).fault
2016-03-04 18:40:22.365990 7f6c3c132700  0 -- :/1000089 >> 10.91.119.xxx:6789/0 pipe(0x7f6c38064010 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c3805c560).fault
2016-03-04 18:40:25.366045 7f6c34ded700  0 -- :/1000089 >> 10.91.119.yyy:6789/0 pipe(0x7f6c2c000c00 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c2c004ef0).fault
2016-03-04 18:40:25.366045 7f6c34ded700  0 -- :/1000089 >> 10.91.119.yyy:6789/0 pipe(0x7f6c2c000c00 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6c2c004ef0).fault
bacongobbler commented 8 years ago

The firewall allows internal communication across the nodes in your fleet cluster, and blocks all incoming connections except on ports 22, 2222, 80 and 443. See https://github.com/deis/deis/blob/master/contrib/util/custom-firewall.sh#L37-L51.

The document you linked just recommends a way to set up the network topology between nodes; in that case, yes, the firewall "respects" that configuration, since all communication between nodes is allowed.
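The pattern the script implements looks roughly like this (a minimal iptables sketch for illustration only; the peer addresses are placeholders, and the real script derives the peer list from the cluster itself):

# Always allow loopback and established connections
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow everything from the other cluster nodes (Ceph's 6789, OSD ports, etc.)
for peer in 10.91.119.xxx 10.91.119.yyy 10.91.119.zzz; do
  iptables -A INPUT -s "$peer" -j ACCEPT
done

# Allow the public-facing ports, then drop everything else
for port in 22 2222 80 443; do
  iptables -A INPUT -p tcp --dport "$port" -j ACCEPT
done
iptables -P INPUT DROP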

DenisIzmaylov commented 8 years ago

I asked this question because when I try to connect between nodes I get a connection refused error:

wwwprod@www1 ~ $ curl 10.91.119.xxx:6789
curl: (7) Failed to connect to 10.91.119.xxx port 6789: Connection refused
wwwprod@www1 ~ $ curl 10.91.119.yyy:6789
curl: (7) Failed to connect to 10.91.119.yyy port 6789: Connection refused
wwwprod@www1 ~ $ curl 10.91.119.zzz:6789
curl: (7) Failed to connect to 10.91.119.zzz port 6789: Connection refused
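Connection refused on 6789 usually means nothing is listening on that port, rather than a firewall block (a dropped packet typically shows up as a timeout instead). A quick way to tell the two apart on the target node (a sketch using standard tools):

# Is anything (i.e. ceph-mon) actually listening on 6789 on this node?
ss -tlnp | grep 6789

# Is the monitor container even running here?
docker ps | grep deis-store-monitor
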
bacongobbler commented 8 years ago

@DenisIzmaylov can you please open a separate issue? Make sure you include what cloud provider you provisioned on, what version, reproduction steps to set up your cluster, etc. It's likely that your fleet cluster did not start properly and this line was not able to determine the IP addresses in your cluster, which would block all communication across the cluster. It's best to start tackling this issue separately so we can better determine your issue. Thanks!
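Before opening that issue, it is also worth confirming that etcd and fleet themselves are healthy, since the firewall script depends on them (a sketch using the standard CoreOS tooling of that era; etcdctl here is the etcd v2 client):

# Every node should appear here; if not, fleet never formed a cluster
fleetctl list-machines

# etcd must report a healthy cluster for fleet and the firewall script to work
etcdctl cluster-health

# Check the host services themselves
systemctl status etcd2 fleet --no-pager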

DenisIzmaylov commented 8 years ago

Yes, sure. Thank you for your fast and detailed answers.

I have uninstalled Deis, destroyed all Docker containers, and removed all Docker images on each node:

deisctl stop platform
deisctl uninstall platform
fleetctl stop deis-store-admin.service
# remove every container (running or stopped) on this node
docker rm $(docker ps -a -q)
# remove all deis images
docker rmi $(docker images | grep "deis" | awk '{print $3}')

And now I will reboot each node and try to install Deis again. If I get errors, I will create a new issue.

UPDATED: nothing. :\

UPDATED2: I have found http://docs.deis.io/en/latest/managing_deis/recovering-ceph-quorum/#recovering-ceph-quorum and am going to follow those instructions.