Regression: 'unable to determine where zookeeper is located' exception when Zookeeper info is in /etc/mesos/zk

pferrot commented 8 years ago

It seems there is a regression with the recent autodeskcloud/pod:1.0.6 image. The same image tag used to work very well in ECS, but now it crashes at startup with the following exception:

2015-10-24 13:06:39,634 - CRITICAL - unexpected condition -> ..eworks/marathon.py (228) -> AssertionError (unable to determine where zookeeper is located (unsupported/bogus mesos setup ?))

I could fix it in my fork by adding the "assert hints['zk']" snippet in /ochopod/frameworks/marathon.py (see below and https://github.com/pferrot/ochopod/commit/a436f63a87ec26e46e826a9eabda7c883b25c8d9).

I am not sure about the root cause, but it seems that the _1() method no longer raises an exception when no zookeeper info can be found (as it is supposed to) and returns an empty string instead. As a result the loop breaks right after _1(), and _2() and _3() are never executed.

#
# - depending on how the slave has been installed we might have to look in various places
#   to find out what our zookeeper connection string is
# - warning, a URL like format such as zk://<ip:port>,..,<ip:port>/mesos is used
# - just keep the ip & port part and discard the rest
#
for method in [_1, _2, _3]:
    try:
        hints['zk'] = method()
        assert hints['zk']
        break

    except:
        pass
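
To illustrate the failure mode outside of Ochopod, here is a minimal standalone sketch. It is not the actual marathon.py code: strip_zk_url(), first_hint() and the three lookup_* functions are hypothetical stand-ins, and the IP addresses are made up. It only shows why asserting on the result lets the loop fall through to the next lookup when an earlier one returns an empty string, and how the ip & port part of a zk:// URL can be kept.

```python
def strip_zk_url(url):
    """Reduce 'zk://host:port,...,host:port/path' to 'host:port,...,host:port'."""
    prefix = 'zk://'
    assert url.startswith(prefix), 'not a zk:// URL'
    return url[len(prefix):].split('/')[0]


def first_hint(methods):
    """Return the first non-empty result, skipping lookups that raise or come back empty."""
    for method in methods:
        try:
            value = method()
            # - without this check an empty string would be accepted as a valid result
            assert value, 'empty result'
            return value
        except Exception:
            continue
    raise AssertionError('unable to determine where zookeeper is located')


def lookup_empty():
    # hypothetical stand-in for a lookup that silently returns nothing
    return ''


def lookup_failing():
    # hypothetical stand-in for a lookup that raises
    raise IOError('file not found')


def lookup_ok():
    # hypothetical stand-in for a lookup that finds the connection string
    return strip_zk_url('zk://10.0.0.1:2181,10.0.0.2:2181/mesos')


print(first_hint([lookup_empty, lookup_failing, lookup_ok]))
```

With the check in place the empty and failing lookups are both skipped and the third one supplies the connection string.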

opaugam commented 8 years ago

what version/release of mesosphere are you using ? i'm running this in ECS using the mesosphere package install .. do you have an empty /etc/mesos/zk file ?

pferrot commented 8 years ago

Hi Olivier,

That is the point: the /etc/mesos/zk file does exist and contains the proper zookeeper information.

I installed Mesos+Marathon following the instructions here: https://www.digitalocean.com/community/tutorials/how-to-configure-a-production-ready-mesosphere-cluster-on-ubuntu-14-04#tutorial_series_31

On the host machine (Mesos slave), the /opt/mesosphere folder exists but is empty. The zk file is under /etc/mesos. So of course I mounted /etc/mesos in Ochothon as you can see here (btw maybe it should be the same in the original Ochothon): https://github.com/pferrot/ochothon/blob/8b0ca9ab24a9282886aaa7023dd2367f0dc29172/dcos.json#L37-L39

But still, I get the exception above without the suggested fix in Ochopod. Again: this used to work fine with the previous version of the autodeskcloud/ochothon:1.0.0 image, so not sure where the regression comes from exactly.

Cheers, Patrice

opaugam commented 8 years ago

ok .. not sure what to say at this point.

on the image that fails to start the pod.py could you do the following ->

- deploy it with a large timeout so that it does not get killed too fast
- docker exec into the container
- edit frameworks/marathon.py in /usr/local/lib/python2.7/ochopod and print what shell() returns at line 187

opaugam commented 8 years ago

actually - point me to a public image showing this issue and which i can deploy on my own ECS cluster .. i can troubleshoot it

opaugam commented 8 years ago

nah - i can reproduce it too .. this is random ?! i'm looking at it..

opaugam commented 8 years ago

mh ok - this is really super fishy .. so far it looks like in some cases the mount to /etc and /opt is not done properly, which may be some docker issue on the ECS VMs.. Please confirm the following : if you attempt to deploy the same image on ECS a few times it will eventually succeed.

opaugam commented 8 years ago

nope - this is a bug in the ochopod.core.utils shell() method ..

opaugam commented 8 years ago

yup - found it .. i'm again a complete moron

opaugam commented 8 years ago

ok - i re-pushed new pod:1.0.6 & ochothon:1.0.0 images with the fix.. make sure you docker pull pod:1.0.6 and rebuild/test your image on ECS

pferrot commented 8 years ago

Fantastic, thanks for the very quick fix :-)

autodesk-cloud / ochopod, issue #33