pferrot closed this issue 8 years ago
what version/release of mesosphere are you using ? i'm running this in ECS using the mesosphere package install .. do you have an empty /etc/mesos/zk file ?
Hi Olivier,
That is the point: the /etc/mesos/zk file does exist and contains the proper zookeeper information.
I installed Mesos+Marathon following the instructions here: https://www.digitalocean.com/community/tutorials/how-to-configure-a-production-ready-mesosphere-cluster-on-ubuntu-14-04#tutorial_series_31
On the host machine (Mesos slave), the /opt/mesosphere folder exists but is empty. The zk file is under /etc/mesos. So of course I mounted /etc/mesos in Ochothon as you can see here (btw maybe it should be the same in the original Ochothon): https://github.com/pferrot/ochothon/blob/8b0ca9ab24a9282886aaa7023dd2367f0dc29172/dcos.json#L37-L39
But still, I get the exception above without the suggested fix in Ochopod. Again: this used to work fine with the previous version of the autodeskcloud/ochothon:1.0.0 image, so not sure where the regression comes from exactly.
Cheers, Patrice
ok .. not sure what to say at this point.
on the image that fails to start the pod.py could you do the following ->
actually - point me to a public image showing this issue and which i can deploy on my own ECS cluster .. i can troubleshoot it
nah - i can reproduce it too .. this is random ?! i'm looking at it..
mh ok - this is really super fishy .. so far it looks like in some cases the mount to /etc and /opt is not done properly, which may be some docker issue on the ECS VMs.. Please confirm the following : if you attempt to deploy the same image on ECS a few times it will eventually succeed.
nope - this is a bug in the ochopod.core.utils shell() method ..
yup - found it .. i'm again a complete moron
ok - i re-pushed new pod:1.0.6 & ochothon:1.0.0 images with the fix.. make sure you docker pull pod:1.0.6 and rebuild/test your image on ECS
Fantastic, thanks for the very quick fix :-)
It seems there is a regression with the recent autodeskcloud/pod:1.0.6 image. The same image tag used to work very well in ECS, but now it crashes at startup with the following exception:
I could fix it in my fork by adding the "assert hints['zk']" snippet in /ochopod/frameworks/marathon.py (see below and https://github.com/pferrot/ochopod/commit/a436f63a87ec26e46e826a9eabda7c883b25c8d9).
I am not sure about the root cause, but essentially, it seems that the _1() method does not raise an exception anymore (as it is supposed to when no zookeeper info can be found), but returns an empty string instead, which basically means that _2() and _3() will not be executed.