autodesk-cloud / ochopod

Your friendly orchestration overlay over Mesos, K8S and more !
http://autodesk-cloud.github.io/ochopod/
Apache License 2.0
122 stars 20 forks source link

Issue with wildcard dependencies: fail to detect new pod #36

Closed pferrot closed 8 years ago

pferrot commented 8 years ago

I am afraid that there is a bug with wildcard dependencies.

In my pull request, there was special handling in case the parent node exists but the 'snapshot' node does not exist yet, see https://github.com/autodesk-cloud/ochopod/pull/32/files and those few lines in particular:

fullPath = '%s/%s/snapshot' % (ROOT, child)
# 
# - the parent node may exist before the 'snapshot' node does
#
if self.zk.exists(fullPath, watch=self.feedback):

It seems that this logic has been dropped when the pull request has been integrated. As a result, it can happen that a pod fails to be notified when a "wildcard dependency" pod joins.

opaugam commented 8 years ago

the zk.get would fail as well if the parent does not exist, plus it will leave the watch on.. how do you reproduce this bug ?

pferrot commented 8 years ago

From the behavior, it does not seem that the watch is on when zk.get makes a NoNodeError (actually it would be weird if it were on since an exception is thrown, would it not?).

I can reproduce by adding pods that should trigger a notification to the watcher pod after the watcher pod is deployed. But it is not constantly reproducible since it is a race condition issue. I can pretty often reproduce though.

opaugam commented 8 years ago

ok - give me a scenario to test against in the CLI. I'll look at it today whenever i can.

pferrot commented 8 years ago

Well, the scenario is:

Again: since this is a race condition issue, the test may succeed from time to time.

I fixed the issue in my fork and it seems to work as expected every time now, see https://github.com/pferrot/ochopod/blob/715d271d2a549e033a7d1f32c7e06a1eb890c0c4/ochopod/watchers/remote.py#L107-L110

opaugam commented 8 years ago

ok - fix mirroring you branch coming within the next 6 hours .. i'm totally swamped today :(

opaugam commented 8 years ago

aaah no i get it - this is because we reset the flip-flop .. the normal case does not require that extra call to exists() because if it fails (snapshot not there yet) the self.query flip-flop will be left to true..

man i told you i'm getting too old to code..

opaugam commented 8 years ago

pod 1.0.7 updated