apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

excessive delay on new deployment where invoker does not appear to receive kafka message from controller #713

Closed rabbah closed 8 years ago

rabbah commented 8 years ago

Observed on master branch, running locally on a mac with docker-machine. Steps to reproduce:

> redo setup prereqs couchdb initdb deploy
> wsk action invoke /whisk.system/samples/echo -u $(cat ansible/files/auth.guest)

Watch the controller logs to confirm that the kafka message was posted to the message bus:

> wskadmin syslog get 29 controller
...
[2016-06-21T14:50:18.584Z] [INFO] [#tid_29] [LoadBalancerService] user has 1 activations posted. Posted to invoke0[0][0]

Watch the invoker logs and confirm that only somewhere between 4 and 5 minutes later does the invoker pick up the kafka message and invoke the action:

> wskadmin syslog get 29 invoker0 # repeat several minutes later
[2016-06-21T14:54:45.791Z] [INFO] [#tid_29] [Dispatcher] matching message to 1 handlers: invoker

@psuter @perryibm

rabbah commented 8 years ago

@lehoanganh @domdom82 Please review the commit above (https://github.com/psuter/openwhisk/commit/7f8acd7def29f99d75de204fecd1a1e2ed6c8df4). The ansible roles for kafka and zookeeper were not correct and did not preserve the semantics of isAlive. This caused some combination of zookeeper/kafka misconfiguration that surfaces in the invoker as the symptom described above. The symptom seems to resolve itself after several minutes. We have not yet completely explained the issue, but restoring the isAlive semantics is important.
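
To make the intent concrete, here is a minimal sketch (not the actual ansible role) of what an isAlive-preserving check amounts to, assuming zookeeper's ruok four-letter command is enabled and kafka listens on its default port:

```python
import socket

def zk_is_alive(host="localhost", port=2181, timeout=5):
    # ZooKeeper counts as alive only if it answers the 'ruok' command with 'imok'.
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False

def kafka_is_listening(host="localhost", port=9092, timeout=5):
    # Weak check: the broker socket accepts connections (necessary but not sufficient).
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False
```

A role that reports success without actually probing the service is what produces the false-positive "is alive" responses mentioned below.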

The reason this symptom does not surface in Travis CI or local tests (unless you repeat the steps outlined above) is that several minutes' worth of tests run before the first test that requires the invoker to be involved.

All roles should be checked for semantic equivalence with isAlive. @psuter and I fixed the two immediate ones that are impacting/blocking work related to streaming db support (needed for zip file uploads and the like). You should verify that the other components are not also giving false-positive "is alive" responses.

The consul alive check is affected by this as well, and possibly others are too. I defer to you to investigate further.

FYI @jeremiaswerner

rabbah commented 8 years ago

Reopened - observing a similar issue with controller on active ack.

rabbah commented 8 years ago

Bumping the zookeeper/kafka startup gap hid the issue again. This is unsatisfactory; we need to dig in and understand what's going on with zk/kafka at startup.

@perryibm this affects the controller as well, via the active ack response.

lehoanganh commented 8 years ago

Do you mean that, with the open changes, the blue deployment still fails?

perryibm commented 8 years ago

I observed this on my own local deployment, on ubuntu rather than mac. I did not experience the controller->invoker problem, but I have repeatedly run into the active ack (invoker->controller) issue, where it takes 5 minutes before the path starts working, resulting in all blocking calls taking 30 seconds.

@rabbah Unfortunately, the sleep did not fix the problem with the reverse path. Even a 5-second sleep did not help, and neither did a 5-second pre-sleep plus a 5-second post-sleep around zk.

Update: I tried a 5-minute pause after kafka came up. Still no good, so I am not sure the zk/kafka deployment is entirely at fault. I manually added back autoCommit and saw no improvement.
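
For what it's worth, instead of a fixed sleep, a readiness check could wait until the broker actually registers itself in zookeeper. A rough sketch, assuming the kazoo client library is available and the standard /brokers/ids layout (this is not part of the deployment today):

```python
import time
from kazoo.client import KazooClient  # assumption: kazoo is available where the check runs

def wait_for_broker_registration(zk_hosts="localhost:2181", expected=1, timeout=300):
    # Poll ZooKeeper until Kafka brokers show up under /brokers/ids,
    # rather than sleeping for a fixed interval after the container starts.
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        deadline = time.time() + timeout
        while time.time() < deadline:
            if zk.exists("/brokers/ids"):
                ids = zk.get_children("/brokers/ids")
                if len(ids) >= expected:
                    return ids
            time.sleep(2)
        raise TimeoutError("kafka brokers did not register with zookeeper in time")
    finally:
        zk.stop()
```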

perryibm commented 8 years ago

I tested Kafka 0.10.0.0 (scala 2.11), and changing the server alone did not fix this. However, using the 0.10.0.0 client libs appears to have fixed the problem. I will be retesting this and hopefully then adding a new unit test with a name like (aaa.aaa....) so that it runs first, as soon after deploy as possible.
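
As a sketch of the kind of early-running check I have in mind (hypothetical, not the eventual unit test), one could simply time a blocking invoke of the echo action right after deploy, using the CLI and guest auth from the repro steps above:

```python
import subprocess
import time

# Hypothetical probe: time a blocking echo invoke immediately after deploy,
# using the guest auth key from the repro steps.
auth = open("ansible/files/auth.guest").read().strip()

start = time.time()
subprocess.run(
    ["wsk", "action", "invoke", "/whisk.system/samples/echo",
     "--blocking", "-u", auth],
    check=False)
print("blocking echo invoke took %.1f seconds" % (time.time() - start))
```

If the active ack path is in its bad state, the call runs up against the ~30 second blocking limit instead of returning in about a second.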

perryibm commented 8 years ago

Unfortunately, retesting showed the problem is still present, and it now appears to be hitting the controller->invoker path instead. This is slightly comforting, because at least it is not specific to one path, which would have made no sense. But back to the drawing board....

perryibm commented 8 years ago

So it's not that the period of badness is 5 minutes; every single post is held back for about 5 minutes. This was obscured (mitigated) by active ack timing out, so that 30 seconds seemed like the timing to look for.

[2016-06-30T18:03:49.471Z] [INFO] [#tid_65] [LoadBalancerService] user has 1 activations posted. Posted to invoke0[0][0]
[2016-06-30T18:05:01.685Z] [INFO] [#tid_72] [LoadBalancerService] user has 2 activations posted. Posted to invoke0[0][1]
[2016-06-30T18:06:09.062Z] [INFO] [#tid_77] [LoadBalancerService] user has 3 activations posted. Posted to invoke0[0][2]
[2016-06-30T18:07:17.123Z] [INFO] [#tid_86] [LoadBalancerService] user has 4 activations posted. Posted to invoke0[0][3]
[2016-06-30T18:07:41.818Z] [INFO] [#tid_65] [Dispatcher] matching message to 1 handlers: invoker
[2016-06-30T18:07:41.898Z] [INFO] [#tid_72] [Dispatcher] matching message to 1 handlers: invoker
[2016-06-30T18:07:41.899Z] [INFO] [#tid_77] [Dispatcher] matching message to 1 handlers: invoker
[2016-06-30T18:07:41.902Z] [INFO] [#tid_86] [Dispatcher] matching message to 1 handlers: invoker
[2016-06-30T18:07:45.744Z] [INFO] [#tid_88] [Dispatcher] matching message to 1 handlers: invoker

perryibm commented 8 years ago

This is a frustrating bug....

  1. The problem doesn't occur every time but seems to happen most of the time (>75%). Having to re-deploy the system repeatedly makes this a time-consuming experiment.
  2. Upgrading to kafka 0.10 (both kafka and client libs) apparently did not affect this bug.
  3. While this problem was happening (in those runs), I was able to run a producer/consumer pair (https://github.com/mapr-demos/kafka-sample-programs.git) on another topic during the 5 minute period without any bad behavior (a minimal sketch of that kind of probe follows after this list). This suggests there is nothing wrong with the kafka and zk containers, but rather that the fault lies with the client libraries.
  4. Checking consumer offsets did not reveal anything because this is consumer-centric data.
  5. Long weekend follows....
  6. The bug appears to have disappeared and just now it worked 5 times in a row. Review of the invoker and controller code shows no reason why invoker->controller would act any differently than the controller->invoker.
  7. The kafka and zk containers are started without mapping to the host so there should be no state passing between subsequent container runs.
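
For reference, the external producer/consumer probe mentioned in item 3 amounts to something like the following (a minimal sketch assuming the kafka-python client and a broker reachable on localhost:9092; the actual experiment used the mapr-demos Java samples):

```python
# Mirrors the external producer/consumer probe; this is not OpenWhisk code.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("probe-topic", b"ping")
producer.flush()

consumer = KafkaConsumer("probe-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000)  # give up after 10s rather than minutes
for message in consumer:
    print("received:", message.value)
    break
```
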
rabbah commented 8 years ago

@perryibm did you check whether the magical 2 seconds can be removed from openwhisk yet?

perryibm commented 8 years ago

For me, the magical 2 seconds is no longer needed. However, I was never able to reproduce this variant of the bug, so my testing is not definitive. I think our distributed tests also did not surface it, so you will have to test this yourself, as it showed up reliably in your setup.