edwardcapriolo / teknek

Add blacklisting #18

Open edwardcapriolo opened 10 years ago

edwardcapriolo commented 10 years ago

We recently had a plan that was misconfigured and could not start. It correctly kept firing off WorkerStart exceptions, but that gets somewhat spammy and may turn into a fork bomb. We should track failed startup attempts, potentially inside a new path in ZK. If a process attempts to start N times and fails, we should blacklist it for some period of time. Also, the code that determines job start, TeknekDeamon.considerStarting(), is getting fairly beefy and a touch hard to test. This would be a good time to refactor it in a way that eases testing. @sinemetu1

edwardcapriolo commented 10 years ago

To break the scenario down more clearly: we had a plan that was designed to read from Kafka and write to Cassandra. The setProperties method of the operator was attempting to establish an Astyanax connection pool, which was failing because of a misconfiguration. Each scan cycle, a worker attempted to start the operator and failed with a RuntimeException. These failures are probably being logged at info, which should be raised to warn. It would be nice if, after the cluster failed to start a given operator a certain number of times, it created an entry in ZK that would put the plan to sleep for a while without permanently disabling it. Other workers could notice this entry in the considerStarting phase and return quickly.
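
A rough sketch of the bookkeeping described above, to make the idea concrete. Everything here is hypothetical: the class name `StartFailureTracker`, its methods, and the thresholds are made up for illustration, and the state lives in in-memory maps as a stand-in for the ZK path, so this only shows the counting and expiry logic, not how it would be shared between workers.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not teknek code: counts failed start attempts per plan
 * and puts a plan to sleep for a fixed period once it fails maxFailures times.
 * In-memory maps stand in for the proposed ZK entries.
 */
class StartFailureTracker {
    private final int maxFailures;
    private final long blacklistMillis;
    // Failed start attempts since the last success or blacklisting.
    private final Map<String, Integer> failureCounts = new HashMap<>();
    // Time (epoch millis) at which each blacklisted plan may be tried again.
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    StartFailureTracker(int maxFailures, long blacklistMillis) {
        this.maxFailures = maxFailures;
        this.blacklistMillis = blacklistMillis;
    }

    /** Call when starting a plan's operator throws. */
    synchronized void recordFailure(String planName, long nowMillis) {
        int failures = failureCounts.merge(planName, 1, Integer::sum);
        if (failures >= maxFailures) {
            // Sleep the plan for a while rather than disabling it permanently.
            blacklistedUntil.put(planName, nowMillis + blacklistMillis);
            failureCounts.remove(planName);
        }
    }

    /** Call when a plan starts cleanly; clears any recorded failures. */
    synchronized void recordSuccess(String planName) {
        failureCounts.remove(planName);
        blacklistedUntil.remove(planName);
    }

    /** Cheap check a worker could make before attempting a start. */
    synchronized boolean isBlacklisted(String planName, long nowMillis) {
        Long until = blacklistedUntil.get(planName);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // The sleep period is over, so the plan may be tried again.
            blacklistedUntil.remove(planName);
            return false;
        }
        return true;
    }
}
```

The considerStarting phase could then return early when `isBlacklisted(planName, System.currentTimeMillis())` is true. Passing the clock value in as a parameter, rather than reading it inside the class, is a deliberate choice to keep the logic easy to unit test.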

edwardcapriolo commented 10 years ago

As a first pass, I cleaned up the logging, made it more consistent, and used the proper log levels. This probably would have helped us locate the misconfiguration sooner. https://github.com/edwardcapriolo/teknek-core/pull/9