Netflix / SimianArmy

Tools for keeping your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
Apache License 2.0

BasicChaosMonkey.doMonkeyBusiness() method exits without finishing its job #274

Open yufengJ opened 8 years ago

yufengJ commented 8 years ago

Hi all,

I've observed that BasicChaosMonkey.doMonkeyBusiness() suddenly returns without finishing the rest of its happy path. No exception is thrown and no error message is logged.

The jettyRun output is as follows:

2016-09-08 16:31:16.328 - INFO  BasicChaosInstanceSelector - [BasicChaosInstanceSelector.java:65] Randomly selecting 1 from 3 instances, excluding null
2016-09-08 16:31:16.563 - INFO  Monkey - [Monkey.java:138] Reporting what I did...

I set up a debugger to trace this. Execution ends up in org.jclouds.ContextBuilder. The stack dump is:

"pool-1-thread-1@9515" prio=5 tid=0x1d nid=NA runnable
  java.lang.Thread.State: RUNNABLE
    at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:588)
    at com.netflix.simianarmy.client.aws.AWSClient.getJcloudsComputeService(AWSClient.java:818)
    - locked <0x2989> (a com.netflix.simianarmy.client.aws.AWSClient)
    at com.netflix.simianarmy.client.aws.AWSClient.connectSsh(AWSClient.java:834)
    at com.netflix.simianarmy.chaos.ChaosInstance.connectSsh(ChaosInstance.java:123)
    at com.netflix.simianarmy.chaos.ChaosInstance.canConnectSsh(ChaosInstance.java:101)
    at com.netflix.simianarmy.chaos.ScriptChaosType.canApply(ScriptChaosType.java:60)
    at com.netflix.simianarmy.basic.chaos.BasicChaosMonkey.pickChaosType(BasicChaosMonkey.java:141)
    at com.netflix.simianarmy.basic.chaos.BasicChaosMonkey.doMonkeyBusiness(BasicChaosMonkey.java:121)
    at com.netflix.simianarmy.Monkey.run(Monkey.java:134)
    at com.netflix.simianarmy.Monkey$1.run(Monkey.java:155)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
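
A note on why nothing may be logged: the trace passes through FutureTask.runAndReset, which is how ScheduledThreadPoolExecutor runs periodic tasks. FutureTask catches any Throwable the task throws and stores it in the task's Future instead of printing it, so an unchecked Error (for example a linkage error from conflicting jars) can end a run with no log output. Whether that is what happens in this case is an assumption. The sketch below only demonstrates the executor behaviour; the names SwallowedErrorDemo and failureOf are hypothetical, the NoSuchMethodError is simulated, and a one-shot schedule() call is used for brevity (it captures failures the same way).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class SwallowedErrorDemo {

    /**
     * Runs the task once on a scheduled executor and returns whatever
     * Throwable the executor captured, or null if the task finished normally.
     */
    static Throwable failureOf(Runnable task) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            Future<?> future = pool.schedule(task, 0, TimeUnit.MILLISECONDS);
            try {
                future.get();           // the failure only surfaces here
                return null;
            } catch (ExecutionException e) {
                return e.getCause();    // the Throwable the task actually threw
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return e;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated failure: nothing is printed by the executor itself.
        Throwable captured = failureOf(() -> {
            throw new NoSuchMethodError("simulated linkage failure");
        });
        System.out.println("captured by the Future: " + captured);
    }
}
```

If this is the cause, wrapping the scheduled task body in a try/catch for Throwable that logs before returning would make the failure visible.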

I've observed the issue on the master branch and on tag v2.5.1. Tag v2.5.0 works fine, and I had been using it without problems, so I suspect a dependency change between these versions is causing this. However, a diff of build.gradle between the versions shows that jclouds was not upgraded, so I am not sure where to look next.

$ diff master_branch/build.gradle tag_v2.5.0/build.gradle
1,6d0
< buildscript {
<     repositories {
<         jcenter()
<     }
< }
<
8c2
<     id 'nebula.netflixoss' version '3.2.3'
---
>     id 'nebula.netflixoss' version '2.2.9'
18c12
< repositories {
---
> repositories {
26,28d19
< sourceCompatibility = 1.7
< targetCompatibility = 1.7
<
36c27,28
<     compile 'com.sun.jersey:jersey-servlet:1.19'
---
>     compile 'com.sun.jersey:jersey-core:1.11'
>     compile 'com.sun.jersey:jersey-servlet:1.11'
40c32,34
<     compile 'com.netflix.eureka:eureka-client:1.4.1'
---
>     compile('com.netflix.eureka:eureka-client:1.1.22') {
>         exclude group: 'com.sun.jersey', module: 'jersey-bundle'
>     }
49a44
>     compile 'ch.qos.logback:logback-classic:1.0.13'
51,52d45
<     compile 'org.springframework:spring-jdbc:4.2.5.RELEASE'
<     compile 'com.zaxxer:HikariCP:2.4.7'

I will probably dig deeper into this. Has anyone run into this issue before?

ebukoski commented 8 years ago

I ran into something very similar with Janitor: it died in an AWS API call with no log message. To trace it, I created a one-off JSP that invoked the same API call so I could get a full stack trace. In my case it was a version mismatch between the AWS client library in open source SimianArmy and a different AWS client jar that was being pulled in by our non-open source version.

I upgraded the AWS client to 1.11.9 and it resolved the issue for me. I have an open PR to introduce this to the main code line.
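
For anyone chasing a similar jar mismatch, here is a minimal sketch of one way to see which jar a class is actually loaded from at runtime. The class name JarLocator and the method locationOf are hypothetical, and the class names in main are only examples of libraries mentioned in this thread; they print null when the library is not on the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

final class JarLocator {

    /**
     * Returns the location (usually a jar path) the named class is loaded from,
     * or null if the class cannot be loaded or belongs to the JDK itself.
     */
    static URL locationOf(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? null : source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Example class names only; swap in whatever library is suspected.
        String[] names = {
            "com.amazonaws.services.ec2.AmazonEC2Client",
            "com.google.inject.Injector",
            "org.jclouds.ContextBuilder"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

Running this inside the same webapp (for example from a one-off JSP or servlet) shows whether the jar that wins on the classpath is the version the build is expected to use.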

yufengJ commented 8 years ago

Thanks for the suggestions! It turned out to be the same issue as https://github.com/Netflix/SimianArmy/issues/259.

The problem was fixed by excluding com.google.inject from the eureka-client dependency:

compile ('com.netflix.eureka:eureka-client:1.4.1') {
        exclude group: 'com.google.inject'
}

pwhitham commented 8 years ago

Nice! I ran into this just recently, and the dependency exclusion also solved the issue for me.

Thanks!