Netflix / SimianArmy

Tools for keeping your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
Apache License 2.0

Unable to perform SSH cases - HTTP 401 being returned from a wrong region #285

Open VinnieGogniti opened 7 years ago

VinnieGogniti commented 7 years ago

Hello Everyone,

I've been stuck on this issue for a week now. I've looked through all the threads related to it, and apparently it's an open issue with no definitive solution yet.

The issue: even though the region in my client config is "us-west-2", my SSH cases are failing with an HTTP 401 from the wrong region. I searched the entire code base and replaced every "us-east-1" reference with "us-west-2", but I still can't get around this. I believe the code must be making an AWS SDK call to fetch the current region via the API, somehow getting "us-east-1" returned, and overriding my config. This has baffled me for days.

If anyone has resolved this before or can think of a solution, please help. The error log follows. Thank you!

2016-12-13 05:24:05.356 - INFO BasicChaosInstanceSelector - [BasicChaosInstanceSelector.java:65] Randomly selecting 2 from 2 instances, excluding null
2016-12-13 05:24:07.084 - WARN ChaosInstance - [ChaosInstance.java:105] Error making SSH connection to instance
org.jclouds.rest.AuthorizationException: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized
. . . .
2016-12-13 05:24:07.089 - WARN ScriptChaosType - [ScriptChaosType.java:61] Strategy disabled because SSH credentials failed
2016-12-13 05:24:07.089 - WARN BasicChaosMonkey - [BasicChaosMonkey.java:124] No chaos type was applicable to the instance: i-009863xxxxxx
2016-12-13 05:24:07.205 - WARN ChaosInstance - [ChaosInstance.java:105] Error making SSH connection to instance
org.jclouds.rest.AuthorizationException: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized
    at org.jclouds.aws.handlers.ParseAWSErrorFromXmlContent.refineException(ParseAWSErrorFromXmlContent.java:122)

ebukoski commented 7 years ago

In which region are you running Chaos Monkey, and which region contains the instance you are trying to terminate?

VinnieGogniti commented 7 years ago

Both are "us-west-2".

ebukoski commented 7 years ago

This is the important line: https://github.com/Netflix/SimianArmy/blob/master/src/main/java/com/netflix/simianarmy/basic/BasicSimianArmyContext.java#L137

Make sure the property simianarmy.client.aws.region (see https://github.com/Netflix/SimianArmy/wiki/Client-Settings) is set and is being consumed by Chaos Monkey.

VinnieGogniti commented 7 years ago

Thanks for responding. I did replace that part (and everywhere else it's hardcoded), but it still doesn't work. Here is how I have it set in my code.

String defaultRegion = "us-west-2";
Region currentRegion = Regions.getCurrentRegion();

if (currentRegion != null) {
    // defaultRegion = currentRegion.getName();  (original line, commented out)
    defaultRegion = "us-west-2";
}

// The configured property takes precedence; defaultRegion is only the fallback.
region = config.getStrOrElse("simianarmy.client.aws.region", defaultRegion);
GLOBAL_OWNER_TAGKEY = config.getStrOrElse("simianarmy.tags.owner", "owner");

And of course, I didn't overlook the region property in the client config. It is set to us-west-2, and Chaos Monkey does consume it: it fetches all the available auto-scaling groups in the west region and gets as far as randomly picking an instance (in the specified ASG) to SSH into. That is where it gets an HTTP 401 from the east region, as you can see in the log in my first post.

simianarmy.client.aws.region = us-west-2
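
For readers following along, the lookup order in the snippet above can be sketched with plain JDK classes. This is a minimal illustration, not SimianArmy code: the class RegionResolution and the method resolveRegion are hypothetical names, and treating a blank value as missing is an assumption about behaviour that the thread does not confirm.

```java
import java.util.Properties;

// Hypothetical helper, not part of SimianArmy. It imitates the lookup order of
// config.getStrOrElse("simianarmy.client.aws.region", defaultRegion).
class RegionResolution {
    static final String REGION_KEY = "simianarmy.client.aws.region";

    /** Returns the configured region if present, otherwise the fallback. */
    static String resolveRegion(Properties config, String fallback) {
        String configured = config.getProperty(REGION_KEY);
        // Assumption: a blank value is treated the same as a missing one.
        if (configured == null || configured.trim().isEmpty()) {
            return fallback;
        }
        return configured.trim();
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(REGION_KEY, "us-west-2");
        // The property wins, so the hard-coded fallback is never consulted.
        System.out.println(resolveRegion(config, "us-east-1")); // us-west-2
        // Without the property, the fallback is used.
        System.out.println(resolveRegion(new Properties(), "us-east-1")); // us-east-1
    }
}
```

If the real lookup behaves like this, editing the hard-coded default changes nothing once the property is set, which would suggest the us-east-1 endpoint is chosen somewhere else.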

VinnieGogniti commented 7 years ago

@ebukoski I suspect this is the piece of code that constructs the EC2 client endpoint, ec2.us-east-1.amazonaws.com, the one that appears in the HTTP 401 error. Please review and let me know your thoughts. https://github.com/Netflix/SimianArmy/blob/master/src/main/java/com/netflix/simianarmy/client/aws/AWSClient.java#L215

Error log:
org.jclouds.rest.AuthorizationException: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized
. . .
2016-12-13 05:24:07.089 - WARN ScriptChaosType - [ScriptChaosType.java:61] Strategy disabled because SSH credentials failed
2016-12-13 05:24:07.089 - WARN BasicChaosMonkey - [BasicChaosMonkey.java:124] No chaos type was applicable to the instance: i-009863xxxxxx
2016-12-13 05:24:07.205 - WARN ChaosInstance - [ChaosInstance.java:105] Error making SSH connection to instance
org.jclouds.rest.AuthorizationException: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized
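
When scanning longer logs for this symptom, a small check like the following may help flag requests that go to a region other than the configured one. It is only a sketch: EndpointRegionCheck and its methods are hypothetical names, and the pattern assumes regional EC2 hosts of the form ec2.REGION.amazonaws.com as seen in the log above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of SimianArmy or jclouds.
class EndpointRegionCheck {
    // Matches regional EC2 hosts such as ec2.us-east-1.amazonaws.com.
    private static final Pattern EC2_HOST =
            Pattern.compile("https?://ec2\\.([a-z0-9-]+)\\.amazonaws\\.com");

    /** Extracts the region from the first regional EC2 endpoint in a log line. */
    static Optional<String> regionOf(String logLine) {
        Matcher m = EC2_HOST.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True when the line names an EC2 endpoint outside the configured region. */
    static boolean isWrongRegion(String logLine, String configuredRegion) {
        return regionOf(logLine).map(r -> !r.equals(configuredRegion)).orElse(false);
    }

    public static void main(String[] args) {
        String line = "org.jclouds.rest.AuthorizationException: POST "
                + "https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized";
        System.out.println(regionOf(line).orElse("none"));     // us-east-1
        System.out.println(isWrongRegion(line, "us-west-2"));  // true
    }
}
```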

VinnieGogniti commented 7 years ago

I am willing to pay a reasonable amount to anyone who can fix this.

jsuh555 commented 7 years ago

I'm having the same problem, but I get:
Caused by: org.jclouds.http.HttpResponseException: request: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 [Action=DescribeRegions] failed with response: HTTP/1.1 401 Unauthorized

I'm trying to see if there is something wrong with my IAM user or role permissions.

I wonder whether the temporary credentials retrieved from the Amazon STS service aren't valid right away and need some time (a few seconds?) before they work with the EC2 describe-regions API. Just guessing; I'm not an AWS expert by any means.

VinnieGogniti commented 7 years ago

If that HTTP 401 is being thrown from a different region than the one in your client config, then it certainly is a bug and it has been open and unaddressed for a very long time.

Ten48BASE commented 7 years ago

Have you both confirmed that the property simianarmy.client.aws.region exists in your properties file and is being consumed by Chaos Monkey, as Ed suggested?

Also, check out this Region Detection feature: https://github.com/Netflix/SimianArmy/pull/233

VinnieGogniti commented 7 years ago

Yes, during startup I can see it consuming the region and detecting all the auto-scaling groups available in that region. It actually gets as far as picking an instance in that region for a termination strategy. That is where it gets an HTTP 401 from a different region (us-east-1). I'm attaching the logs again for your reference.

AWSClient - [AWSClient.java:360] Got 37 auto-scaling groups in region us-west-2.
. . .
INFO BasicChaosInstanceSelector - [BasicChaosInstanceSelector.java:65] Randomly selecting 1 from 2 instances, excluding null
INFO ScriptChaosType - [ScriptChaosType.java:73] Running script for BurnCpu on instance i-0995xxxx
ERROR BasicChaosMonkey - [BasicChaosMonkey.java:201] failed to terminate instance i-0995xxxx
org.jclouds.rest.AuthorizationException: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 -> HTTP/1.1 401 Unauthorized

Ten48BASE commented 7 years ago

Taking a shot in the dark here: if you look at the error, it is an authentication error from the AWS API, not an error from actually making the SSH connection.

When connecting via SSH, Chaos Monkey passes only the instanceId to the connectSsh method, not the instanceId and region. It is possible that Apache jclouds queries multiple regions to locate your instance so that it can populate the NodeMetadata. Check this method: https://github.com/Netflix/SimianArmy/blob/master/src/main/java/com/netflix/simianarmy/client/aws/AWSClient.java#L880

Is it possible that the IAM credentials your Monkey is using don't have read access to the API in the us-east-1 region? Are you restricting the regions the Monkey is allowed to query?

VinnieGogniti commented 7 years ago

Not that I'm aware of. From the monkey instance, I can manually run "aws ec2 describe-instances --region us-east-1" against the east region without any issues. The Chaos Monkey instance role has full EC2 permissions and, as far as I can tell, is not restricted by region.

aws ec2 describe-instances --region us-east-1
Output (nothing is running in the east region):
{ "Reservations": [] }

Is it possible to restrict Apache jclouds to query only the region specified in the AWS client config, which is us-west-2 in this case?
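
One avenue that may be worth trying is passing override properties to jclouds. The sketch below only builds the properties with JDK classes; JcloudsRegionOverrides and forRegion are hypothetical names. The keys "jclouds.regions" and "jclouds.endpoint" are, to my understanding, jclouds configuration properties, but nothing in this thread confirms that they stop the DescribeRegions call to us-east-1, so check them against the jclouds version SimianArmy bundles.

```java
import java.util.Properties;

// Hypothetical helper: builds jclouds override properties for a single region.
class JcloudsRegionOverrides {
    static Properties forRegion(String region) {
        Properties overrides = new Properties();
        // Assumption: "jclouds.regions" limits the regions jclouds works with.
        overrides.setProperty("jclouds.regions", region);
        // Assumption: "jclouds.endpoint" replaces the default EC2 endpoint.
        overrides.setProperty("jclouds.endpoint",
                "https://ec2." + region + ".amazonaws.com");
        return overrides;
    }

    public static void main(String[] args) {
        Properties overrides = forRegion("us-west-2");
        // With jclouds on the classpath, these would be handed over roughly as
        // ContextBuilder.newBuilder("aws-ec2").overrides(overrides)...
        overrides.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The region would normally come from simianarmy.client.aws.region rather than a literal, so that the two settings cannot drift apart.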

jsuh555 commented 7 years ago

AWS support tried to replicate my issue. They could reproduce it only when using IAM roles, not when using a regular user access key and secret key. The technician also couldn't see any API request being made, so it appears something is wrong with the signature used for the describe-regions API request.

I should also mention that I only get the 401 error when triggering a terminate-on-demand via HTTP POST.

VinnieGogniti commented 7 years ago

When I use my AWS access key and secret key, it fails at the step that creates the SimpleDB domain, again in the wrong region (us-east-1). It doesn't seem to recognize that I have the "us-west-2" region in my client config. Is there any way to make this monkey work at all?

WARN SimpleDBRecorder - [SimpleDBRecorder.java:287] Error while trying to auto-create SimpleDB domain
com.amazonaws.services.simpledb.model.AmazonSimpleDBException: User (arn:aws:iam::xxxxx:user/xxxx) does not have permission to perform (sdb:ListDomains) on resource (arn:aws:sdb:us-east-1:xxxx:domain/). Contact account owner. (Service: AmazonSimpleDB; Status Code: 403; Error Code: AuthorizationFailure;

jsuh555 commented 7 years ago

There is something wrong with your amazon permissions.
I have no problems writing to and reading from simpleDB. I'm in region us-west-2, but this is not specified in my client.properties

Try these permissions; see the attachment simpleDB_permissions.txt.

VinnieGogniti commented 7 years ago

I have the following permissions, which grant full EC2, ASG and SDB access regardless of region. My problem is that it attempts to create the SDB domain in a different region from the one specified in my client config, and only when I use my AWS access key and secret key for permissions.

{
  "Statement": [
    {
      "Sid": "Globals",
      "Action": [
        "autoscaling:*",
        "ec2:*",
        "elasticloadbalancing:*",
        "sdb:*",
        "ses:SendEmail"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Error: User (arn:aws:iam::xxxxx:user/xxxx) does not have permission to perform (sdb:ListDomains) on resource (arn:aws:sdb:us-east-1:xxxx:domain/).

VinnieGogniti commented 7 years ago

I ran the build with extended logging enabled, and I can now see useful stack trace information that wasn't exposed before. At this point, I'm almost certain the issue is within the Apache jclouds library: the code calls the AWS API with the instance ID and credentials, but somehow gets back the wrong region (maybe the API default) and receives a 401 "unauthorized" error for the east region. Maybe Apache jclouds or the AWS/EC2 API needs to be updated, but would that really solve the issue?

Any thoughts, from anyone? How do I override it in the code to return "us-west-2"?

at org.jclouds.aws.ec2.compute.strategy.AWSEC2ListNodesStrategy.pollRunningInstances(AWSEC2ListNodesStrategy.java:65)
22:02:39.476 [QUIET] [system.out] at org.jclouds.ec2.compute.strategy.EC2ListNodesStrategy.listDetailsOnNodesMatching(EC2ListNodesStrategy.java:107)
22:02:39.476 [QUIET] [system.out] at org.jclouds.ec2.compute.strategy.EC2ListNodesStrategy.listNodes(EC2ListNodesStrategy.java:86)
22:02:39.476 [QUIET] [system.out] at org.jclouds.ec2.compute.strategy.EC2ListNodesStrategy.listNodes(EC2ListNodesStrategy.java:58)
22:02:39.476 [QUIET] [system.out] at org.jclouds.compute.internal.BaseComputeService.listNodes(BaseComputeService.java:335)
22:02:39.477 [QUIET] [system.out] at com.netflix.simianarmy.client.aws.AWSClient.getJcloudsNode(AWSClient.java:906)
22:02:39.477 [QUIET] [system.out] at com.netflix.simianarmy.client.aws.AWSClient.connectSsh(AWSClient.java:886)
22:02:39.477 [QUIET] [system.out] at com.netflix.simianarmy.chaos.ChaosInstance.connectSsh(ChaosInstance.java:123)
22:02:39.477 [QUIET] [system.out] at com.netflix.simianarmy.chaos.ChaosInstance.canConnectSsh(ChaosInstance.java:101)
22:02:39.477 [QUIET] [system.out] at com.netflix.simianarmy.chaos.ScriptChaosType.canApply(ScriptChaosType.java:60)
22:02:39.478 [QUIET] [system.out] at com.netflix.simianarmy.basic.chaos.BasicChaosMonkey.pickChaosType(BasicChaosMonkey.java:141)
22:02:39.478 [QUIET] [system.out] at . . .
22:02:39.480 [QUIET] [system.out] Caused by: org.jclouds.http.HttpResponseException: request: POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 [Action=DescribeRegions] failed with response: HTTP/1.1 401 Unauthorized

mlafeldt commented 7 years ago

To find the source of this problem, it might also help to use an artifact that is known to work, e.g. this Docker image: https://github.com/mlafeldt/docker-simianarmy

jsuh555 commented 7 years ago

This error (401 unauthorized) only occurs if I use IAM roles, but if I use the normal user access key and secret key, there are NO problems.

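I created a basic jclouds project and got the same issue when I used the access key for the role and the normal user id. With the normal user key, listNodes() worked.

    import java.util.Set;
    import org.jclouds.ContextBuilder;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.ComputeServiceContext;
    import org.jclouds.compute.domain.ComputeMetadata;

    // Build an aws-ec2 context from an access key ID and secret key only.
    ComputeServiceContext jcloudsContext = ContextBuilder.newBuilder("aws-ec2")
            .credentials("ASdsdsYdsdsdssdQ", "DdBz/PMcpr6Fkmpsdsdsds0Hxje")
            .buildView(ComputeServiceContext.class);

    ComputeService client = jcloudsContext.getComputeService();
    Set<? extends ComputeMetadata> x = null;
    try {
        x = client.listNodes();
    } catch (Exception e) {
        // Print the cause so the 401 is visible instead of a bare "error".
        System.out.println("error: " + e);
    }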

Maybe a bug in jclouds? Maybe a bug in aws sdk?

VinnieGogniti commented 7 years ago

That's what I think too!

jsuhhome commented 7 years ago

I don't have much time to look into it further, but here are two things:

  1. When using an IAM role, Simian Army needs to pass the access key ID, the secret key and the session token.
  2. From my non-exhaustive look, Simian Army appears to do this correctly. It's just that jclouds isn't sending the session token to Amazon.
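
To make the distinction in the two points above concrete, here is a minimal sketch of a sanity check on a credential triple. AwsCredentialCheck is a hypothetical class, not part of SimianArmy or jclouds, and the key values are placeholders. It relies on the AWS convention that temporary (STS or role) access key IDs begin with "ASIA" while long-term user keys begin with "AKIA".

```java
import java.util.Optional;

// Hypothetical model, not a SimianArmy or jclouds class.
final class AwsCredentialCheck {
    private AwsCredentialCheck() {}

    /** Temporary (STS / IAM role) access key IDs start with "ASIA". */
    static boolean isTemporaryKey(String accessKeyId) {
        return accessKeyId != null && accessKeyId.startsWith("ASIA");
    }

    /** Returns a problem description, or empty if the triple looks usable. */
    static Optional<String> validate(String accessKeyId, String secretKey, String sessionToken) {
        if (accessKeyId == null || accessKeyId.isEmpty()
                || secretKey == null || secretKey.isEmpty()) {
            return Optional.of("access key ID and secret key are both required");
        }
        boolean hasToken = sessionToken != null && !sessionToken.isEmpty();
        if (isTemporaryKey(accessKeyId) && !hasToken) {
            return Optional.of("temporary credentials need a session token");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder values only; real keys must never be hard-coded.
        System.out.println(validate("ASIAEXAMPLEKEYID0000", "secret", null).isPresent());    // true
        System.out.println(validate("ASIAEXAMPLEKEYID0000", "secret", "token").isPresent()); // false
        System.out.println(validate("AKIAEXAMPLEKEYID0000", "secret", null).isPresent());    // false
    }
}
```

A request signed with a role's access key ID and secret key but without the token would be rejected by AWS, which is consistent with the 401 seen only when roles are used.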

darrendao commented 7 years ago

I'm having the same problem, and I think it might have to do with the fact that my Chaos Monkey is in a private subnet and has to go through a proxy to talk to AWS. For those having this problem, is your setup similar?

jsuhhome commented 7 years ago

I didn't use a proxy. Are you using IAM roles or users? It works for me when using users

darrendao commented 7 years ago

Overall, I ran into multiple issues:

  1. The way Chaos Monkey uses jclouds, it doesn't pass in the proxy info. So if Chaos Monkey is running behind a proxy, it times out when using jclouds to query AWS for instances to SSH into. I tried updating Chaos Monkey to pass the proxy info into jclouds but couldn't get it to work.
  2. jclouds doesn't seem to support implicit IAM roles. I ended up having to update client.properties to include the IAM access key and secret key.
  3. The doMonkeyBusiness() method didn't seem to do anything. Same problem here: https://github.com/Netflix/SimianArmy/issues/274. The workaround in that thread works for me.
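
For the proxy point in the list above, one possible direction is to derive jclouds proxy overrides from a proxy URL. This is a sketch under assumptions: JcloudsProxyOverrides is a hypothetical helper, the proxy address is made up, and "jclouds.proxy-host" and "jclouds.proxy-port" are jclouds property names as I understand them. The thread does not show this working in Chaos Monkey.

```java
import java.net.URI;
import java.util.Properties;

// Hypothetical helper: turns a proxy URL into jclouds override properties.
class JcloudsProxyOverrides {
    static Properties fromProxyUrl(String proxyUrl) {
        URI uri = URI.create(proxyUrl);
        if (uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException(
                    "proxy URL needs a host and a port: " + proxyUrl);
        }
        Properties overrides = new Properties();
        // Assumption: these keys are read by the jclouds HTTP layer.
        overrides.setProperty("jclouds.proxy-host", uri.getHost());
        overrides.setProperty("jclouds.proxy-port", Integer.toString(uri.getPort()));
        return overrides;
    }

    public static void main(String[] args) {
        // Made-up proxy address for illustration.
        Properties overrides = fromProxyUrl("http://proxy.example.internal:3128");
        overrides.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting properties would be passed to jclouds in the same place the context is built, for example through ContextBuilder.overrides(...).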

vermapratyush commented 7 years ago

I am facing the same problem. From what I was able to debug, the workaround in #274 excludes the dependency injection library to fix a version mismatch error. This possibly results in no values being injected from the properties file into the org.jclouds library, so it defaults to us-east-1.

ksolie commented 6 years ago

Is there any update on this issue? I am seeing similar failures running in the us-east-1 region, but I believe my failures occur because jclouds doesn't use session tokens.