brooklyncentral / brooklyn

This project has moved and is now part of the ASF
https://github.com/apache/incubator-brooklyn

Amazon EC2 "503 request limit exceeded" errors #470

Open ahgittin opened 11 years ago

ahgittin commented 11 years ago

Intermittently, when running against AWS, we get 503 errors back. jclouds correctly retries with back-off logic, but it's too little too late: it eventually gives up, and the deployment fails.

Switching regions sometimes helps (us-east-1 seems the best but not always). But it would be good if we could somehow be less demanding in general.

ahgittin commented 11 years ago

From talking with AWS: they say that we are making hundreds of security group or firewall calls, which their algorithms rate as more suspicious than VM calls. We should zero in on those to see whether we and/or jclouds are doing something wasteful in that area.

ahgittin commented 11 years ago

notes (and handy grep-fu) from Andrea:

If the query is correct, these should be the jclouds actions invoked by brooklyn during the web cluster db example:

    $ cat brooklyn.log | grep ">> \"Action=*" | awk '{ print $9 }' | cut -d'=' -f2 | cut -d'&' -f1| sort | uniq -c
         12 AuthorizeSecurityGroupIngress
          3 CreateSecurityGroup
          3 CreateTags
          2 DeleteKeyPair
          5 DeleteSecurityGroup
         32 DescribeAvailabilityZones
         14 DescribeImages
        124 DescribeInstances
          2 DescribeKeyPairs
          3 DescribePlacementGroups
          4 DescribeRegions
         11 DescribeSecurityGroups
          3 ImportKeyPair
          3 RunInstances
          3 TerminateInstances

where in total, we have

    $ cat brooklyn.log | grep ">> \"Action=*" | awk '{ print $9 }' | cut -d'=' -f2 | cut -d'&' -f1| wc -l
    224
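For anyone who prefers a script to the pipeline, the same tally can be sketched in Python. This is a minimal sketch that assumes each outgoing request appears in the log as a line containing `>> "Action=<Name>&...`, which is what the grep pattern implies; the function name `count_actions` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed wire-log shape: an outgoing request line contains  >> "Action=<Name>&...
ACTION_RE = re.compile(r'>> "Action=([^&"\s]+)')

def count_actions(lines):
    """Return a Counter mapping each EC2 action name to its number of calls."""
    counts = Counter()
    for line in lines:
        match = ACTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical log lines, only to show the shape the regex expects.
sample = [
    'DEBUG [wire] >> "Action=DescribeInstances&Version=x"',
    'DEBUG [wire] << "HTTP/1.1 200 OK"',
    'DEBUG [wire] >> "Action=RunInstances&ImageId=y"',
    'DEBUG [wire] >> "Action=DescribeInstances&Version=x"',
]
for name, n in sorted(count_actions(sample).items()):
    print(n, name)
```

To run it against a real log, pass an open file object, e.g. `count_actions(open("brooklyn.log"))`; summing the counter's values gives the equivalent of `wc -l`.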

I've also asked on jclouds IRC, but there are no updates on that issue, so I've re-opened the same issue: https://github.com/jclouds/jclouds/issues/1214

My wild guess, from reading Eugen's post-mortem analysis, is the following:

brooklyn selects an image using jclouds. The AMI depends on the region, and as long as a "standard" AMI is selected (in my experiments today I've always picked Rightscale's), jclouds is able to connect to the node.

Maybe if another AMI is chosen, jclouds doesn't know the login user (as with private images), and then this 503 error arises.

I'm afraid that we need to fix that in jclouds rather than in brooklyn.

ahgittin commented 11 years ago

with brooklyn-cdh we have:

    $ cat brooklyn-cdh.log | grep ">> \"Action=*" | awk '{ print $9 }' | cut -d'=' -f2 | cut -d'&' -f1| sort | uniq -c
         72 AuthorizeSecurityGroupIngress
          1 CreateKeyPair
          5 CreateSecurityGroup
          5 CreateTags
         48 DescribeAvailabilityZones
         24 DescribeImages
        219 DescribeInstances
          8 DescribeRegions
          8 DescribeSecurityGroups
          4 ImportKeyPair
          5 RunInstances

The 72 security group ingress calls consist of 13 calls per CDH node (4 of them) to open ports there (one port per call), plus a similar number for the manager, plus more per-node calls to open the security group between each node and the manager.

The API allows port ranges but not a set of ports, and our ports are discrete, so we need the ~13 calls. However we could potentially re-use the security group across nodes. Anyone care to spike that? (The difficulty is probably that security group is not a portable concept in jclouds, and possibly not even outwith jclouds.)
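To make the trade-off concrete, here is a rough sketch of how the call count depends on the ports and on group re-use. It assumes one ingress call per contiguous port range per security group; the function names and the port list are hypothetical, not the actual CDH ports.

```python
def to_port_ranges(ports):
    """Collapse a collection of ports into sorted (from_port, to_port) ranges."""
    ranges = []
    for port in sorted(set(ports)):
        if ranges and port == ranges[-1][1] + 1:
            # Adjacent to the previous range: extend it instead of starting a new one.
            ranges[-1] = (ranges[-1][0], port)
        else:
            ranges.append((port, port))
    return ranges

def ingress_calls(num_groups, ports):
    """Assumes one ingress call per port range, repeated for every security group."""
    return num_groups * len(to_port_ranges(ports))

# Hypothetical port list, for illustration only.
example_ports = [22, 80, 443, 8080, 8081, 8082]
print(to_port_ranges(example_ports))
print(ingress_calls(4, example_ports), "calls with one group per node (4 nodes)")
print(ingress_calls(1, example_ports), "calls with one shared group")
```

Range collapsing only helps where ports happen to be adjacent, so with mostly discrete ports the bigger saving would come from sharing one group across nodes.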

aledsage commented 11 years ago

Conversation with AWS about this is continuing.

Long-term, it feels like the best place to do exponential backoff / retry is in jclouds. Currently in jclouds, CreateSecurityGroupIfNeeded.createSecurityGroupInRegion does a tight loop over the ports, calling securityClient.authorizeSecurityGroupIngressInRegion for each. Doing the retry at the level of SecurityGroupClient or SecurityGroupAsyncClient might make sense.
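As an illustration of the retry idea (not jclouds code), a wrapper around a single call could look roughly like the sketch below. Everything here is an assumption: `RequestLimitExceeded` is a hypothetical stand-in for the 503 response, and the delays and attempt count are arbitrary defaults.

```python
import random
import time

class RequestLimitExceeded(Exception):
    """Hypothetical stand-in for a 503 'request limit exceeded' response."""

def call_with_backoff(call, max_attempts=6, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(), retrying on RequestLimitExceeded with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Exponential growth, capped, then scaled by a random factor (jitter)
            # so that parallel callers do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
```

The `sleep` and `rng` parameters are injectable only so the behaviour can be checked without real waiting. Wrapping each per-port ingress call this way would spread the tight loop out under throttling instead of failing the whole deployment.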

I'll ask on jclouds IRC once the US comes on line, and try to figure out how much work this would be.

aledsage commented 11 years ago

The AuthorizeSecurityGroupIngress command can take non-contiguous ports (see http://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-AuthorizeSecurityGroupIngress.html, where you use IpPermissions.n for different values of n).

In jclouds 1.6, there are big improvements to securityClient.authorizeSecurityGroupIngressInRegion so that it can take an Iterable<IpPermission> and set up the security group's ingress rules with a single call. I expect that switching to that would solve our problems!
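For reference, a single batched request might carry parameters along the lines of the sketch below. This builds the query parameters by hand rather than going through jclouds; the parameter names follow my reading of the EC2 query API linked above and should be checked against it, and the helper name `authorize_ingress_params` is made up.

```python
def authorize_ingress_params(group_name, ports, cidr, protocol="tcp"):
    """Build query parameters for ONE AuthorizeSecurityGroupIngress request
    that opens several non-contiguous ports (one IpPermissions.n entry each)."""
    params = {
        "Action": "AuthorizeSecurityGroupIngress",
        "GroupName": group_name,
    }
    for n, port in enumerate(sorted(set(ports)), start=1):
        prefix = "IpPermissions.%d" % n
        params[prefix + ".IpProtocol"] = protocol
        params[prefix + ".FromPort"] = str(port)
        params[prefix + ".ToPort"] = str(port)
        params[prefix + ".IpRanges.1.CidrIp"] = cidr
    return params

# Hypothetical group name, ports and CIDR, for illustration only.
for key, value in sorted(authorize_ingress_params("example-group", [22, 80, 8080], "10.0.0.0/8").items()):
    print(key, "=", value)
```

With 13 discrete ports this would be one request with 13 `IpPermissions.n` entries instead of 13 separate requests.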

On IRC, Adrian also reports for jclouds: there are a couple more things to do, but in the next week or so there should be a straightforward way to slow down calls based on command name.
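The jclouds mechanism isn't described here, so the following is only a sketch of the general idea of slowing calls by command name: enforce a minimum interval between two calls of the same command. The class name, the interval values and the single-threaded design are all assumptions.

```python
import time

class CommandThrottle:
    """Enforce a minimum interval between calls of the same command name.
    Single-threaded sketch; a real client would need locking."""

    def __init__(self, min_intervals, clock=time.monotonic, sleep=time.sleep):
        self._min_intervals = dict(min_intervals)  # command name -> seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = {}

    def wait(self, command):
        """Block until `command` may be issued, then record the call time."""
        interval = self._min_intervals.get(command, 0.0)
        now = self._clock()
        last = self._last_call.get(command)
        if last is not None and interval > 0:
            remaining = last + interval - now
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call[command] = now
```

Usage would be to call `throttle.wait("AuthorizeSecurityGroupIngress")` before each such request, with commands that are absent from the interval map passing through unthrottled.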