amazon-archives / aws-flow-ruby

ARCHIVED

Starting worker with should_register set to true causes ThrottlingExceptions #82

Open Tsquare opened 9 years ago

Tsquare commented 9 years ago

We have tens of workflow types and hundreds of activity types. If we try to register them on activity/workflow worker startup, the worker sends one request per type to SWF in rapid succession, causing:

AWS::SimpleWorkflow::Errors::ThrottlingException Rate exceeded

There should be a way to avoid registering already-registered types (which is most of them).

pmohan6 commented 9 years ago

Flow could potentially do this in 3 ways:

1) Try to register the workflow/activity type and, if it fails with TypeAlreadyExists, continue on (this is what Flow does currently).
2) Call describe_workflow_type/describe_activity_type on each type to see whether it exists, and register it only if it doesn't. This doesn't really solve the problem because it just shifts the load from Register to Describe.
3) Call list_workflow_types/list_activity_types on each domain. While this could potentially reduce the number of calls, that is not guaranteed, since List is a best-effort call. Moreover, if the domain already has a lot of types, the call has to page through to get the entire list. And if you have a lot of workers starting up, you are likely to get throttled on this too.
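To make option 1 concrete, here is a minimal sketch of the register-and-rescue pattern. It does not use the real SDK: `FakeSwfClient`, its keyword arguments, and the `TypeAlreadyExistsFault` class defined here are hypothetical stand-ins so the example runs on its own. The point it illustrates is that a request is sent for every type, even when the type already exists.

```ruby
# Stand-in error class so the sketch runs without the aws-sdk gem (hypothetical).
class TypeAlreadyExistsFault < StandardError; end

# Hypothetical in-memory client that mimics a register call and counts requests.
class FakeSwfClient
  attr_reader :calls

  def initialize(existing)
    @existing = existing.dup
    @calls = 0
  end

  def register_activity_type(name:, version:)
    @calls += 1
    key = [name, version]
    raise TypeAlreadyExistsFault, "#{name} #{version}" if @existing.include?(key)
    @existing << key
  end
end

# Option 1: always attempt to register; treat "already exists" as success.
def register_all(client, types)
  types.each_with_object([]) do |(name, version), newly_registered|
    begin
      client.register_activity_type(name: name, version: version)
      newly_registered << [name, version]
    rescue TypeAlreadyExistsFault
      # Already registered: nothing to do, but the request was still sent.
    end
  end
end

client = FakeSwfClient.new([["resize", "1.0"], ["upload", "1.0"]])
created = register_all(client, [["resize", "1.0"], ["upload", "1.0"], ["notify", "1.0"]])
puts "requests sent: #{client.calls}, newly registered: #{created.length}"
```

With two of the three types already present, the sketch still sends three requests, which is the behavior that leads to throttling when there are hundreds of types.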

We currently recommend that users start their workers in a staggered way instead of all at once.
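One way to stagger startup is to give each worker its own delay before it starts. The helper below is only a sketch: `stagger_delays`, the 20-second window, and the optional jitter are illustrative assumptions, not part of Flow.

```ruby
# Spread `worker_count` start times evenly across `window_seconds`, optionally
# adding random jitter so repeated restarts do not line up exactly.
def stagger_delays(worker_count, window_seconds, jitter: 0.0, rng: Random.new)
  return [] if worker_count <= 0
  step = window_seconds.to_f / worker_count
  Array.new(worker_count) { |i| i * step + rng.rand * jitter }
end

delays = stagger_delays(4, 20)
delays.each_with_index do |delay, i|
  # A real launcher would sleep(delay) before starting worker i (assumption).
  puts "worker #{i} starts after #{delay.round(1)}s"
end
```

Staggering spreads requests across workers, but each worker still sends its own burst of register requests at startup.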

You can also request a limit increase from SWF here.

Tsquare commented 9 years ago

Can't you fetch all the known activity types using a single call?

http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/SimpleWorkflow/Domain.html#activity_types-instance_method

pmohan6 commented 9 years ago

Domain#activity_types is just a Ruby SDK abstraction that sits on top of the SWF client. It calls #list_activity_types internally and pages through all the results to return the entire list.
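The paging behavior described here can be sketched as a loop that follows a page token until none is returned. `FakePagedClient`, its response keys, and the page size of 100 are assumptions made so the example is self-contained; the real client's parameters and response shape may differ.

```ruby
# Hypothetical in-memory stand-in for a paged list call.
class FakePagedClient
  attr_reader :calls

  def initialize(type_names, page_size: 100)
    @pages = type_names.each_slice(page_size).to_a
    @calls = 0
  end

  # Returns one page plus a token for the next page (nil on the last page).
  def list_activity_types(next_page_token: nil)
    @calls += 1
    index = next_page_token || 0
    more = index + 1 < @pages.length
    { type_names: @pages[index] || [], next_page_token: (more ? index + 1 : nil) }
  end
end

# Follow page tokens until the listing is exhausted.
def all_activity_types(client)
  names = []
  token = nil
  loop do
    page = client.list_activity_types(next_page_token: token)
    names.concat(page[:type_names])
    token = page[:next_page_token]
    break if token.nil?
  end
  names
end

client = FakePagedClient.new((1..250).map { |i| "activity_#{i}" })
names = all_activity_types(client)
puts "#{names.length} types fetched in #{client.calls} list calls"
```

A single call to the abstraction can therefore translate into several requests, one per page.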

Tsquare commented 9 years ago

My reading of http://docs.aws.amazon.com/amazonswf/latest/apireference/API_ListActivityTypes.html is that it returns all the types if you don't specify a name. Is that wrong?

pmohan6 commented 9 years ago

Right, it returns all the types if you don't specify a name. What I meant to say was that if the list is large enough, it will make multiple calls to page through the entire list. List is also a slower and potentially more expensive call than Register. It is difficult to predict which one will cost customers less, since it depends on the usage scenario. In our experience, the register approach seems to come out cheaper.

Tsquare commented 9 years ago

But it sounds like each page returns up to 100 types -- so instead of 100 register calls, you'd have a single list call, followed by checking which types are still unregistered and only registering those (and so on for each page). In terms of the number of calls, that seems like a large savings in the typical case (in which a worker restarts and most of the types are already registered).
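The arithmetic behind this argument can be sketched as follows. The helper names are hypothetical and the page size of 100 is the figure assumed in the comment above; this only counts requests and says nothing about the relative cost or latency of List versus Register.

```ruby
require "set"

# Requests needed when every wanted type is registered unconditionally.
def register_only_calls(wanted)
  wanted.size
end

# Requests needed when the worker lists the domain first and registers only
# what is missing. The number of list calls depends on how many types the
# domain holds, not on how many this worker wants.
def list_then_register_calls(wanted, already_registered, page_size: 100)
  list_calls = [(already_registered.size.to_f / page_size).ceil, 1].max
  missing = wanted.to_set - already_registered.to_set
  list_calls + missing.size
end

registered = (1..300).map { |i| "type_#{i}" }
wanted = registered.first(295) + %w[new_a new_b new_c new_d new_e]

puts "register everything: #{register_only_calls(wanted)} calls"
puts "list, then register: #{list_then_register_calls(wanted, registered)} calls"
```

In this made-up restart scenario (300 types in the domain, 5 of the worker's 300 types new), listing first needs 3 list calls plus 5 register calls instead of 300 register calls. The saving shrinks if the domain holds many types the worker does not care about, since those still have to be paged through.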

pmohan6 commented 9 years ago

We will explore using List before trying to register and look at the cost difference between the two methods.

pmohan6 commented 9 years ago

We have changed the implementation of the runner so that only the first worker in each set of workers registers types. This should reduce the number of register calls by a factor of number_of_workers if you are using the runner to start your workers.
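The idea can be illustrated with a small sketch. This is not the runner's actual code (see the linked commit for that): `FakeWorker` and `build_worker_set` are hypothetical, and `should_register` is borrowed from the option named in the issue title.

```ruby
# Hypothetical worker stand-in: records whether it was asked to register.
FakeWorker = Struct.new(:task_list, :should_register)

# Build `count` workers for one task list; only the first one registers types.
def build_worker_set(task_list, count)
  Array.new(count) { |index| FakeWorker.new(task_list, index.zero?) }
end

worker_sets = {
  "activity_tasks" => build_worker_set("activity_tasks", 5),
  "workflow_tasks" => build_worker_set("workflow_tasks", 3)
}

worker_sets.each do |name, workers|
  registering = workers.count(&:should_register)
  puts "#{name}: #{workers.length} workers, #{registering} registering"
end
```

Note that the one registering worker in each set still sends one request per type, so the burst per set is unchanged; only the multiplication by the number of workers goes away.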

Closing the issue for now but please feel free to reopen if you think this doesn't solve the issue. Thanks!

https://github.com/aws/aws-flow-ruby/commit/396dd80ee0864582dcb4daf2fd66da9c986d5708

Tsquare commented 9 years ago

We've already implemented this fix ourselves, and unfortunately it doesn't resolve the issue. (I can't reopen since I'm not a collaborator on this repo.)