amazon-archives / aws-flow-ruby


Communication failures with SWF server cause several issues #4

Closed BiggerNoise closed 11 years ago

BiggerNoise commented 11 years ago

There's an exhaustive (and exhausting) discussion of this on the dev forum: https://forums.aws.amazon.com/thread.jspa?messageID=484089

Essentially, if there are any communication issues with the SWF servers, several issues can show up:

The third has been causing us the most grief. Another user tracked it down to the task poller: if there is a communication issue while telling the server that a task has completed, the poller rescues the error and then marks the task as failed.

I am not sure what causes the communication to be so flaky, but it definitely shows up across machines and from different ISPs.

I am going to be putting together a pull request to implement the behavior that a worker should retry its communications until its timeout expires. A comm failure should never be interpreted as a task failure.
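
A rough sketch of the shape I have in mind (all names here are hypothetical, not the actual aws-flow internals): the completion report gets wrapped in a retry loop that only gives up once the task's own timeout has passed.

require 'socket'
require 'timeout'

# Hypothetical sketch only -- not the real task poller code. The idea: keep
# retrying the completion report on communication errors until the task's own
# timeout has passed, instead of rescuing once and marking the task as failed.
COMM_ERRORS = [Timeout::Error, Errno::ECONNRESET, Errno::ETIMEDOUT, SocketError]

def with_comm_retries(deadline_seconds, base_delay = 1)
  deadline = Time.now + deadline_seconds
  begin
    yield
  rescue *COMM_ERRORS
    raise if Time.now >= deadline            # out of time; let the caller decide
    sleep [base_delay, deadline - Time.now].min
    base_delay *= 2                          # simple exponential backoff
    retry
  end
end

# e.g. with_comm_retries(task_timeout) { service.respond_activity_task_completed(...) }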

At any rate, if there's anything that you would like me to pay extra attention to, please let me know.

BiggerNoise commented 11 years ago

Here's a gist (https://gist.github.com/BiggerNoise/6442159) of where I think the execute method should be heading.

Unfortunately, I am having a very hard time writing a worthwhile test for this. I was hoping that I could mock the communication and then execute a single activity and verify that it retries the communication. However, as I try to get this working, I have that sinking feeling that I am mocking way too much.

Add to that the fact that the service is being recreated during process_single_task, and I am thinking a plan B might be warranted, but I am not sure what it would consist of.
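
For concreteness, the shape of test I was attempting looks roughly like this (hypothetical names throughout, and it leans on the with_comm_retries sketch from my earlier comment):

# Hypothetical RSpec sketch: the service double fails once with a network
# error and then succeeds, and the retry wrapper is expected to retry rather
# than report the activity as failed. Assumes the with_comm_retries helper
# from the earlier sketch is loaded; run with the rspec command.
require 'rspec'

RSpec.describe 'retrying completion reports' do
  it 'retries the completion call after a communication failure' do
    calls = 0
    service = double('swf service')
    allow(service).to receive(:respond_activity_task_completed) do
      calls += 1
      raise Errno::ECONNRESET if calls == 1   # first attempt: comm failure
      :ok                                     # second attempt: success
    end

    result = with_comm_retries(30) { service.respond_activity_task_completed }

    expect(result).to eq(:ok)
    expect(calls).to eq(2)
  end
end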

BiggerNoise commented 11 years ago

Saw a new failure mode today, the communications failed when starting a new workflow.

I just wanted to throw out there that this is going to make it impossible to use the ruby-flow framework in our project. We'd really like to use it, but we need to get past these issues first.

We have tried different ISPs and different computers, let me throw out a few more things and see if any of them raise an eyebrow:

- The activities live in a Rails application (3.2.14)
- The Rails app uses MongoDB for its persistence
- We're running OSX (10.7 & 10.8)

mjsteger commented 11 years ago

Hey BiggerNoise,

I'm typing up a bigger, more thorough response, but as a quick stopgap measure to try to help: have you attempted to change the max retries on the aws-sdk? Since these are communication issues, and all requests are mediated through the aws-sdk, this might help.
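
For reference, the option I mean is :max_retries; something like the following (the value is just an example to experiment with):

require 'aws-sdk'

# Raise the SDK-level retry count for retryable/communication errors.
# The value here is only an example; experiment with what works for you.
AWS.config(:max_retries => 6)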

BiggerNoise commented 11 years ago

I'm re-running our job now. I will let you know if that helps

mjsteger commented 11 years ago

Other questions:

Can you try running the workflow code outside of a rails context?

Are you using the ruby stdlib Timeout::timeout at all? Looking at this blog post, it looks like that could explain the "execution expired" message (as well as the problem popping up in multiple locations).
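
To illustrate why I ask (a contrived sketch, not your code): a Timeout::timeout wrapped anywhere around the calling code can interrupt a slow Net::HTTP call and surface as "execution expired" from deep inside net/http, far from the wrapping block.

require 'net/http'
require 'timeout'

# Contrived example: if the request takes longer than the surrounding timeout,
# Timeout::Error ("execution expired") is raised from whatever line net/http
# happens to be executing at that moment, so the backtrace points at library
# code rather than at the wrapping block.
Timeout.timeout(1) do
  Net::HTTP.get(URI('http://example.com/'))
end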

BiggerNoise commented 11 years ago

I don't think that we are making any use of Timeout::timeout. Unfortunately, moving my code outside of Rails is not very feasible (it uses a bunch of the models and other support classes).

However, touch wood, upping the retries seems to have helped considerably. I didn't bother with that before because I thought the default was three, but I set it to four and I haven't seen the polling issue since.

I even managed to get to a point where my code was the thing breaking. I will reply back tomorrow and let you know if it seems to be holding together.

Thanks, Andy

BiggerNoise commented 11 years ago

Mixed bag of results. I was able to get a job to complete last night after I upped the retries.

However, come this morning, I am getting exceptions in my workflow worker. So I don't know whether the communication issues are really solved, or whether it was just that I was running at night when things might have been a bit quieter.

BiggerNoise commented 11 years ago

OK. This is repeatable with just the sample application. Here is the entire backtrace. This happened about ten minutes after starting the application:

andy@Andy-MBP:lib $ruby deployment_workflow.rb
/Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:763:in `initialize': execution expired (Timeout::Error)
    from /Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:763:in `open'
    from /Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:763:in `block in connect'
    from /Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:763:in `connect'
    from /Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:756:in `do_start'
    from /Users/andy/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/net/http.rb:751:in `start'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/http/connection_pool.rb:301:in `start_session'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/http/connection_pool.rb:125:in `session_for'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/http/net_http_handler.rb:55:in `handle'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:244:in `block in make_sync_request'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:280:in `retry_server_errors'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:240:in `make_sync_request'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:502:in `block (2 levels) in client_request'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:382:in `log_client_request'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:468:in `block in client_request'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:364:in `return_or_raise'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/core/client.rb:467:in `client_request'
    from (eval):3:in `poll_for_decision_task'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-sdk-1.16.1/lib/aws/simple_workflow/decision_task_collection.rb:172:in `poll_for_single_task'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/task_poller.rb:54:in `get_decision_tasks'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/task_poller.rb:61:in `poll_and_process_single_task'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/worker.rb:200:in `run_once'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/worker.rb:186:in `block in start'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/worker.rb:185:in `loop'
    from /Users/andy/.rvm/gems/ruby-1.9.3-p448/gems/aws-flow-1.0.2/lib/aws/decider/worker.rb:185:in `start'
    from deployment_workflow.rb:172:in `<main>'

mjsteger commented 11 years ago

I tried to reproduce your error in the following two ways (running ruby-1.9.3-p448 on an EC2 micro):

1. Running "ruby deployment_workflow.rb", then starting a single workflow_starter and activity worker in the background, and waiting more than 15 minutes.

2. Running "ruby deployment_workflow.rb", then running an activity worker and enough workflow_starters in the background to keep the worker saturated.

In both cases, I was unable to get the Timeout::Error you encountered.

From your stack trace, the error originates from "http start" in the aws-sdk. Looking at ruby/1.9.1/net/http.rb:763, it appears that the open_timeout is firing. This timeout is set to 15 seconds by default in the aws-sdk.

The http_open_timeout fires "If the HTTP object cannot open a connection in [15] seconds".

Here are the options we have considered:

You can check how many TCP sockets you have open with:

netstat -at | wc -l

One quick mitigation you can try while researching why you are getting this error, if your application allows, is to increase your http_open_timeout. You can do so as follows:

AWS.config(:http_open_timeout => $APPROPRIATE_VALUE_HIGHER_THAN_15)

Though this is at best a temporary fix - we need to find out why the timeout is getting exceeded.

I'd recommend setting :http_wire_trace to true in order to drill down further:

AWS.config(:http_wire_trace => true)

You can set :logger if you wish the output to go somewhere other than stdout.
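
For example, something along these lines sends the wire trace to a log file instead of stdout:

require 'aws-sdk'
require 'logger'

# Send the wire trace (and other SDK logging) to a file instead of stdout.
# The file name is just an example.
AWS.config(:http_wire_trace => true, :logger => Logger.new('aws-wire-trace.log'))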

BiggerNoise commented 11 years ago

We are seeing anything from 500-750 for the socket count (polling all the macs in the dev group). Nobody is having any difficulty doing any other activity that requires opening sockets.

I'm also not sure why you think that a machine would use fewer sockets at night. That seems reasonable for a server, but our development laptops pretty much stay in the same state regardless of the time of day.

One thing I am curious about: I added the wire trace to the sample and started it up, and it has not crashed. Curious, I then started our ingress job, which now seems to be running. Did you guys make a change on your server side?

BiggerNoise commented 11 years ago

I did want to follow up again. I have been hammering at this thing all day and have not had a single timeout-related issue.

I'd definitely feel better about this if you told me that Amazon had made some tweak that might have affected SWF communications.

mjsteger commented 11 years ago

We are not aware of any changes on our end that would have affected SWF communications. We also double-checked, and there were no ongoing issues with the service in the last few days. One thought that comes to mind is that you have changed a variable by enabling logging: will removing the logging cause the sample to crash again? (It would admittedly be a very troubling bug, but I'm at a bit of a loss to otherwise explain why everything is now working when it was not yesterday.)

I'm not sure how we can help you further without a reproduction case. If you run into this again, feel free to re-open with a reproduction. Otherwise, I think it makes sense to resolve for now.

BiggerNoise commented 11 years ago

Our ingress code did not have the logging enabled, so nothing changed between completely unusable on Thursday and rock solid on Friday. Oh, how I love software development.

I will report back if I encounter this again.

BiggerNoise commented 11 years ago

This is still an issue. The problem definitely seems to follow business hours: it ran fine all weekend and then started failing Monday morning around 10:30.

I went ahead and increased the connect timeout to 20 seconds and the retry count to 9. I am seeing the software trying to establish a connection with swf.us-east-1.amazonaws.com, and that initial handshake is taking over a minute:

This is with the distribution sample app, options modified as such:

AWS.config(YAML.load(config_file).merge(http_wire_trace: true, http_open_timeout: 20, max_retries: 9))
andy@Andy-MBP:lib $ruby deployment_workflow.rb
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.RegisterDomain\r\nContent-Length: 74\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T181118Z\r\nX-Amz-Content-Sha256: eexxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxee646653e1a1d1\r\nAuthorization: AWS4-HMAC-SHA256 Credential=AxxxxxxxxxxxxxxxxxxA/20130909/us-east-1/swf/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-target, Signature=5cxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3\r\nAccept: */*\r\n\r\n"
<- "{\"name\":\"DEPLOYMENT_DOMAIN\",\"workflowExecutionRetentionPeriodInDays\":\"10\"}"
-> "HTTP/1.1 400 Bad Request\r\n"
-> "x-amzn-RequestId: 39c106e3-197b-11e3-aa7d-9996ddbddd5b\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 96\r\n"
-> "\r\n"
reading 96 bytes...
-> "{\"__type\":\"com.amazonaws.swf.base.model#DomainAlreadyExistsFault\",\"message\":\"DEPLOYMENT_DOMAIN\"}"
read 96 bytes
Conn keep-alive

FWIW, here is the trace route from our office to swf.us-east-1:

andy@Andy-MBP:~ $traceroute swf.us-east-1.amazonaws.com
traceroute to swf.us-east-1.amazonaws.com (205.251.242.13), 64 hops max, 52 byte packets
 1  192.168.1.1 (192.168.1.1)  1.046 ms  0.845 ms  0.808 ms
 2  rrcs-97-77-64-1.sw.biz.rr.com (97.77.64.1)  42.725 ms
    tge7-2.crtntxjt01h.texas.rr.com (24.164.209.33)  10.185 ms  11.824 ms
 3  tge0-9-0-12.crtntxjt01r.texas.rr.com (24.175.50.108)  14.814 ms  16.135 ms  16.234 ms
 4  agg21.dllatxl3-cr01.texas.rr.com (24.175.49.0)  22.377 ms  12.716 ms  32.296 ms
 5  107.14.17.136 (107.14.17.136)  14.204 ms  15.361 ms  15.703 ms
 6  ae1.pr1.dfw10.tbone.rr.com (107.14.17.234)  12.888 ms  11.536 ms
    ae0.pr1.dfw10.tbone.rr.com (107.14.17.232)  13.850 ms
 7  dls-bb1-link.telia.net (213.248.89.197)  10.506 ms
    dls-bb1-link.telia.net (213.248.99.213)  76.522 ms
    dls-bb1-link.telia.net (213.248.66.69)  11.550 ms
 8  ash-bb4-link.telia.net (213.155.133.178)  82.920 ms  43.895 ms  47.272 ms
 9  ash-b1-link.telia.net (80.91.248.161)  93.566 ms
    ash-b1-link.telia.net (213.155.130.59)  140.742 ms  40.417 ms
10  vadata-ic-157233-ash-bb1.c.telia.net (62.115.9.70)  47.005 ms
    vadata-ic-157229-ash-bb1.c.telia.net (80.239.193.210)  46.109 ms
    vadata-ic-157233-ash-bb1.c.telia.net (62.115.9.70)  46.091 ms
11  205.251.245.1 (205.251.245.1)  73.201 ms
    205.251.245.5 (205.251.245.5)  56.799 ms
    205.251.245.1 (205.251.245.1)  46.518 ms
12  205.251.245.123 (205.251.245.123)  40.825 ms  46.847 ms  51.975 ms
13  * * *

BiggerNoise commented 11 years ago

One more dump.

This is the same sample app. You can see how the number of attempts required to establish a connection keeps going up and down.

Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191246Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: cf7295e1-1983-11e3-b1e7-239432adb8ec\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191347Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: f3ed44c9-1983-11e3-83a8-cda45b58a7b3\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191448Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: 182768cb-1984-11e3-a96b-fb83864e2366\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191549Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: 3c8567ff-1984-11e3-8c2b-d3905e8b42a4\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191650Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: 60e5ff2a-1984-11e3-8255-c54ee713cd2c\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T191915Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"
-> "HTTP/1.1 200 OK\r\n"
-> "x-amzn-RequestId: b7a8b10f-1984-11e3-8032-71233513dacd\r\n"
-> "Content-Type: application/x-amz-json-1.0\r\n"
-> "Content-Length: 47\r\n"
-> "\r\n"
reading 47 bytes...
-> "{\"previousStartedEventId\":0,\"startedEventId\":0}"
read 47 bytes
Conn keep-alive
opening connection to swf.us-east-1.amazonaws.com...
opened
<- "POST / HTTP/1.1\r\nContent-Type: application/x-amz-json-1.0\r\nAccept-Encoding: \r\nX-Amz-Target: SimpleWorkflowService.PollForDecisionTask\r\nContent-Length: 160\r\nUser-Agent: ruby-flow aws-sdk-ruby/1.16.1 ruby/1.9.3 x86_64-darwin11.4.2\r\nHost: swf.us-east-1.amazonaws.com\r\nX-Amz-Date: 20130909T192016Z"
<- "{\"domain\":\"DEPLOYMENT_DOMAIN\",\"identity\":\"Andy-MBP.local:30005\",\"taskList\":{\"name\":\"deployment_workflow_task_list\"},\"maximumPageSize\":1000,\"reverseOrder\":false}"

mjsteger commented 11 years ago

We'll need a more concrete reproduction case to help us move forward with this issue. It seems you are having trouble running our samples, which we run on a regular basis without any issues. I suspect there is some confounding variable that is not yet visible.

Can you try running the sample on a clean machine outside of your current network (an EC2 instance?) and see if you get the same problem? As I mentioned previously, I tried reproducing on a micro with the two configurations I guessed might be causing you the error, and did not get errors from either. That way we'll be able to determine whether your local network or machine is the problematic variable.

BiggerNoise commented 11 years ago

The culprit, near as I can tell, is Mac OS. We had tried several different machines on a combination of Time Warner Business Cable (the office), and Verizon FiOS (what every dev has at home). We saw the problem everywhere.

However, I installed a fresh Ubuntu 13 onto a spare laptop at home, and it seems to have run all day without ever having to retry. Unfortunately, I forgot to run the test on the Mac at the same time, but it certainly seems that the issue is in a pretty aluminum case.

For our development purposes, increasing the number of retries seems to have addressed the issue. We don't expect this to be an issue at all in production (linux and running in the amazon network).
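
Concretely, the settings we're running with right now are the same ones shown in the trace above (illustrative values; adjust to taste):

require 'aws-sdk'

# The combination that made things workable for us on OSX: a longer connect
# timeout plus more SDK-level retries (same values as in the earlier trace).
AWS.config(:http_open_timeout => 20, :max_retries => 9)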

If there's anything I can do to firm up advice for mac users, please let me know. I will let you know if I do encounter any further issues with the Linux box, but for now, I think we can close this one.

mjsteger commented 11 years ago

That's really odd; my dev laptop is OSX (10.7), so I develop on it a lot and frequently run tests/samples on it. Glad to hear it won't be an issue in production, but it's troubling that the root cause is still unclear. If other people are having problems similar to this on OSX, please let me know!