documentcloud / cloud-crowd

Parallel Processing for the Rest of Us
https://github.com/documentcloud/cloud-crowd/wiki
MIT License

"Unexpected error while processing request: execution expired" and " Unexpected error while processing request: getaddrinfo: Name or service not known" during word_count_example.rb and other jobs #23

Closed · jbfink closed this issue 14 years ago

jbfink commented 14 years ago

Hey folks,

I've got a small cluster of two OS X machines and one Linux box, with another Linux box as the controller, all running Crowd 0.5.0. About half the time when I try to start jobs -- even simple ones like the Shakespeare word count -- the controller box crashes with errors like:

!! Unexpected error while processing request: execution expired
!! Unexpected error while processing request: getaddrinfo: Name or service not known

I stop the controller, rerun crowd load_schema*, and start the controller again -- sometimes this works, sometimes it doesn't. As far as I can tell there's no lingering thin or crowd process running on the controller, so I'm not sure where the problem is coming from.

*Note that I have a MySQL database instance, but have been using the crowd load_schema command to effect a reset of sorts -- if this is the wrong behaviour, please let me know.
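For reference, the two messages correspond to different Ruby exceptions: "execution expired" is a Timeout::Error, while "getaddrinfo: Name or service not known" is a SocketError raised when a hostname fails to resolve. A rough diagnostic sketch, run from the controller box, that separates DNS failures from slow or unreachable nodes -- the hostnames and port below are placeholders, not cloud-crowd defaults:

```ruby
# Rough connectivity check, run from the controller box.
# Hostnames and port are placeholders -- substitute your real node
# hostnames and whatever port your crowd nodes actually listen on.
require 'socket'
require 'net/http'
require 'timeout'

NODES     = %w[osx-node-1 osx-node-2 linux-node-1]
NODE_PORT = 9063

NODES.each do |host|
  begin
    addr = Socket.getaddrinfo(host, nil).first[3]     # DNS lookup
    Timeout.timeout(5) do                             # bound the whole request
      Net::HTTP.get_response(host, '/', NODE_PORT)    # any HTTP response will do
    end
    puts "#{host} (#{addr}): reachable"
  rescue SocketError => e                             # "getaddrinfo: Name or service not known"
    puts "#{host}: DNS failure -- #{e.message}"
  rescue Timeout::Error                               # "execution expired"
    puts "#{host}: resolved, but timed out"
  rescue => e
    puts "#{host}: #{e.class} -- #{e.message}"
  end
end
```

If the SocketError shows up here as well, the problem is name resolution on the controller rather than anything inside cloud-crowd.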

jbfink commented 14 years ago

I should also add that I occasionally (though not always, frustratingly enough) get the "/var/lib/gems/1.8/gems/rest-client-1.5.1/lib/restclient/request.rb:145:in `transmit': RestClient::ServerBrokeConnection" error too.
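RestClient::ServerBrokeConnection is raised when the remote end drops the socket mid-request, which fits an unreliable link rather than a misconfiguration. One possible workaround -- sketched here only, cloud-crowd does not ship anything like this -- is to wrap the offending RestClient calls in a small retry helper:

```ruby
require 'rest_client'

# Generic retry wrapper for transient network failures. Purely illustrative;
# adjust the attempt count, wait time, and rescued exceptions to taste.
def with_retries(attempts = 3, wait = 2)
  tries = 0
  begin
    yield
  rescue RestClient::ServerBrokeConnection, RestClient::RequestTimeout,
         Errno::ECONNRESET, SocketError
    tries += 1
    raise if tries >= attempts   # give up after the last attempt
    sleep wait
    retry
  end
end

# Example usage (placeholder URL):
# with_retries { RestClient.get('http://central-server:9173/') }
```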

jashkenas commented 14 years ago

Sorry I didn't see this ticket until now ... That looks like a connectivity problem, no? Are you running the jobs over wifi or some sort of VPN?

Also, considering that you asked two weeks ago, did you ever get this sorted out?

jbfink commented 14 years ago

Nope, not wifi and not VPN. And no, I didn't get it sorted out either. I did find a gist where someone had the same problem; it might be a Rack issue?

jbfink commented 14 years ago

Although interestingly enough, we do have a crappy network topology that sometimes stalls on transfers of very large files. It never stays stalled, but it does make transferring things over rsync/scp very annoying. Perhaps there's something I can do about cloud-crowd's tolerance of flaky links? Increase a timeout or something?

jashkenas commented 14 years ago

I'm not sure -- we use the RestClient gem to do internal communication between the server and the nodes. Perhaps there's a patch that can be made there -- you can try setting the "open_timeout" option, and see if it helps your issue. I think that the first step would be to reliably reproduce the problem...
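For reference, RestClient's request-level API takes both timeout knobs directly: :open_timeout bounds how long it waits to open the TCP connection, and :timeout bounds how long it waits for a response once connected. A minimal sketch (the URL is a placeholder) showing the options and the exceptions they tend to surface as in the 1.x gems:

```ruby
require 'rest_client'
require 'timeout'

begin
  RestClient::Request.execute(
    :method       => :get,
    :url          => 'http://central-server:9173/',  # placeholder URL
    :open_timeout => 5,    # seconds to wait for the TCP connection to open
    :timeout      => 60    # seconds to wait for a response once connected
  )
rescue RestClient::RequestTimeout, Timeout::Error
  # in the 1.x gems a timeout generally surfaces as one of these
  warn 'request timed out'
rescue SocketError => e
  # DNS failures ("getaddrinfo: Name or service not known") pass through untouched
  warn "could not resolve host: #{e.message}"
end
```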

jbfink commented 14 years ago

Is the open_timeout option in RestClient or somewhere in a cloud-crowd config?

jashkenas commented 14 years ago

It's in RestClient; check out the docs:

http://rdoc.info/rdoc/archiloque/rest-client/blob/6079fb070dc8b7a645dbd806e696c057afab1f5d/RestClient/Resource.html

You'd patch your install of CloudCrowd to set it.
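In practice that means editing wherever your installed copy of cloud-crowd constructs its RestClient::Resource objects and passing the timeout options through. The URL and names below are illustrative only; find the real call sites in the gem source for the version you actually have installed:

```ruby
# Illustrative patch: pass timeouts wherever cloud-crowd creates a
# RestClient::Resource. The URL here is a placeholder; locate the real
# call sites with something like: grep -rn "RestClient::Resource" <gem path>
require 'rest_client'

central = RestClient::Resource.new(
  'http://central-server:9173',   # placeholder for the central server URL
  :open_timeout => 5,             # fail fast if the connection can't be opened
  :timeout      => 120            # allow long-running actions to finish
)

# Requests made through the resource inherit those options:
# central['/some/path'].get
```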