brandonhilkert / sucker_punch

Sucker Punch is a Ruby asynchronous processing library using concurrent-ruby, heavily influenced by Sidekiq and girl_friday.
MIT License

SSL_connect and process hangs #180

Closed. atomical closed this issue 8 years ago

atomical commented 8 years ago

Hi,

I'm experiencing an issue where the last log message I see from a thread is:

[2016-07-17T13:17:49.505 ERROR (15967) #] Gibbon::MailChimpError: #<Gibbon::MailChimpError: SSL_connect SYSCALL returned=5 errno=0 state=SSLv2/v3 read server hello A @title=nil, @detail=nil, @body=nil, @raw_body=nil, @status_code=nil>

After that the process hangs and consumes 80% of the memory available.

From strace:

[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
[pid 15968] poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)

For some reason this happens at a specific point in the execution, once the script has been running for 12+ hours. I'm wondering if the use of Timeout in one of the underlying libraries could be the cause. I'm wrapping the worker in a begin/rescue block with retry. Eventually the logger in my rescue block stops logging. I'm thinking there must be something going on when trying to connect with SSL. Any ideas?

I'm also thinking something may have happened with the connection to the database (through ActiveRecord), because I see attempts to write UPDATE statements, but they don't appear to be succeeding.
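
For reference, here is a minimal sketch of a begin/rescue with retry that gives up after a fixed number of attempts and backs off between them. The original worker code isn't shown in this thread, so `with_bounded_retry`, `max_attempts`, and `base_delay` are hypothetical names, and this isn't a confirmed fix for the hang. It only shows one way to keep a retry loop from spinning indefinitely when a call keeps failing.

```ruby
# Hypothetical helper (not part of sucker_punch or gibbon): runs the block,
# retrying on StandardError up to max_attempts times with exponential backoff.
def with_bounded_retry(max_attempts: 3, base_delay: 0.5, logger: nil)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    logger&.error("attempt #{attempts} failed: #{e.class}: #{e.message}")
    # Re-raise once the budget is spent so the failure stays visible.
    raise if attempts >= max_attempts
    # Back off: base_delay, then 2x, then 4x, ...
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end
```

Inside a worker's `perform`, the external API call would go in the block, for example `with_bounded_retry(logger: some_logger) { call_external_api }`, where `some_logger` and `call_external_api` are placeholders.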

brandonhilkert commented 8 years ago

A few things come to mind. One is how MRI (if you're using MRI) handles DNS; take a look at Sidekiq's FAQ about process hangs.

Also, I don't know what HTTP library gibbon uses under the hood, but open and read timeouts should be specified, b/c otherwise requests could hang forever. Looks like gibbon's timeout is 30 sec by default. Perhaps you could try something smaller and see if the behavior changes at all.
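
To illustrate what explicit open/read timeouts look like, here is a rough sketch using Ruby's standard `Net::HTTP`. It doesn't show how gibbon configures its own client (check gibbon's README for its timeout options). `build_http`, the example URL, and the 5s/10s values are assumptions for illustration only.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds a Net::HTTP client with explicit, short timeouts.
# Nothing is sent over the network until a request is actually made.
def build_http(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds to wait for the connection to open
  http.read_timeout = read_timeout # seconds to wait for each read from the socket
  http
end

# Placeholder URL, not the real MailChimp endpoint.
http = build_http("https://api.example.com/")
```

With timeouts set this way, a stalled connection should raise `Net::OpenTimeout` or `Net::ReadTimeout` instead of blocking the thread indefinitely, and the worker can rescue and log those.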

Lastly, your reference to the Timeout module could be relevant. Here's some more info on why not to use it. I didn't see it referenced in the gibbon gem, but that doesn't mean it's not in its dependencies.