alexdalitz / dnsruby

Dnsruby is a feature-complete DNS(SEC) client for Ruby, as used by many of the world's largest DNS registries and the OpenDNSSEC project

Stuck in recursor #171

Closed seandilda closed 2 years ago

seandilda commented 3 years ago

I've been using Dnsruby::Recursor for years now without issue. I've recently started seeing issues where my process using Dnsruby::Recursor gets stuck.

After it's stuck, I see it has printed out the following:

#<Thread:0x00007efe71800cc0 /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:151 run> terminated with exception (report_on_exception is true):
/bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/resolver.rb:1081:in `block in handle_queue_event': undefined method `log_and_raise' for #<Dnsruby::ResolverRuby:0x00007efe727b1930> (NoMethodError)
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/resolver.rb:1071:in `synchronize'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/resolver.rb:1071:in `handle_queue_event'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:795:in `notify_queue_observers'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:631:in `send_exception_to_client'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:409:in `block in process_timeouts'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:407:in `each'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:407:in `process_timeouts'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:182:in `do_select'
    from /bundle/ruby/2.7.0/gems/dnsruby-1.61.5/lib/dnsruby/select_thread.rb:152:in `block in check_select_thread_synchronized'

That is the entirety of the message and the process seems stuck after this.

While the process was stuck, I ran netstat in the container and noticed several open UDP sockets to port 53 on nine external servers. Four of those connections had 2176 in the Recv-Q; all the others had 0.

alexdalitz commented 3 years ago

Hi Sean -

Thanks for your message!

I’ve pushed a change to the master branch of git which should get the log_and_raise method to work - sorry about that.

It sounds like something quite unusual is going on. Could you please describe your application a little more?

Thanks,

Alex.


alexdalitz commented 3 years ago

Going back through the code, it looks like this log-as-error was accidentally changed to raise an error in a previous commit (although that was back in 2014…). I've restored the original code, which just logs and doesn't raise an error. There's certainly something odd going on to trigger that line (which clearly doesn't get called often!) - but it's nothing the system can't recover from.
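
For illustration, the shape of the change is roughly this (a sketch only, not the actual dnsruby source; handle_unexpected_event is just a stand-in name):

require 'dnsruby'

# Sketch only: when the resolver sees a queue event it does not recognise,
# it now logs the problem and carries on, instead of raising an exception
# that kills the select thread (and with it the client's query).
def handle_unexpected_event(event)
  Dnsruby.log.error("Unexpected queue event: #{event.inspect}")
  # raise Dnsruby::ResolvError, "Unexpected queue event"   # old behaviour
end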

Hopefully the master branch changes will fix your issues - please let me know!

Thanks!

Alex.


seandilda commented 3 years ago

Thank you so much for the quick turnaround. I agree completely that something odd is going on.

This code is part of a long-running process that reads off a RabbitMQ queue and resolves an IP, returning the authoritative DNS server in the process.

From the Dnsruby point of view, it is a long-running process that regularly runs the following code:

require 'dnsruby'

Dnsruby::Recursor.clear_caches

recursor = Dnsruby::Recursor.new

# By default, Dnsruby::Recursor will stop recursing at the first answer.  This
# doesn't help us find the authoritative server.  This callback will remove the
# answer if the response isn't authoritative
recursor.recursion_callback = proc do |pkt|
  next if pkt.header.aa # skip rest of proc if packet is authoritative

  # We want to ignore non-authoritative answers, so we clear the answer
  pkt.answer.clear
end

# fqdn is the hostname supplied by the queued job
recursor.query("#{fqdn}.", 'CNAME')
recursor.query("#{fqdn}.", 'A')
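
The result handling is then roughly along these lines (a simplified sketch, not the exact production code):

response = recursor.query("#{fqdn}.", 'A')
# With the callback above, recursion only stops once an authoritative server
# answers, so the aa bit should be set on the packet we get back.
puts "authoritative answer: #{response.header.aa}"
response.answer.each { |rr| puts rr.to_s }   # records from the authoritative server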

I recently moved the code from running in a delayed_job instance to its own RabbitMQ worker. It's possible something in delayed_job was helping recover from this situation. There is also the possibility that something in my infrastructure is triggering this issue. I've got two sets of workers. One set has seen this twice in the last 4 days. The other set hasn't seen it, even though it runs the exact same queries.

I'm hoping to test and deploy your modified code later today so that the error messages will start to shed some light on this.

Thanks!

alexdalitz commented 2 years ago

Just wondering if you had any updates on this please?

seandilda commented 2 years ago

Thank you for reaching out. I'm sorry I left this open for so long.

We managed to track this down to a bug in our layer-2 switches. After a firmware upgrade, the issue went away.

For anyone finding this and looking for more details due to a similar problem, our Docker hosts were NATing outbound DNS requests. Occasionally this resulted in a source port of 6784. This is outside most OSes' normal ephemeral port range, but can be used when a Linux host is NATing (i.e. when running Docker containers). The bug is that the switches were redirecting all UDP traffic destined for port 6784 (which, for us, meant the DNS replies coming back to that NATed source port) to their control plane, regardless of the destination MAC address. This port is used by a protocol used to manage LAG interfaces. We didn't have the protocol enabled on the switches, but they were still swallowing the packets.
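
If anyone wants to check for this kind of middlebox behaviour, one quick way is to fire a DNS query from the suspect source port and see whether a reply ever comes back - something like this (a rough sketch; example.com, 8.8.8.8 and the 2-second timeout are arbitrary choices):

require 'socket'
require 'dnsruby'

# Rough sketch: send one DNS query from local UDP port 6784 and see whether
# a reply ever arrives. If nothing comes back but the same query sent from an
# ordinary ephemeral port works, something on the path is eating port 6784.
query = Dnsruby::Message.new('example.com.', 'A')
sock = UDPSocket.new
sock.bind('0.0.0.0', 6784)                 # the suspect source port
sock.send(query.encode, 0, '8.8.8.8', 53)  # any reachable resolver will do

if IO.select([sock], nil, nil, 2)          # wait up to 2 seconds for a reply
  data, _addr = sock.recvfrom(4096)
  reply = Dnsruby::Message.decode(data)
  puts "got a reply with #{reply.header.ancount} answer(s)"
else
  puts "no reply - traffic from source port 6784 may be getting dropped"
end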

alexdalitz commented 2 years ago

Thanks very much for your message! I'm glad you got it sorted.
