dougbw / coredns_omada

CoreDNS plugin for TP-Link Omada SDN
Apache License 2.0

If controller is unavailable coredns crashes #41

Closed tim-duncan closed 5 months ago

tim-duncan commented 5 months ago

@dougbw ,

Maybe this is just me but if my controller is unavailable for some reason, the plugin refresh process seems to cause coredns to crash.

I am not familiar with coredns plugins, but is there a way to gracefully handle an offline controller? Perhaps just log the failure instead? That way clients can continue to get external resolution while the controller comes back online.

(I also really appreciate your efforts here. The plugin is super useful for me.)

dougbw commented 5 months ago

Hey, the way this was meant to work is that if the initial zone load from the controller fails, the plugin terminates, which helps catch misconfigurations quickly. Subsequent zone update failures should just log an error and continue serving the existing zones.
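For readers unfamiliar with the pattern, here is a minimal Go sketch of the behaviour described above: fail fast on the initial load, but only log on later refresh failures and keep serving the last good data. This is not the plugin's actual code. The names `zoneStore`, `newZoneStore`, `refresh`, `loadOK` and `loadFail` are hypothetical, and the host name and address are made-up example values.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// zoneLoader fetches the current records from a controller.
// Hypothetical type for illustration; not part of coredns_omada.
type zoneLoader func() (map[string]string, error)

// zoneStore holds the last successfully loaded records.
// A real plugin would also need locking for concurrent access.
type zoneStore struct {
	records map[string]string
}

// newZoneStore performs the initial load and returns an error if it fails,
// so the caller can terminate early (fail fast).
func newZoneStore(load zoneLoader) (*zoneStore, error) {
	records, err := load()
	if err != nil {
		return nil, fmt.Errorf("initial zone load failed: %w", err)
	}
	return &zoneStore{records: records}, nil
}

// refresh tries to reload the records. On failure it logs the error and
// leaves the existing records untouched.
func (s *zoneStore) refresh(load zoneLoader) {
	records, err := load()
	if err != nil {
		log.Printf("[ERROR] failed to update zones: %v", err)
		return
	}
	s.records = records
}

// loadOK simulates a reachable controller (example data only).
func loadOK() (map[string]string, error) {
	return map[string]string{"host.example.": "192.0.2.10"}, nil
}

// loadFail simulates an unreachable controller.
func loadFail() (map[string]string, error) {
	return nil, errors.New("connection refused")
}

func main() {
	// Controller down at startup: the initial load fails and startup aborts.
	if _, err := newZoneStore(loadFail); err != nil {
		log.Printf("startup aborted: %v", err)
	}

	// Controller up at startup, then down during a later refresh:
	// the previously loaded records are still served.
	store, err := newZoneStore(loadOK)
	if err != nil {
		log.Fatal(err)
	}
	store.refresh(loadFail)
	fmt.Println(store.records["host.example."])
}
```

The point of splitting the two paths is that a startup failure usually means a misconfiguration, while a failure after a successful load more likely means a temporary outage.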

This definitely worked as intended at some point, but it's possible that it stopped working this way. If you are able to reproduce it, could you enable debug logs and post them here? That would help with troubleshooting.

timduncan-innowell commented 5 months ago

@dougbw ,

Ah ok... I think in my case the controller was not available when the DNS server/service was restarted because I was moving both around in the rack. I only noticed the issue because all the clients on the network failed to resolve while coredns was unavailable.

I guess it would be nice if fail-fast was a choice. It would also be nice if, when I have more than one coredns plugin, a single bad config didn't break the entire chain. Maybe that's how coredns prefers it; not my area of expertise.

I can probably cope with this scenario on my network by:

  1. setting the secondary DNS server for all my clients to the upstream resolver
  2. ensuring the coredns container runs with a "restart always" policy, so it will eventually recover once the controller is responding

Happy for you to close (and perhaps consider making fail-fast configurable).
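To make the "configurable fail-fast" idea concrete, here is a rough Go sketch of what such a switch could look like. It is purely illustrative: nothing in this thread indicates the plugin has such an option, and `initialLoad`, `flakyLoader`, the `failFast` flag and the retry parameters are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// initialLoad calls load once when failFast is true. Otherwise it retries up
// to maxAttempts times, sleeping for delay between attempts.
// Hypothetical helper; not an existing plugin or CoreDNS API.
func initialLoad(load func() error, failFast bool, maxAttempts int, delay time.Duration) error {
	if failFast || maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = load(); err == nil {
			return nil
		}
		fmt.Printf("initial load attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempt(s): %w", maxAttempts, err)
}

// flakyLoader returns a loader that fails for the first `failures` calls and
// succeeds afterwards, simulating a controller that is still booting.
func flakyLoader(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("controller not ready")
		}
		return nil
	}
}

func main() {
	// Tolerant mode: the simulated controller becomes ready on the third attempt.
	err := initialLoad(flakyLoader(2), false, 5, 10*time.Millisecond)
	fmt.Println("tolerant startup error:", err)

	// Fail-fast mode: the first failure is returned immediately.
	err = initialLoad(flakyLoader(2), true, 5, 10*time.Millisecond)
	fmt.Println("fail-fast startup error:", err)
}
```

In tolerant mode the retry loop does roughly what the "restart always" container policy does from the outside, without restarting the whole DNS server.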

tim-duncan commented 5 months ago

Oops. Just realised my last response was with my work account instead of my personal account that opened the issue.

All comments still apply. Thanks.

dougbw commented 5 months ago

Hey, I just tested this and I don't seem to be able to reproduce any issues here. The behaviour is as intended:

Example logs from this scenario:

dbw:/mnt/c/repos/coredns$ sudo OMADA_DISABLE_HTTPS_VERIFICATION=true ./coredns
[INFO] plugin/omada: logging in...
[INFO] plugin/omada: found '1' sites: [Home]
[INFO] plugin/omada: update: updating zones...
.:53
CoreDNS-1.11.1
linux/amd64, go1.20.2, 945db2fa-dirty
[DEBUG] plugin/omada: query; type: 1, name: oc200.internal.domain.
[DEBUG] plugin/omada: checking if zone is managed: oc200.internal.domain.
[DEBUG] plugin/omada: -- ✅ zone name: internal.domain.
[DEBUG] plugin/omada: -- ✅ answer len: 1, result: 0
[INFO] 127.0.0.1:53749 - 26567 "A IN oc200.internal.domain. udp 56 false 4096" NOERROR qr,aa,rd 64 0.007979989s
[DEBUG] plugin/omada: query; type: 1, name: oc200.internal.domain.
[DEBUG] plugin/omada: checking if zone is managed: oc200.internal.domain.
[DEBUG] plugin/omada: -- ✅ zone name: internal.domain.
[DEBUG] plugin/omada: -- ✅ answer len: 1, result: 0
[INFO] 127.0.0.1:38224 - 65155 "A IN oc200.internal.domain. udp 56 false 4096" NOERROR qr,aa,rd 64 0.000259123s

[INFO] plugin/omada: update: updating zones...
[DEBUG] plugin/omada: update: getting networks for site: Home
[ERROR] plugin/omada: Failed to update zones: error getting networks from omada controller: Get "https://10.0.0.10/978bee230c77bbb45d9c8545d04d700a/api/v2/sites/Default/setting/lan/networks?currentPage=1&currentPageSize=999": dial tcp 10.0.0.10:443: connect: connection refused

[DEBUG] plugin/omada: query; type: 1, name: oc200.internal.domain.
[DEBUG] plugin/omada: checking if zone is managed: oc200.internal.domain.
[DEBUG] plugin/omada: -- ✅ zone name: internal.domain.
[DEBUG] plugin/omada: -- ✅ answer len: 1, result: 0
[INFO] 127.0.0.1:43929 - 24159 "A IN oc200.internal.domain. udp 56 false 4096" NOERROR qr,aa,rd 64 0.00038925s

tim-duncan commented 5 months ago

@dougbw yes, I was able to confirm that the plugin works as you described. I will close this issue.

My scenario was definitely caused by both the DNS server and the controller losing power at approximately the same time: coredns crashed on the first poll because the controller was not yet ready after restarting. Once the coredns container was configured to restart-always, it would generally restart a few times before the controller returned to a ready state.

I also added the secondary DNS server (pointing directly to the upstream) for the clients and now any connectivity issues go away in this power-outage-restart scenario.

Thanks again for your work.