DataDog / integrations-extras

Community developed integrations and plugins for the Datadog Agent.
BSD 3-Clause "New" or "Revised" License

NS1 Integration does not recognize NS1's API limits and therefore does not work for large zone and record lists #1134

Closed rlee4advancelocal closed 2 years ago

rlee4advancelocal commented 2 years ago
2022-01-13 20:47:37 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:ns1 | Error running check: [{"message": "429 Client Error: Too Many Requests for url: https://api.nsone.net/v1/zones/redacted1.com", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1017, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 48, in check\n checkUrl = self.create_url(self.metrics, self.query_params, self.networks)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 92, in create_url\n checkUrl.update(self.ns1.get_stats_url_usage(key, val, networknames))\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/ns1_url_utils.py\", line 61, in get_stats_url_usage\n records =self.check.get_zone_records(domain)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 140, in get_zone_records\n res = self.get_stats(url)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 395, in get_stats\n response.raise_for_status()\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py\", line 943, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.nsone.net/v1/zones/redacted1.com\n"}]
2022-01-13 20:47:37 UTC | CORE | INFO | (pkg/collector/worker/check_logger.go:56 in CheckFinished) | check:ns1 | Done running check
2022-01-13 20:47:53 UTC | CORE | INFO | (pkg/collector/worker/check_logger.go:37 in CheckStarted) | check:ns1 | Running check...
2022-01-13 20:47:53 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:126 in LogMessage) | ns1:c04117e28d82671f | (check.py:42) | Startup
2022-01-13 20:48:20 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:68 in Error) | check:ns1 | Error running check: [{"message": "429 Client Error: Too Many Requests for url: https://api.nsone.net/v1/zones/redacted2.com", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1017, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 48, in check\n checkUrl = self.create_url(self.metrics, self.query_params, self.networks)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 87, in create_url\n checkUrl.update(self.ns1.get_stats_url_qps(key,val))\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/ns1_url_utils.py\", line 136, in get_stats_url_qps\n records = self.check.get_zone_records(domain)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 140, in get_zone_records\n res = self.get_stats(url)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/ns1/check.py\", line 395, in get_stats\n response.raise_for_status()\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py\", line 943, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.nsone.net/v1/zones/redacted2.com\n"}]

Steps to reproduce the issue:

  1. Have a large number of zones & records
  2. Let agent run and max out NS1 API queries
  3. The agent fails the lookup and keeps failing on subsequent lookups because the API query limit has been reached

Describe the results you received: When testing the agent against our org's zone list, we found that the module does not have a way to:

  1. Control the rate of API queries so that the API query limit is not reached
  2. Retry failed API queries
  3. Back off exponentially on failed queries
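
To illustrate the retry and exponential backoff behavior asked for above, here is a minimal sketch. It is not the integration's code: `RateLimited`, `call_with_backoff`, and the `fetch` callable are hypothetical names, and the idea that a 429 response carries a suggested wait time (`retry_after`) is an assumption about the server, not something confirmed for the NS1 API.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a fetch function raises on an HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        # Seconds the server asked us to wait, if it said so (assumption).
        self.retry_after = retry_after


def call_with_backoff(fetch, url, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            # Delay doubles on each attempt (base, 2*base, 4*base, ...), capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Honor a longer server-suggested wait when one was provided.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            sleep(delay)
```

A check could wrap each per-zone request in `call_with_backoff` so that one 429 delays the run instead of failing it. Adding random jitter to `delay` would help when several agents share one API key, and pacing requests up front would address the first point in the list.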

Describe the results you expected: Expected that the application would be aware of the API query limits in NS1 (https://help.ns1.com/hc/en-us/articles/360020250573-About-API-rate-limiting) and would work with a large zone/record list.

Additional information you deem important (e.g. issue happens only occasionally): Additionally, we think the application might prematurely abort querying a list of zones and records for stats when it encounters a misconfiguration, e.g. a record that doesn't exist. Ideally, the entire configuration list would not be dropped at the first record that doesn't exist, since DNS records are often added, changed, or removed.
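
As a rough sketch of the behavior suggested above (skip a missing record and keep going, rather than dropping the whole list), something like the following could work. `RecordNotFound`, `collect_stats`, and `fetch_stats` are hypothetical names used only for illustration and do not come from the integration.

```python
class RecordNotFound(Exception):
    """Hypothetical error raised when a configured zone or record is missing."""


def collect_stats(targets, fetch_stats, log=print):
    """Fetch stats for every target, skipping the ones that no longer exist."""
    results = {}
    skipped = []
    for target in targets:
        try:
            results[target] = fetch_stats(target)
        except RecordNotFound:
            # A stale config entry should not abort the remaining lookups.
            skipped.append(target)
            log("skipping missing record: %s" % target)
    return results, skipped
```

Returning the skipped entries separately lets the caller report the stale configuration without losing the stats that were collected.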

hithwen commented 2 years ago

Pinging @dblagojevic-daitan as integration maintainer

yzhan289 commented 2 years ago

Closing this ticket since the PR was merged.