DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

Agent stops working when connectivity is lost and never recovers #3379

Open iongion opened 7 years ago

iongion commented 7 years ago

I reproduced this in VirtualBox by toggling the network adapter with the connect/disconnect shortcut.

In /var/log/datadog/collector.log

2017-06-08 14:04:07 CEST | ERROR | dd.collector | checks.tcp_check(network_checks.py:161) | Failed to process instance ''.
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/checks/network_checks.py", line 147, in _process
    statuses = self._check(instance)
  File "/opt/datadog-agent/agent/checks.d/tcp_check.py", line 61, in _check
    addr, port, custom_tags, socket_type, timeout, response_time = self._load_conf(instance)
  File "/opt/datadog-agent/agent/checks.d/tcp_check.py", line 56, in _load_conf
    raise BadConfException("URL: %s is not a correct IPv4, IPv6 or hostname" % addr)
UnboundLocalError: local variable 'addr' referenced before assignment
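
The traceback above ends in an UnboundLocalError rather than the intended BadConfException, which suggests the error message is formatted with a variable that is only assigned when address resolution succeeds. The following is a minimal sketch of that pattern and one possible fix. The names `load_conf_buggy`, `load_conf_fixed`, the `resolve` parameter, and the local `BadConfException` class are hypothetical stand-ins for illustration, not the Agent's actual source.

```python
import socket


class BadConfException(Exception):
    """Hypothetical stand-in for the exception named in the traceback."""


def load_conf_buggy(url, resolve=socket.gethostbyname):
    try:
        # addr is only bound if resolution succeeds.
        addr = resolve(url)
    except socket.error:
        # Without connectivity, addr was never assigned, so formatting the
        # message raises UnboundLocalError and masks BadConfException.
        raise BadConfException("URL: %s is not a correct IPv4, IPv6 or hostname" % addr)
    return addr


def load_conf_fixed(url, resolve=socket.gethostbyname):
    try:
        addr = resolve(url)
    except socket.error:
        # Report the configured value, which is always defined.
        raise BadConfException("URL: %s is not a correct IPv4, IPv6 or hostname" % url)
    return addr
```

Under this assumption, the check would still fail while the network is down, but it would fail with the intended configuration error instead of an unrelated UnboundLocalError.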

In /var/log/datadog/dogstatsd.log

2017-06-08 14:04:02 CEST | ERROR | dd.dogstatsd | dogstatsd(dogstatsd.py:325) | Unable to post payload.
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/dogstatsd.py", line 315, in submit_http
    r = requests.post(url, data=data, timeout=5, headers=headers)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 110, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
    r = adapter.send(request, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py", line 499, in send
    raise ReadTimeout(e, request=request)
ReadTimeout: HTTPConnectionPool(host='localhost', port=17123): Read timed out. (read timeout=5)

olivielpeau commented 7 years ago

Hi @iongion, both of these errors can be expected when connectivity is lost (they should not make the Agent crash, though), and they should disappear once connectivity is re-established.

If this is not the case for you, the best way to troubleshoot is to send a flare to our support team; they should be able to help you out. Thanks!
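
The behavior described in this reply, where errors are logged during an outage but the process keeps running and recovers afterwards, can be illustrated with a small sketch. `flush_all` and the `post` callable are hypothetical names; this is not the Agent's implementation, and whether the real Agent retains or drops failed payloads is not stated in this thread.

```python
import logging

log = logging.getLogger("flush_sketch")


def flush_all(payloads, post):
    """Try to post every payload and return the ones that failed.

    A failed post is logged and kept for the next flush cycle instead of
    propagating and stopping the loop.
    """
    pending = []
    for payload in payloads:
        try:
            post(payload)
        except Exception:
            # Log the traceback, as in the dogstatsd.log excerpt, and move on.
            log.exception("Unable to post payload.")
            pending.append(payload)
    return pending
```

Calling `flush_all` again with the returned list once the network is back would deliver the retained payloads, which is the kind of recovery the reply says to expect.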