We have a test cluster with 2 VMs. The Geneva metrics publisher failed after running for a day. We added logging and found the following:
Traceback (most recent call last):
  File "/tmp/moneo-worker/publisher/metrics_publisher.py", line 292, in <module>
    raw_metrics = metricsPublisher.get_metrics()
  File "/tmp/moneo-worker/publisher/metrics_publisher.py", line 165, in get_metrics
    response = urllib.request.urlopen(metrics_url)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
A transient issue caused an exception when the publisher tried to get metrics from the local endpoint: the connection was refused. To prevent the process from crashing, we can handle and log the exception. logger.exception prints the stack trace by default.
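A minimal sketch of that handling, for illustration only. The function signature, the `timeout` parameter, the logger name, and the choice to return `None` on failure are assumptions, not the actual code in metrics_publisher.py:

```python
import logging
import urllib.error
import urllib.request

# Hypothetical logger name; the real publisher may configure logging differently.
logger = logging.getLogger("metrics_publisher")


def get_metrics(metrics_url, timeout=10):
    """Return the raw metrics text, or None if the endpoint cannot be reached."""
    try:
        with urllib.request.urlopen(metrics_url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("Failed to get metrics from %s", metrics_url)
        return None
```

With something like this, the caller can skip a publish cycle when the result is `None` and try again on the next interval instead of exiting.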
We also found that if a command times out, shell_cmd returns the string "TimeOut" as its result, which causes errors in quite a few places in node_exporter.py. The easiest fix is to handle it in the base_exporter.
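One way the base exporter could guard against the timeout result is sketched below. The class layout and the `collect`, `update`, and `process` method names are hypothetical and only illustrate the idea; the real base_exporter may be structured differently:

```python
import logging

# Hypothetical logger name.
logger = logging.getLogger("base_exporter")

# The string that shell_cmd returns when a command times out.
TIMEOUT_RESULT = "TimeOut"


class BaseExporter:
    """Illustrative base class; subclasses supply collect() and update()."""

    def collect(self, metric_name):
        raise NotImplementedError

    def update(self, metric_name, value):
        raise NotImplementedError

    def process(self, metric_names):
        for name in metric_names:
            value = self.collect(name)
            if value == TIMEOUT_RESULT:
                # Skip this metric for the current cycle rather than
                # handing the sentinel string to parsing code downstream.
                logger.warning("Collecting %s timed out, skipping", name)
                continue
            self.update(name, value)
```

Checking for the sentinel once in the base class means the individual parsing paths in the exporters never see the "TimeOut" string.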