Azure / Moneo

Distributed AI/HPC Monitoring Framework
MIT License

Handle exception when publishing logs to Geneva #54

Closed PPPW closed 1 year ago

PPPW commented 1 year ago

On a test cluster with 2 VMs, the Geneva metrics publisher failed after running for a day. We added logging and found the following traceback:

Traceback (most recent call last):
  File "/tmp/moneo-worker/publisher/metrics_publisher.py", line 292, in <module>
    raw_metrics = metricsPublisher.get_metrics()
  File "/tmp/moneo-worker/publisher/metrics_publisher.py", line 165, in get_metrics
    response = urllib.request.urlopen(metrics_url)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

A transient issue caused the exception: the connection was refused when the publisher tried to get metrics from the local endpoint. To keep the process from crashing, we can catch the exception and log it. logger.exception includes the stack trace by default.
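A minimal sketch of this kind of handling, not the actual change in this PR. The function name `get_metrics_safely`, the logger name, and the `timeout` default are assumptions for illustration; only `urllib.request.urlopen`, `urllib.error.URLError`, and `logger.exception` come from the traceback and description above.

```python
import logging
import urllib.error
import urllib.request

# Hypothetical logger name; the real publisher may configure logging differently.
logger = logging.getLogger("metrics_publisher")


def get_metrics_safely(metrics_url, timeout=5):
    """Return the raw metrics text, or None if the endpoint cannot be reached."""
    try:
        with urllib.request.urlopen(metrics_url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError:
        # logger.exception logs at ERROR level and appends the current traceback,
        # so the failure stays visible without terminating the process.
        logger.exception("Failed to get metrics from %s", metrics_url)
        return None
```

With this shape, the caller can skip the current publishing cycle when the result is `None` and try again on the next one.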

We also found that when a command times out, shell_cmd returns the string "TimeOut", which causes errors in quite a few places in node_exporter.py. The easiest fix is to handle it in the base_exporter.
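One way such a guard could look, as a sketch only. The helper `parse_if_completed` and the constant `TIMEOUT_RESULT` are hypothetical names, not part of Moneo's base_exporter; the only detail taken from the description is that a timed-out command yields the string "TimeOut".

```python
import logging

# Hypothetical logger name for illustration.
logger = logging.getLogger("base_exporter")

# The string that shell_cmd returns on timeout, per the description above.
TIMEOUT_RESULT = "TimeOut"


def parse_if_completed(cmd_output, parse):
    """Apply parse to cmd_output, or return None if the command timed out."""
    if cmd_output == TIMEOUT_RESULT:
        # Check once in the shared code path so that individual parsers
        # never receive the timeout marker as if it were real output.
        logger.warning("Command timed out; skipping this result")
        return None
    return parse(cmd_output)
```

Placing the check in one shared helper means the parsing code in node_exporter.py does not need its own timeout checks; callers only have to tolerate a `None` result.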