Errors at Irregular Intervals

sgrallert commented 2 years ago

Hello everyone, first of all thank you for this great library. After wmic stopped working, I searched a long time for a good replacement.

I use the wmic_server to monitor some servers. But now I get the following error in irregular intervals:

Aug 25 16:26:14 ##### nice[675]: [2022-08-25 16:26:14 +0200] [675] [CRITICAL] WORKER TIMEOUT (pid:122590)
Aug 25 16:26:14 ##### nice[122590]: ERROR:asyncio:Fatal error: protocol.data_received() call failed.
Aug 25 16:26:14 ##### nice[122590]: protocol: <aiowmi.protocol.Protocol object at 0x7fb800628a30>
Aug 25 16:26:14 ##### nice[122590]: transport: <_SelectorSocketTransport fd=14 read=polling write=<idle, bufsize=0>>
Aug 25 16:26:14 ##### nice[122590]: Traceback (most recent call last):
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
Aug 25 16:26:14 ##### nice[122590]:     return loop.run_until_complete(main)
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
Aug 25 16:26:14 ##### nice[122590]:     self.run_forever()
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
Aug 25 16:26:14 ##### nice[122590]:     self._run_once()
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
Aug 25 16:26:14 ##### nice[122590]:     event_list = self._selector.select(timeout)
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/selectors.py", line 468, in select
Aug 25 16:26:14 ##### nice[122590]:     fd_event_list = self._selector.poll(timeout, max_ev)
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 203, in handle_abort
Aug 25 16:26:14 ##### nice[122590]:     sys.exit(1)
Aug 25 16:26:14 ##### nice[122590]: SystemExit: 1
Aug 25 16:26:14 ##### nice[122590]: During handling of the above exception, another exception occurred:
Aug 25 16:26:14 ##### nice[122590]: Traceback (most recent call last):
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/lib/python3.8/asyncio/selector_events.py", line 862, in _read_ready__data_received
Aug 25 16:26:14 ##### nice[122590]:     self._protocol.data_received(data)
Aug 25 16:26:14 ##### nice[122590]:   File "/usr/local/lib/python3.8/dist-packages/aiowmi/protocol.py", line 86, in data_received
Aug 25 16:26:14 ##### nice[122590]:     req, data, = self._requests[self._buf.call_id], self._buf.data
Aug 25 16:26:14 ##### nice[122590]: KeyError: 7
Aug 25 16:26:15 ##### nice[675]: [2022-08-25 16:26:15 +0200] [675] [WARNING] Worker with pid 122590 was terminated due to signal 9
Aug 25 16:26:15 ##### nice[133453]: [2022-08-25 16:26:15 +0200] [133453] [INFO] Booting worker with pid: 133453

The servers are read out every 5 minutes with several querys. What could be the reason for the error?

eduni commented 2 years ago

Hi everyone,

i have a similar issue. When running wmi queries one by one everything seems fine. After i fire up naemon for regular checks (limited to 6 parallel) with check_wmi_plus (1.68) the server crashes regularly:

Sep 01 15:47:00 ##### nice[25964]: [2022-09-01 15:47:00 +0200] [25964] [CRITICAL] WORKER TIMEOUT (pid:777)
Sep 01 15:47:00 ##### nice[25964]: ERROR:asyncio:Fatal error: protocol.data_received() call failed.
Sep 01 15:47:00 ##### nice[25964]: protocol: <aiowmi.protocol.Protocol object at 0x7f0f7886edf0>
Sep 01 15:47:00 ##### nice[25964]: transport: <_SelectorSocketTransport fd=14 read=polling write=<idle, bufsize=0>>
Sep 01 15:47:00 ##### nice[25964]: Traceback (most recent call last):
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
Sep 01 15:47:00 ##### nice[25964]:     return loop.run_until_complete(main)
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 595, in run_until_complete
Sep 01 15:47:00 ##### nice[25964]:     self.run_forever()
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 563, in run_forever
Sep 01 15:47:00 ##### nice[25964]:     self._run_once()
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 1808, in _run_once
Sep 01 15:47:00 ##### nice[25964]:     event_list = self._selector.select(timeout)
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/selectors.py", line 468, in select
Sep 01 15:47:00 ##### nice[25964]:     fd_event_list = self._selector.poll(timeout, max_ev)
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3/dist-packages/gunicorn/workers/base.py", line 192, in handle_abort
Sep 01 15:47:00 ##### nice[25964]: SystemExit: 1
Sep 01 15:47:00 ##### nice[25964]: During handling of the above exception, another exception occurred:
Sep 01 15:47:00 ##### nice[25964]: Traceback (most recent call last):
Sep 01 15:47:00 ##### nice[25964]:   File "/usr/lib/python3.8/asyncio/selector_events.py", line 850, in _read_ready__data_received
Sep 01 15:47:00 ##### nice[25964]:     self._protocol.data_received(data)
Sep 01 15:47:00 ##### nice[25964]:   File "/etc/aiowmi/aiowmi/protocol.py", line 88, in data_received
Sep 01 15:47:00 ##### nice[25964]:     req, data, = self._requests[self._buf.call_id], self._buf.data
Sep 01 15:47:00 ##### nice[25964]: KeyError: 222

joente commented 2 years ago

This could be a result of the connection which is not being closed when it should. In the latest version, 0.2.2 we fixed closing the connection in the finally block. Is it possible to test if this solves the problem? And if not, can you explain which steps to take to reproduce this issue?

sgrallert commented 2 years ago

Thanks for the fix. I will apply it to my system and check if the error still occurs. However, this may take a few days, as the error sometimes appeared only after a few days.

To recreate the error you basically just need to set up the wmic_server and query multiple servers. For example, query the classes win32_product, win32_service, win32_logicaldisk and win32computersystem ("select * from win32...") every x minutes on x servers. The error sometimes occurred several times a day, sometimes only every few days.

eduni commented 2 years ago

Hi @joente and thank you for the quick fix. unfortunately the issue is still present in my enviroment:

Sep 06 15:05:10 ##### nice[6999]: [2022-09-06 15:05:10 +0200] [6999] [CRITICAL] WORKER TIMEOUT (pid:7016)
Sep 06 15:05:10 ##### nice[6999]: ERROR:asyncio:Fatal error: protocol.data_received() call failed.
Sep 06 15:05:10 ##### nice[6999]: protocol: <aiowmi.protocol.Protocol object at 0x7f89b60f9670>
Sep 06 15:05:10 ##### nice[6999]: transport: <_SelectorSocketTransport fd=14 read=polling write=<idle, bufsize=0>>
Sep 06 15:05:10 ##### nice[6999]: Traceback (most recent call last):
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
Sep 06 15:05:10 ##### nice[6999]:     return loop.run_until_complete(main)
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 595, in run_until_complete
Sep 06 15:05:10 ##### nice[6999]:     self.run_forever()
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 563, in run_forever
Sep 06 15:05:10 ##### nice[6999]:     self._run_once()
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 1808, in _run_once
Sep 06 15:05:10 ##### nice[6999]:     event_list = self._selector.select(timeout)
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/selectors.py", line 468, in select
Sep 06 15:05:10 ##### nice[6999]:     fd_event_list = self._selector.poll(timeout, max_ev)
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 203, in handle_abort
Sep 06 15:05:10 ##### nice[6999]:     sys.exit(1)
Sep 06 15:05:10 ##### nice[6999]: SystemExit: 1
Sep 06 15:05:10 ##### nice[6999]: During handling of the above exception, another exception occurred:
Sep 06 15:05:10 ##### nice[6999]: Traceback (most recent call last):
Sep 06 15:05:10 ##### nice[6999]:   File "/usr/lib/python3.8/asyncio/selector_events.py", line 850, in _read_ready__data_received
Sep 06 15:05:10 ##### nice[6999]:     self._protocol.data_received(data)
Sep 06 15:05:10 ##### nice[6999]:   File "/etc/aiowmi/aiowmi/protocol.py", line 88, in data_received
Sep 06 15:05:10 ##### nice[6999]:     req, data, = self._requests[self._buf.call_id], self._buf.data
Sep 06 15:05:10 ##### nice[6999]: KeyError: 149
Sep 06 15:05:11 ##### nice[6999]: [2022-09-06 15:05:11 +0200] [6999] [WARNING] Worker with pid 7016 was terminated due to signal 9
Sep 06 15:05:11 ##### nice[6999]: [2022-09-06 15:05:11 +0200] [31914] [INFO] Booting worker with pid: 31914

the current setup is: ubuntu 18.4 python 3.8 aiowmi 0.2.2 on gunicorn (version 20.1.0) check_wmi_plus 1.68 omd labs edition 4.60 with naemon core

so after i have startet the wmic_server service i can run multiple check_wmi_plus checks without any issues. Everything is working like a charm. The issue only occures only after i start the omd site which has multiple parallel check_wmi_plus checks. the wmic_server service becomes unresponsive and is terminated within the first 30sek.

i can try to reduce the number of checks by moving out some naemon cfg files to reduce the number of checks if it helps.

sgrallert commented 2 years ago

The error seems to still occur for me as well.

yjfguhd commented 2 years ago

Meanwhile, it can be run as a SystemD service without using gunicorn by using these lines in wmic_server.service:

ExecStart=/bin/sh -c 'PYTHONPATH=/opt/aiowmi python3 /opt/aiowmi/contrib/wmic_server/wmic_server.py'
ExecStop=/usr/bin/pkill -f 'python3 .*wmic_server\.py'

Alter to suit, of course.

cesbit / aiowmi

Errors at Irregular Intervals #18