Open sgrallert opened 2 years ago
Hi everyone,
i have a similar issue. When running wmi queries one by one everything seems fine. After i fire up naemon for regular checks (limited to 6 parallel) with check_wmi_plus (1.68) the server crashes regularly:
Sep 01 15:47:00 ##### nice[25964]: [2022-09-01 15:47:00 +0200] [25964] [CRITICAL] WORKER TIMEOUT (pid:777)
Sep 01 15:47:00 ##### nice[25964]: ERROR:asyncio:Fatal error: protocol.data_received() call failed.
Sep 01 15:47:00 ##### nice[25964]: protocol: <aiowmi.protocol.Protocol object at 0x7f0f7886edf0>
Sep 01 15:47:00 ##### nice[25964]: transport: <_SelectorSocketTransport fd=14 read=polling write=<idle, bufsize=0>>
Sep 01 15:47:00 ##### nice[25964]: Traceback (most recent call last):
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
Sep 01 15:47:00 ##### nice[25964]: return loop.run_until_complete(main)
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/asyncio/base_events.py", line 595, in run_until_complete
Sep 01 15:47:00 ##### nice[25964]: self.run_forever()
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/asyncio/base_events.py", line 563, in run_forever
Sep 01 15:47:00 ##### nice[25964]: self._run_once()
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/asyncio/base_events.py", line 1808, in _run_once
Sep 01 15:47:00 ##### nice[25964]: event_list = self._selector.select(timeout)
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/selectors.py", line 468, in select
Sep 01 15:47:00 ##### nice[25964]: fd_event_list = self._selector.poll(timeout, max_ev)
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3/dist-packages/gunicorn/workers/base.py", line 192, in handle_abort
Sep 01 15:47:00 ##### nice[25964]: SystemExit: 1
Sep 01 15:47:00 ##### nice[25964]: During handling of the above exception, another exception occurred:
Sep 01 15:47:00 ##### nice[25964]: Traceback (most recent call last):
Sep 01 15:47:00 ##### nice[25964]: File "/usr/lib/python3.8/asyncio/selector_events.py", line 850, in _read_ready__data_received
Sep 01 15:47:00 ##### nice[25964]: self._protocol.data_received(data)
Sep 01 15:47:00 ##### nice[25964]: File "/etc/aiowmi/aiowmi/protocol.py", line 88, in data_received
Sep 01 15:47:00 ##### nice[25964]: req, data, = self._requests[self._buf.call_id], self._buf.data
Sep 01 15:47:00 ##### nice[25964]: KeyError: 222
This could be a result of the connection which is not being closed when it should. In the latest version, 0.2.2 we fixed closing the connection in the finally block. Is it possible to test if this solves the problem? And if not, can you explain which steps to take to reproduce this issue?
Thanks for the fix. I will apply it to my system and check if the error still occurs. However, this may take a few days, as the error sometimes appeared only after a few days.
To recreate the error you basically just need to set up the wmic_server and query multiple servers. For example, query the classes win32_product, win32_service, win32_logicaldisk and win32computersystem ("select * from win32...") every x minutes on x servers. The error sometimes occurred several times a day, sometimes only every few days.
Hi @joente and thank you for the quick fix. unfortunately the issue is still present in my enviroment:
Sep 06 15:05:10 ##### nice[6999]: [2022-09-06 15:05:10 +0200] [6999] [CRITICAL] WORKER TIMEOUT (pid:7016)
Sep 06 15:05:10 ##### nice[6999]: ERROR:asyncio:Fatal error: protocol.data_received() call failed.
Sep 06 15:05:10 ##### nice[6999]: protocol: <aiowmi.protocol.Protocol object at 0x7f89b60f9670>
Sep 06 15:05:10 ##### nice[6999]: transport: <_SelectorSocketTransport fd=14 read=polling write=<idle, bufsize=0>>
Sep 06 15:05:10 ##### nice[6999]: Traceback (most recent call last):
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
Sep 06 15:05:10 ##### nice[6999]: return loop.run_until_complete(main)
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/asyncio/base_events.py", line 595, in run_until_complete
Sep 06 15:05:10 ##### nice[6999]: self.run_forever()
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/asyncio/base_events.py", line 563, in run_forever
Sep 06 15:05:10 ##### nice[6999]: self._run_once()
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/asyncio/base_events.py", line 1808, in _run_once
Sep 06 15:05:10 ##### nice[6999]: event_list = self._selector.select(timeout)
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/selectors.py", line 468, in select
Sep 06 15:05:10 ##### nice[6999]: fd_event_list = self._selector.poll(timeout, max_ev)
Sep 06 15:05:10 ##### nice[6999]: File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 203, in handle_abort
Sep 06 15:05:10 ##### nice[6999]: sys.exit(1)
Sep 06 15:05:10 ##### nice[6999]: SystemExit: 1
Sep 06 15:05:10 ##### nice[6999]: During handling of the above exception, another exception occurred:
Sep 06 15:05:10 ##### nice[6999]: Traceback (most recent call last):
Sep 06 15:05:10 ##### nice[6999]: File "/usr/lib/python3.8/asyncio/selector_events.py", line 850, in _read_ready__data_received
Sep 06 15:05:10 ##### nice[6999]: self._protocol.data_received(data)
Sep 06 15:05:10 ##### nice[6999]: File "/etc/aiowmi/aiowmi/protocol.py", line 88, in data_received
Sep 06 15:05:10 ##### nice[6999]: req, data, = self._requests[self._buf.call_id], self._buf.data
Sep 06 15:05:10 ##### nice[6999]: KeyError: 149
Sep 06 15:05:11 ##### nice[6999]: [2022-09-06 15:05:11 +0200] [6999] [WARNING] Worker with pid 7016 was terminated due to signal 9
Sep 06 15:05:11 ##### nice[6999]: [2022-09-06 15:05:11 +0200] [31914] [INFO] Booting worker with pid: 31914
the current setup is: ubuntu 18.4 python 3.8 aiowmi 0.2.2 on gunicorn (version 20.1.0) check_wmi_plus 1.68 omd labs edition 4.60 with naemon core
so after i have startet the wmic_server service i can run multiple check_wmi_plus checks without any issues. Everything is working like a charm. The issue only occures only after i start the omd site which has multiple parallel check_wmi_plus checks. the wmic_server service becomes unresponsive and is terminated within the first 30sek.
i can try to reduce the number of checks by moving out some naemon cfg files to reduce the number of checks if it helps.
The error seems to still occur for me as well.
Meanwhile, it can be run as a SystemD service without using gunicorn by using these lines in wmic_server.service:
ExecStart=/bin/sh -c 'PYTHONPATH=/opt/aiowmi python3 /opt/aiowmi/contrib/wmic_server/wmic_server.py'
ExecStop=/usr/bin/pkill -f 'python3 .*wmic_server\.py'
Alter to suit, of course.
Hello everyone, first of all thank you for this great library. After wmic stopped working, I searched a long time for a good replacement.
I use the wmic_server to monitor some servers. But now I get the following error in irregular intervals:
The servers are read out every 5 minutes with several querys. What could be the reason for the error?