anxstj opened this issue 4 years ago
I'm betting this is due to our use of a local domain socket and the maximum packet size for SOCK_DGRAM. You've probably got events that are periodically really large (fact data, perhaps).
One approach here is to just use TCP, but we didn't do that because we didn't want to have to deal with connection security.
Maybe we could detect this type of situation and automatically attempt to truncate the message? Though, I could see that upsetting external log aggregators that relied on e.g., valid JSON. Or perhaps we could chunk the message ourselves.
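The truncation idea above can be sketched as a logging handler. This is an illustration only, not AWX's actual RSysLogHandler: the 60000-byte budget, the helper name, and the truncation marker are all assumptions standing in for whatever the real SOCK_DGRAM limit is on a given platform.

```python
import logging
import logging.handlers

# Hypothetical byte budget; the real SOCK_DGRAM limit depends on the
# platform and socket buffer sizes.
MAX_DGRAM = 60000

def truncate_utf8(text, limit):
    """Trim text to at most `limit` UTF-8 bytes without splitting a
    multi-byte character, appending a marker when anything was cut."""
    encoded = text.encode('utf-8')
    if len(encoded) <= limit:
        return text
    return encoded[:limit].decode('utf-8', 'ignore') + '...[truncated]'

class TruncatingSysLogHandler(logging.handlers.SysLogHandler):
    """SysLogHandler variant that shrinks oversized records so that
    socket.send() cannot fail with EMSGSIZE ([Errno 90])."""
    def emit(self, record):
        record.msg = truncate_utf8(record.getMessage(), MAX_DGRAM)
        record.args = ()
        super().emit(record)
```

Note the downside called out above: a truncated payload is no longer valid JSON, so an aggregator that parses each line would choke on exactly the events that triggered the error.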
We are also getting these stack traces and error messages in our logs. I'm worried that some logging information and job events are being dropped because of this. We see this message when syncing our shared Ansible repo project, which involves a full delete/clone of the git repo and many, many files.
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/handlers.py", line 936, in emit
    self.socket.send(msg)
OSError: [Errno 90] Message too long

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/handlers.py", line 940, in emit
    self.socket.send(msg)
OSError: [Errno 90] Message too long

Call stack:
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 154, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_callback_receiver.py", line 32, in handle
    queues=[getattr(settings, 'CALLBACK_QUEUE', '')],
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 59, in __init__
    self.pool.init_workers(self.worker.work_loop)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/pool.py", line 241, in init_workers
    self.up()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/pool.py", line 253, in up
    worker.start()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/pool.py", line 78, in start
    self.process.start()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
    code = process_obj._bootstrap()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/callback.py", line 99, in work_loop
    return super(CallbackBrokerWorker, self).work_loop(*args, **kw)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/base.py", line 191, in work_loop
    self.perform_work(body, *args)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/worker/callback.py", line 178, in perform_work
    event = cls.create_from_data(**body)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/events.py", line 437, in create_from_data
    event._update_from_event_data()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/models/events.py", line 392, in _update_from_event_data
    extra=dict(python_objects=dict(job_event=self))
  File "/usr/lib64/python3.6/logging/__init__.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib64/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib64/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/utils/handlers.py", line 30, in emit
    return super(RSysLogHandler, self).emit(msg)
Message: 'Event data saved.'
Arguments: ()
I'm dubious that "Event data saved" actually happened 😬
@hjkatz on a whim - what happens if you set MAX_EVENT_RES=60000 in AWX_TASK_ENV (or MAX_EVENT_RES_DATA in the deployment settings)?
Maybe we should write the JobEvent logs to a file and configure rsyslog to send them to the log aggregator? I think that would fix the problem. Or is this a bad idea?
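The Python side of that file-based idea could look roughly like the sketch below (the rsyslog side, e.g. tailing the file with its imfile module and forwarding it, is omitted). The path, logger name, and rotation numbers are all hypothetical, not AWX's real ones; this is an illustration of the approach, not an AWX patch.

```python
import logging
import logging.handlers

# Sketch: write job-event records to a rotating file instead of a
# datagram socket, and let rsyslog tail and forward the file. A file
# handler has no datagram size limit, so oversized events cannot fail
# with [Errno 90] Message too long.
handler = logging.handlers.RotatingFileHandler(
    '/var/log/awx/job_events.log',   # hypothetical path
    maxBytes=50 * 1024 * 1024,       # rotate at ~50 MiB
    backupCount=5,                   # keep five rotated files
    delay=True,                      # don't open the file until first write
)
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(message)s'))

logger = logging.getLogger('awx.analytics.job_events')  # hypothetical name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

One trade-off worth noting: rotation plus rsyslog tailing adds moving parts, and disk can fill up (as a later comment in this thread about /var/log/ illustrates).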
Was there ever a solution found for this? I'm seeing it in v15 and log data is definitely being dropped.
setting this as a workaround helped in my case: https://github.com/ansible/awx-operator#no-log
Hi @pfeifferj What's the workaround that has helped you? I can't see anything specific in that link.
@lals1 I vaguely remember that the playbooks generated a lot of logs, which filled up /var/log/ and led to AAP execution pods dying. Setting the no-log option in the operator helped. IIRC, there was also a bug in OpenShift/Kubernetes, fixed in OCP 4.10.16, that made the logging issue worse with the AAP/AWX operator. Scheduling the AAP/AWX operator workloads on dedicated nodes also helped.
hope this helps!
Thanks for the response @pfeifferj
In my case, this issue started when I upgraded AWX from version 21.0.0 to 23.2.0. It was working fine earlier. We use an external log aggregator of type logstash to push only AWX-related logs.
Just wondering what changed between those AWX versions such that rsyslog now complains about the message being too long 🤔
This is still an issue using AWX 24.6.1, preventing us from shipping package facts to Splunk -- does anyone have a solution?
ISSUE TYPE
SUMMARY
The awx_task container logs a traceback:
I stumbled on it while examining the logs and found it worth reporting.
ENVIRONMENT
STEPS TO REPRODUCE
Actually I don't know how to reproduce this.
EXPECTED RESULTS
No traceback.
ACTUAL RESULTS
A traceback is logged.
ADDITIONAL INFORMATION