Closed Borrelworst closed 6 years ago
Any suggestions on what can be done to troubleshoot this further please?
Also having this problem in k8s. Tried a few things listed here, but still will randomly get closed sockets even when directly connected to the web container. If there are any debugging things to run, I can do so if needed.
I'm unclear on what might be causing the closed sockets SamKirsch mentioned, but that sounds like a deeper, different issue and one not entirely constrained to the job details page?
There are some race conditions involving setting up the initial connection to the job details page that have been resolved downstream and will be landing in AWX shortly.
These changes might resolve some of the issues mentioned by others above - one way to know if they will help is if you're currently still able to see dynamic updates to socket-driven content other than the incoming output lines (status icons, elapsed times, project updates, etc.).
If nothing is updating dynamically anywhere on the app during job runs then this points to a potentially deeper configuration issue. If this is the case for you it might be worth opening a separate github issue (or visiting our IRC channel) to help in tracking your specific problem down, as there are many different potential underlying causes for socket connectivity issues.
The closed sockets I am talking about are all in this thread. Closed websockets. I notice closed websockets after an unspecified time (it's not always the same) when I try to view job details and also jobs that are running / have run. This does not mean it never shows; sometimes a full container restart lets everything show again. I hope the upcoming upstream changes will help :)
So I've found the reason for my issues and why I couldn't see the job details. It was down to the Chrome version I had installed.
Version 61.0.3163.79 caused issues where the 'working' wheel was just spinning. Upgrading to 67.0.3396.99 fixed these issues and I can now see the job details.
@grahamn-gr Thanks for your answer, I updated Chrome to the newest version and the problem is solved!
It sounds like a number of people are having better luck with a newer version of Chrome, though from the variety of comments, it feels like this ticket has become a catch-all for any sort of odd bug related to the job details page.
I'm going to go ahead and close this; if anybody continues to encounter issues in 1.0.7, please let us know by filing a new issue with details.
@ryanpetrello jfyi still facing this issue, version 1.0.7.2
@boris-42 can you provide the environment details from https://github.com/ansible/awx/issues/new?template=bug_report.md, including web browser version?
@ryanpetrello
Some observation:
It sounds to me like job events aren't being saved into the database. This can be caused by a number of things. Do you see anything when you visit /api/v2/jobs/N/event/?
@ryanpetrello I suspect you meant jobs_events.
it returns
{
"count": 0,
"next": null,
"previous": null,
"results": []
}
If I restart awx-task and awx-web this information gets populated. And it continues working until we see in awx-task that log message related to rabbitmq
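The empty response above is the signal that no job events were saved for that job. As a minimal sketch, assuming you have already fetched the response body as text (the helper name `summarize_job_events` is hypothetical and not part of AWX), the check could be scripted like this:

```python
import json


def summarize_job_events(payload):
    """Return (count, has_events) for a paginated job-events response body."""
    data = json.loads(payload)
    count = int(data.get("count", 0))
    results = data.get("results") or []
    # Treat the job as having events if either field says so.
    return count, bool(count or results)


# The body reported in the comment above.
empty_response = '{"count": 0, "next": null, "previous": null, "results": []}'
print(summarize_job_events(empty_response))
```

A result of `(0, False)` for a job that has finished running would match the symptom described here: the job ran, but nothing reached the database.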
Yep, that's exactly what I meant, thanks :)
In your awx task container, can you run:
supervisorctl -c /supervisor_task.conf status
@ryanpetrello
bash-4.2$ supervisorctl -c /supervisor_task.conf status
awx-config-watcher RUNNING pid 195, uptime 12:38:18
tower-processes:callback-receiver RUNNING pid 199, uptime 12:38:18
tower-processes:celery RUNNING pid 196, uptime 12:38:18
tower-processes:celery-watcher RUNNING pid 198, uptime 12:38:18
tower-processes:channels-worker RUNNING pid 197, uptime 12:38:18
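Every process above reports RUNNING, which is why the status output alone does not expose the problem. If you want to check this repeatedly, here is a small sketch that parses the text printed by `supervisorctl status` and lists anything that is not RUNNING. The function names are illustrative, and the parsing assumes the "name STATE details" layout shown above:

```python
def parse_supervisor_status(output):
    """Map each process name to its state from `supervisorctl status` text."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def not_running(output):
    """Return the sorted names of processes whose state is not RUNNING."""
    states = parse_supervisor_status(output)
    return sorted(name for name, state in states.items() if state != "RUNNING")


sample = """\
awx-config-watcher RUNNING pid 195, uptime 12:38:18
tower-processes:callback-receiver RUNNING pid 199, uptime 12:38:18
tower-processes:celery RUNNING pid 196, uptime 12:38:18
tower-processes:celery-watcher RUNNING pid 198, uptime 12:38:18
tower-processes:channels-worker RUNNING pid 197, uptime 12:38:18
"""
print(not_running(sample))
```

An empty list here only says that supervisor considers the processes alive; a process can be RUNNING and still be stuck on a dead broker connection.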
@ryanpetrello
Some more information:
@ryanpetrello Some more details. The bug is reproducible on many versions of AWX.
If I run /usr/bin/awx-manage run_callback_receiver in the task container, all results get sent to the database...
More interesting thing is this piece of code: https://github.com/ansible/awx/blob/devel/awx/main/management/commands/run_callback_receiver.py#L233-L238
If something happens to RabbitMQ and the connection breaks, it is not recreated. On the other hand, there is a large try/except in the code that uses the connection, which keeps run_callback_receiver from crashing, so supervisor never gets the chance to bring it back...
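To illustrate the failure mode being suspected here (this is a generic sketch, not AWX's actual callback receiver code): a consumer loop should either rebuild a broken connection or let the error escape so the supervisor can restart the process. `BrokenConnection`, `consume_forever`, and `make_flaky_connect` are all hypothetical names used only for this example.

```python
class BrokenConnection(Exception):
    """Hypothetical stand-in for a broker connection error."""


def consume_forever(connect, handle, max_reconnects=3):
    """Drain messages, rebuilding the connection whenever it breaks.

    `connect` returns an iterable of messages; iterating it may raise
    BrokenConnection. Returns the number of reconnects that were needed.
    """
    reconnects = 0
    while True:
        conn = connect()  # always start from a fresh connection
        try:
            for message in conn:
                handle(message)
            return reconnects  # stream ended cleanly
        except BrokenConnection:
            reconnects += 1
            if reconnects > max_reconnects:
                raise  # give up so a supervisor can restart the process


def make_flaky_connect():
    """Build a fake `connect` whose first connection breaks after one message."""
    attempts = []

    def connect():
        attempts.append(1)

        def stream():
            if len(attempts) == 1:
                yield "event-1"
                raise BrokenConnection("connection reset by peer")
            yield "event-2"

        return stream()

    return connect


handled = []
reconnect_count = consume_forever(make_flaky_connect(), handled.append)
print(handled, reconnect_count)
```

The point of the sketch is the `except` branch: swallowing the error without reconnecting or re-raising would leave the process alive but idle, which is the behavior described above.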
@boris-42 the example you linked is catching KeyboardInterrupt - I'd expect the callback receiver to gracefully handle and recover from AMQP unavailability in the way you described (testing this a bit myself).
I'm having a hard time reproducing this by stopping RabbitMQ - the callback receiver recovers for me after stopping and starting the message broker:
It also seems resilient to me screwing w/ TCP via tcpkill:
@boris-42 do you see any logs in the task container for the callback receiver that might provide some hints?
IMHO, I don't know why this issue is closed when it is still happening, even with the recent versions.
@josemgom the reason it's closed is that the original reporter described their issue and found a solution to it here: https://github.com/ansible/awx/issues/1861#issuecomment-388286258
(also, see: https://github.com/ansible/awx/issues/1861#issuecomment-415033350)
The number of people chiming in on this one has generated a lot of noise; it's likely people are encountering a number of issues across a variety of configurations that are being conflated:
X-Forwarded-For configuration
If you're still encountering an issue with the job details page, and you're using the most recent version of awx, and none of the suggestions in this comment thread have addressed it for you, then please open a new issue with as much detail as possible about the problem you're encountering: https://github.com/ansible/awx/issues/new?template=bug_report.md
In the meantime, I and other awx maintainers are happy to help as much as possible here (see my and others' various interactions with people above) and in our IRC room on freenode (#awx-devel).
@ryanpetrello you are back ! =)
Steps to reproduce:
Hey @boris-42,
Do you see any logs in the task container for the callback receiver that might provide some hints? Errors/exceptions/tracebacks?
@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we think we might have an idea of what's causing this issue. If any of you are feeling like experimenting, could you give this PR a try in your environments to see if it improves things?
https://github.com/ansible/awx/pull/2391
Alternatively, you could try running something like this (in all of your containers) and then restarting awx services to get the latest version:
~ /var/lib/awx/venv/awx/bin/pip uninstall asgi-amqp
~ /var/lib/awx/venv/awx/bin/pip install "asgi-amqp==1.1.2"
@ryanpetrello Thanks, I'll try to patch container this weekend!
Thanks @ryanpetrello
I just upgraded the package in my development and production envs. I'll let you know if users are still facing this issue.
Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2
brought in a newer version of kombu, 4.2.1, which starts breaking daphne/celery badly.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/bin/daphne", line 11, in <module>
sys.exit(CommandLineInterface.entrypoint())
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 144, in entrypoint
cls().run(sys.argv[1:])
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 174, in run
channel_layer = importlib.import_module(module_path)
File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/usr/lib/python2.7/site-packages/awx/asgi.py", line 9, in <module>
prepare_env() # NOQA
File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 55, in prepare_env
if not settings.DEBUG: # pragma: no cover
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 56, in __getattr__
self._setup(name)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 41, in _setup
self._wrapped = Settings(settings_module)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 110, in __init__
mod = importlib.import_module(self.SETTINGS_MODULE)
File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/usr/lib/python2.7/site-packages/awx/settings/production.py", line 17, in <module>
from defaults import * # NOQA
File "/usr/lib/python2.7/site-packages/awx/settings/defaults.py", line 7, in <module>
import djcelery
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/djcelery/__init__.py", line 34, in <module>
from celery import current_app as celery # noqa
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/five.py", line 312, in __getattr__
module = __import__(self._object_origins[name], None, None, [name])
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/_state.py", line 20, in <module>
from celery.utils.threads import LocalStack
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/__init__.py", line 405, in <module>
from .functional import chunks, noop # noqa
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/functional.py", line 19, in <module>
from kombu.utils.compat import OrderedDict
ImportError: cannot import name OrderedDict
Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 kombu==3.0.37
and holding back kombu appears to have worked. No more Connection reset by peer errors and the job details load!
ENVIRONMENT
AWX version: 2.0.0
AWX install method: docker on linux
Ansible version: 2.6.5
Operating System: Ubuntu 18.04
Web Browser: Firefox/Chrome
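The traceback above fails on `from kombu.utils.compat import OrderedDict`. As a hedged sketch, a quick pre-flight check for that kind of breakage could be run with the venv's Python before restarting services. The helper `can_import` is a hypothetical name, and the demo call uses a standard-library module so it runs anywhere:

```python
import importlib


def can_import(module_name, attr):
    """Return True if `module_name` imports and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


# Demonstration against the standard library.
print(can_import("collections", "OrderedDict"))
```

Inside the AWX virtualenv, calling `can_import("kombu.utils.compat", "OrderedDict")` would be the equivalent check for the import that the traceback shows failing with kombu 4.2.1.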
@taspotts thanks for the feedback. We've merged the asgi_amqp update and are planning to release it in a new version of awx in the near future.
@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we've released a new version of awx, 2.0.1, which we believe should resolve this issue. Please give it a shot and let us know if you continue to encounter issues!
I also had this error and verified that it was fixed in the latest released docker-image.
Thanks for addressing this issue!
Closing this, please reopen if it persists.
@ryanpetrello thanks for fixing this, I checked it finally yesterday, everything works.
ISSUE TYPE
COMPONENT NAME
SUMMARY
Job details and Job view not working properly
ENVIRONMENT
STEPS TO REPRODUCE
Run any playbook, failed and succeeded jobs are present but not showing any details.
EXPECTED RESULTS
Details from jobs
ACTUAL RESULTS
Nothing is showing, no errors, no timeouts, just nothing
ADDITIONAL INFORMATION
For example I have a failed job. When clicking on details, I can see the URL changing to: https://awx-url/#/jobz/project/
However nothing happens. When using right mouse button and opening in new tab/page I will only get the navigation pane and a blank page.
The same happens when I click on the job itself.
Additionally, adding inventory sources works fine; however, when navigating to 'Schedule inventory sync' I can see the gear-wheel spinning but nothing happens. I did a fresh installation today (9th May)