Job details and Job view not working

Borrelworst commented 6 years ago

ISSUE TYPE

Bug Report

COMPONENT NAME

UI

SUMMARY

Job details and Job view not working properly

ENVIRONMENT

AWX version: 1.0.6.5
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: RedHat 7.4
Web Browser: Firefox/Chrome

STEPS TO REPRODUCE

Run any playbook. Failed and succeeded jobs are listed, but none of them show any details.

EXPECTED RESULTS

Details from jobs

ACTUAL RESULTS

Nothing is showing, no errors, no timeouts, just nothing

ADDITIONAL INFORMATION

For example, I have a failed job. When I click on details, I can see the URL change to: https://awx-url/#/jobz/project/ However, nothing happens. When I right-click and open it in a new tab, I only get the navigation pane and a blank page. The same happens when I click on the job itself.

Additionally, adding inventory sources works fine; however, when navigating to 'Schedule inventory sync' I can see the gear wheel spinning, but again nothing happens. I did a fresh installation today (9th May).

grahamneville commented 6 years ago

Any suggestions on what can be done to troubleshoot this further please?

SamKirsch10 commented 6 years ago

Also having this problem in k8s. I tried a few things listed here, but I still randomly get closed sockets, even when directly connected to the web container. If there are any debugging steps to run, I can do so.

jakemcdermott commented 6 years ago

I'm unclear on what might be causing the closed sockets SamKirsch mentioned, but that sounds like a deeper, different issue and one not entirely constrained to the job details page?

There are some race conditions involving setting up the initial connection to the job details page that have been resolved downstream and will be landing in AWX shortly.

These changes might resolve some of the issues mentioned by others above - one way to know if they will help is if you're currently still able to see dynamic updates to socket-driven content other than the incoming output lines (status icons, elapsed times, project updates, etc.).

If nothing is updating dynamically anywhere on the app during job runs then this points to a potentially deeper configuration issue. If this is the case for you it might be worth opening a separate github issue (or visiting our IRC channel) to help in tracking your specific problem down, as there are many different potential underlying causes for socket connectivity issues.
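One way to narrow down a socket connectivity problem is to check whether a WebSocket upgrade request reaches the server at all. The sketch below is a rough probe using only the standard library, not an AWX tool. The `/websocket/` path, port 443, and the function names are assumptions; an unauthenticated probe may also be rejected by the server even when the proxy path is fine, so a failure here is a hint rather than a diagnosis.

```python
import base64
import hashlib
import os
import socket
import ssl

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key):
    """Return the Sec-WebSocket-Accept value a compliant server sends for `key`."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake(host, path, key):
    """Build a minimal HTTP/1.1 WebSocket upgrade request."""
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),
        "Upgrade: websocket",
        "Connection: Upgrade",
        "Sec-WebSocket-Key: {}".format(key),
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def handshake_ok(response_head, key):
    """True if the response is a 101 with the matching accept header."""
    lines = response_head.split("\r\n")
    status = lines[0].split(" ", 2)
    if len(status) < 2 or status[1] != "101":
        return False
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return headers.get("sec-websocket-accept") == expected_accept(key)


def probe(host, port=443, path="/websocket/"):
    """Open a TLS connection and attempt the upgrade (path is an assumption)."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    raw = socket.create_connection((host, port), timeout=10)
    with ssl.create_default_context().wrap_socket(raw, server_hostname=host) as conn:
        conn.sendall(build_handshake(host, path, key))
        # A single recv is enough for a sketch; a robust client would read
        # until the blank line that ends the response headers.
        head = conn.recv(4096).decode("latin-1")
    return handshake_ok(head, key)
```

If `probe("awx-url")` returns False while the same check succeeds directly against the web container, a proxy or load balancer between the browser and AWX is a likely suspect.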

SamKirsch10 commented 6 years ago

The closed sockets I am talking about are all in this thread: closed websockets. I notice closed websockets after an unspecified time (it's not always the same) when I try to view job details and also jobs that are running / have run. This does not mean it never shows; sometimes a full container restart lets everything show again. I hope the upcoming upstream changes will help :)

grahamn-gr commented 6 years ago

So I've found the reason for my issues and why I couldn't see the job details. It was down to the chrome version I had installed.

61.0.3163.79 caused issues where the 'working' wheel was just spinning. Upgrading to 67.0.3396.99 fixed these issues and I can now see the job details.

dadudu81 commented 6 years ago

@grahamn-gr Thanks for your answer. I updated my Chrome to the newest version and the problem is solved!

ryanpetrello commented 6 years ago

It sounds like a number of people are having better luck with a newer version of Chrome, though from the variety of comments, it feels like this ticket has become a catch-all for any sort of odd bug related to the job details page.

I'm going to go ahead and close this; if anybody continues to encounter issues in 1.0.7, please let us know by filing a new issue with details.

boris-42 commented 6 years ago

@ryanpetrello jfyi still facing this issue, version 1.0.7.2

ryanpetrello commented 6 years ago

@boris-42 can you provide the environment details from https://github.com/ansible/awx/issues/new?template=bug_report.md, including web browser version?

boris-42 commented 6 years ago

@ryanpetrello

We are using the official image, 1.0.7.2
The web browser is not the problem (we tried different browsers on different operating systems)

Some observations:

If we curl "api/v2/jobs//stdout/", it's empty
After a restart of awx web and awx task, it gets populated
In the logs of awx task we see " File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status", the same as in one of the comments above
After restarting, it works for ~15 minutes
It seems like a problem between awx-task and rabbitmq...

ryanpetrello commented 6 years ago

It sounds to me like job events aren't being saved into the database. This can be caused by a number of things. Do you see anything when you visit /api/v2/jobs/N/event/?

boris-42 commented 6 years ago

@ryanpetrello I suspect you meant jobs_events.

It returns

{
  "count": 0, 
  "next": null, 
  "previous": null, 
  "results": []
}

If I restart awx-task and awx-web, this information gets populated. It then keeps working until the rabbitmq-related log message shows up in awx-task.
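The check above can be scripted so it is easy to repeat before and after a restart. This is a minimal sketch, not an official client: the `job_events` path segment, the use of HTTP Basic auth, and the function names are assumptions to adjust for your deployment.

```python
import base64
import json
import urllib.request


def events_saved(payload):
    """True if a paginated API response reports at least one saved event."""
    return payload.get("count", 0) > 0


def fetch_job_events(base_url, job_id, username, password):
    """Fetch the first page of events for a job (endpoint path is an assumption)."""
    url = "{}/api/v2/jobs/{}/job_events/".format(base_url.rstrip("/"), job_id)
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

An empty response like the one pasted above makes `events_saved` return False, which points at events not reaching the database rather than at the UI.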

ryanpetrello commented 6 years ago

Yep, that's exactly what I meant, thanks :)

In your awx task container, can you run:

supervisorctl -c /supervisor_task.conf status

boris-42 commented 6 years ago

@ryanpetrello

bash-4.2$ supervisorctl -c /supervisor_task.conf status
awx-config-watcher                  RUNNING   pid 195, uptime 12:38:18
tower-processes:callback-receiver   RUNNING   pid 199, uptime 12:38:18
tower-processes:celery              RUNNING   pid 196, uptime 12:38:18
tower-processes:celery-watcher      RUNNING   pid 198, uptime 12:38:18
tower-processes:channels-worker     RUNNING   pid 197, uptime 12:38:18
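For repeated checks, the status output can be parsed to flag anything that is not RUNNING. This is a small sketch with a hypothetical helper name; it only rules out crashed processes. As the output above shows, every process can report RUNNING while events are still not being saved.

```python
# Process states documented by supervisor.
SUPERVISOR_STATES = {
    "STOPPED", "STARTING", "RUNNING", "BACKOFF",
    "STOPPING", "EXITED", "FATAL", "UNKNOWN",
}


def not_running(status_output):
    """Return the names of supervisor processes whose state is not RUNNING."""
    flagged = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip prompts and other lines that are not "<name> <STATE> ..." rows.
        if len(fields) < 2 or fields[1] not in SUPERVISOR_STATES:
            continue
        if fields[1] != "RUNNING":
            flagged.append(fields[0])
    return flagged
```

Feed it the text printed by `supervisorctl -c /supervisor_task.conf status`; an empty list means supervisor considers every process healthy.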

boris-42 commented 6 years ago

@ryanpetrello

Some more information:

If I create a schedule and run jobs every 3-5 minutes, it works perfectly
If I create a schedule and run jobs with a gap of 20 minutes, it stops working

boris-42 commented 6 years ago

@ryanpetrello Some more details. The bug reproduces on many versions of AWX.

If I run /usr/bin/awx-manage run_callback_receiver in the task container, all results get sent to the database...

More interesting thing is this piece of code: https://github.com/ansible/awx/blob/devel/awx/main/management/commands/run_callback_receiver.py#L233-L238

If something happens to rabbitmq and the connection breaks, it's not recreated. On the other hand, there is a large try/except around the code that uses the connection, which keeps run_callback_receiver from crashing, so supervisor never brings it back...
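The failure mode described here can be sketched with a toy model. This is not AWX code: `FlakyConnection`, `ConnectionLost`, `consume_swallowing`, and `consume_reconnecting` are hypothetical names, and whether the callback receiver really behaves this way is only a hypothesis at this point in the thread.

```python
class ConnectionLost(Exception):
    """Stand-in for a broker connection error (hypothetical)."""


class FlakyConnection:
    """Toy connection: delivers queued messages unless it has been dropped."""

    def __init__(self, queue, dropped=False):
        self.queue = queue
        self.dropped = dropped

    def get(self):
        if self.dropped:
            raise ConnectionLost("connection reset by peer")
        return self.queue.pop(0) if self.queue else None


def consume_swallowing(conn, handle, polls):
    """Hypothesized failure mode: errors are ignored and the dead connection is reused."""
    for _ in range(polls):
        try:
            message = conn.get()
            if message is not None:
                handle(message)
        except Exception:
            # The process stays alive, so a supervisor has no reason to restart it.
            pass


def consume_reconnecting(connect, handle, polls):
    """Alternative: replace the connection whenever it reports a failure."""
    conn = connect()
    for _ in range(polls):
        try:
            message = conn.get()
            if message is not None:
                handle(message)
        except ConnectionLost:
            conn = connect()
```

With a dropped connection, `consume_swallowing` never handles anything and never exits, which would match the symptom of a RUNNING receiver that saves no events. `consume_reconnecting` picks the queue back up once `connect` returns a working connection.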

ryanpetrello commented 6 years ago

@boris-42 the example you linked is catching KeyboardInterrupt - I'd expect the callback receiver to gracefully handle and recover from AMQP unavailability in the way you described (testing this a bit myself).

ryanpetrello commented 6 years ago

I'm having a hard time reproducing this by stopping RabbitMQ - the callback receiver recovers for me after stopping and starting the message broker.

It also seems resilient to me screwing w/ TCP via tcpkill.

ryanpetrello commented 6 years ago

@boris-42 do you see any logs in the task container for the callback receiver that might provide some hints?

josemgom commented 6 years ago

IMHO, I don't see why this issue is closed when it is still happening, even with recent versions.

ryanpetrello commented 6 years ago

@josemgom the reason it's closed is that the original reporter described their issue and found a solution to it here: https://github.com/ansible/awx/issues/1861#issuecomment-388286258

(also, see: https://github.com/ansible/awx/issues/1861#issuecomment-415033350)

The number of people chiming in on this one has generated a lot of noise; it's likely people are encountering a number of issues across a variety of configurations that are being conflated:

some people are using older awx versions with resolved bugs
some are deploying behind a proxy and needed additional X-Forwarded-For configuration
some have reported that things work better with a newer version of Chrome

If you're still encountering an issue with the job details page, and you're using the most recent version of awx, and none of the suggestions in this comment thread have addressed it for you, then please open a new issue with as much detail as possible about the problem you're encountering: https://github.com/ansible/awx/issues/new?template=bug_report.md

In the meantime, I and other awx maintainers are happy to help as much as possible here (see my and others' various interactions with people above) and in our IRC room on freenode (#awx-devel).

boris-42 commented 6 years ago

@ryanpetrello you are back ! =)

Steps to reproduce:

My production deployment is running on top of k8s and looks like this:
-- awx-rabbitmq is a statefulset with 3 replicas
-- memcached and postgres are 2 deployments
-- awx-web is coupled with awx-task in the same pod as part of one deployment (there is some bug that we are still debugging that is blocking us from decoupling)
After deploying everything, don't touch anything for 15+ minutes
Run any job template (the demo one, for example)
You won't see the logs in the output
If you restart the callback receiver, the logs are populated
(if you don't run anything for the next 15 minutes, the issue reproduces again)

ryanpetrello commented 6 years ago

Hey @boris-42,

Do you see any logs in the task container for the callback receiver that might provide some hints? Errors/exceptions/tracebacks?

ryanpetrello commented 6 years ago

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we think we might have an idea of what's causing this issue. If any of you are feeling like experimenting, could you give this PR a try in your environments to see if it improves things?

https://github.com/ansible/awx/pull/2391

Alternatively, you could try running something like this (in all of your containers) and then restarting awx services to get the latest version:

~ /var/lib/awx/venv/awx/bin/pip uninstall asgi-amqp
~ /var/lib/awx/venv/awx/bin/pip install "asgi-amqp==1.1.2"

boris-42 commented 6 years ago

@ryanpetrello Thanks, I'll try to patch the container this weekend!

josemgom commented 6 years ago

Thanks @ryanpetrello

I just upgraded the package in my development and production envs. I'll let you know if users are still facing this issue.

taspotts commented 6 years ago

Running: /var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 brought in a newer version of kombu (4.2.1), which starts breaking daphne/celery badly.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/bin/daphne", line 11, in <module>
    sys.exit(CommandLineInterface.entrypoint())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 144, in entrypoint
    cls().run(sys.argv[1:])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 174, in run
    channel_layer = importlib.import_module(module_path)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/asgi.py", line 9, in <module>
    prepare_env() # NOQA
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 55, in prepare_env
    if not settings.DEBUG: # pragma: no cover
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 56, in __getattr__
    self._setup(name)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 41, in _setup
    self._wrapped = Settings(settings_module)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 110, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/settings/production.py", line 17, in <module>
    from defaults import *  # NOQA
  File "/usr/lib/python2.7/site-packages/awx/settings/defaults.py", line 7, in <module>
    import djcelery
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/djcelery/__init__.py", line 34, in <module>
    from celery import current_app as celery  # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/five.py", line 312, in __getattr__
    module = __import__(self._object_origins[name], None, None, [name])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/_state.py", line 20, in <module>
    from celery.utils.threads import LocalStack
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/__init__.py", line 405, in <module>
    from .functional import chunks, noop                    # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/functional.py", line 19, in <module>
    from kombu.utils.compat import OrderedDict
ImportError: cannot import name OrderedDict

Running: /var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 kombu==3.0.37 and holding back kombu appears to have worked. No more Connection reset by peer errors and the job details load!
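To confirm that the pins actually took effect in each container, the output of `pip freeze` from the AWX virtualenv can be compared against the expected versions. This is a small sketch with a hypothetical helper name; it runs on any Python 3 and only inspects text, so it does not need to execute inside the venv itself.

```python
def check_pins(freeze_output, expected):
    """Return {package: installed_version_or_None} for every pin that does not match."""
    installed = {}
    for line in freeze_output.splitlines():
        if "==" in line:
            name, version = line.strip().split("==", 1)
            # Normalize names so "asgi_amqp" and "asgi-amqp" compare equal.
            installed[name.lower().replace("_", "-")] = version
    mismatches = {}
    for name, wanted in expected.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = found
    return mismatches
```

Pass it the text printed by `/var/lib/awx/venv/awx/bin/pip freeze` together with `{"asgi-amqp": "1.1.2", "kombu": "3.0.37"}`; an empty result means both versions match the working combination reported here.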

ENVIRONMENT

AWX version: 2.0.0
AWX install method: docker on linux
Ansible version: 2.6.5
Operating System: Ubuntu 18.04
Web Browser: Firefox/Chrome

ryanpetrello commented 6 years ago

@taspotts thanks for the feedback. We've merged the asgi_amqp update and are planning to release it in a new version of awx in the near future.

ryanpetrello commented 6 years ago

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we've released a new version of awx, 2.0.1, which we believe should resolve this issue. Please give it a shot and let us know if you continue to encounter issues!

nightvisi0n commented 6 years ago

I also had this error and verified that it was fixed in the latest released docker-image.
Thanks for addressing this issue!

wenottingham commented 6 years ago

Closing this, please reopen if it persists.

boris-42 commented 6 years ago

@ryanpetrello thanks for fixing this. I finally checked it yesterday, and everything works.
