ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Job details and Job view not working #1861

Closed Borrelworst closed 6 years ago

Borrelworst commented 6 years ago
ISSUE TYPE
COMPONENT NAME
SUMMARY

Job details and Job view not working properly

ENVIRONMENT
STEPS TO REPRODUCE

Run any playbook; failed and succeeded jobs are present but do not show any details.

EXPECTED RESULTS

Details from jobs

ACTUAL RESULTS

Nothing is showing, no errors, no timeouts, just nothing

ADDITIONAL INFORMATION

For example, I have a failed job. When clicking on details, I can see the URL changing to: https://awx-url/#/jobz/project/ However, nothing happens. When using the right mouse button and opening in a new tab/page, I only get the navigation pane and a blank page. The same happens when I click on the job itself.

Additionally, adding inventory sources works fine; however, when navigating to 'Schedule inventory sync' I can see the gear-wheel spinning, but again nothing happens. I did a fresh installation today (9th May).

matthew-hickok commented 6 years ago

I am experiencing the same issue.

anasypany commented 6 years ago

What are you using for a proxy in front of AWX? Do you have your awx_web container bound to 0.0.0.0:port or 127.0.0.1:port? I was experiencing the same issue while accessing AWX behind an nginx proxy running on the Linux host, and noticed that when the proxy was disabled, the Job detail pages would display properly. After I set the awx_web container to listen on 127.0.0.1, I was no longer experiencing the issue. To set the awx_web container to 127.0.0.1, you can specify host_port=127.0.0.1:port (instead of host_port=port) in the installer inventory file.
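Concretely, the change in the installer inventory looks like this (port 80 here is illustrative; use whatever host port you expose), followed by a re-run of the install playbook so the container is recreated with the new binding:

    host_port=127.0.0.1:80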

cstuart1 commented 6 years ago

I'm having the same issue where the job details will not display (also running with a proxy in front of awx). Adjusting the awx_web container to listen on 127.0.0.1 did not resolve the issue. Prior to upgrading to 1.0.6.5 this was working properly.

ENVIRONMENT

AWX version: 1.0.6.5
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Ubuntu 16.04
Web Browser: Firefox/Chrome

In developer tools I'm seeing this error: WebSocket connection to 'wss://<>/websocket/' failed: WebSocket is closed before the connection is established.

where <> is the correct URI of my instance.

"/#/jobs?job_search=page_size:20;order_by:-finished;not__launch_type:sync:1 /#/jobz/inventory/33:1". I am also usning Nginx as a front end proxy (port 443).

Borrelworst commented 6 years ago

Thanks for the tip @anasypany and for trying this solution @cstuart1. I indeed also use nginx as a front-end proxy, as I need SSL and port 443. What I haven't tried yet is connecting directly to the awx_web container via an ssh tunnel. If the issue still persists then, it is in the application itself. I will not be able to test this today, but it will be the first thing I do tomorrow morning.

anasypany commented 6 years ago

@cstuart1 Can you paste your nginx proxy config? (with censored environment details, of course)

cstuart1 commented 6 years ago

@Borrelworst The solution here is to add a block for the websocket in your Nginx config:

location /websocket {
    proxy_pass http://x.x.x.x:80;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
}

@anasypany this is probably what you were going to suggest/inquire about?

anasypany commented 6 years ago

@cstuart1 I was able to get the job details pages working again with this simple nginx proxy config once awx_web was bound to 127.0.0.1:

location / {
    proxy_pass http://127.0.0.1:xxxx;  # xxxx = 80 in your case
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

If you try this config make sure to add HTTP_X_FORWARDED_FOR in your Remote Host Headers on AWX as well. Let me know if you have any luck!

cstuart1 commented 6 years ago

Yes, that resolved the issue for me. I had already added HTTP_X_FORWARDED_FOR to AWX as I'm using SAML for auth.

For anyone else reading this thread and trying to set up SAML: I also had to alter /etc/tower/settings.py (task and web) to have the following:

    USE_X_FORWARDED_PORT = True
    USE_X_FORWARDED_HOST = True

and restart Tower after making the setting change. This is mentioned in the Tower documentation, but I thought I would post it here in case someone else reads this thread.

Borrelworst commented 6 years ago

@cstuart1: That indeed solved the issue. I have not bound awx_web explicitly to 127.0.0.1, and apparently that is not needed. The only issue I still see is that when I go to my custom inventory scripts and click on schedule inventory syncs, I just see the cog wheel, but nothing happens. This is also described in #1850.

nmpacheco commented 6 years ago

I am also experiencing problems with job details. I deployed a stack with postgres, rabbitmq, memcache, awx_web and awx_task in a swarm (an ansible role to check variables, create dirs, instantiate a docker-compose template, deploy and so on). I am using vfarcic's docker-flow to provide access to all the services in the swarm and to automatically detect changes in the configuration and reflect those changes in the proxy configuration. Within this stack, only awx_web is given access outside the swarm via the docker-flow stack. All works well except that the websocket of the job listing and details works only during rare intervals, usually after repeatedly killing daphne and nginx inside the awx_web container. Debugging in the browser, I can see a bunch of websocket upgrades being tried, all of them failing with "502 Bad Gateway" after 5-6 seconds. At the same time, for each of the failing websocket attempts, a message like the one below appears in the awx_web log:

2018/05/16 23:36:18 [error] 31#0: *543 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <internal proxy ip>, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "<my specific virtual host>"

Occasionally, the following messages are also printed in the same log:

127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECTING /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECT /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:55] "WSDISCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECTING /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:56] "WSDISCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECTING /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:21] "WSDISCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECTING /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:25:05] "WSDISCONNECT /websocket/" - -
127.0.0.1:34510 - - [16/May/2018:22:42:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:43] "WSCONNECTING /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:42:57] "WSCONNECTING /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:43:02] "WSDISCONNECT /websocket/" - -
(...)
127.0.0.1:35964 - - [16/May/2018:23:35:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:52] "WSCONNECTING /websocket/" - -
127.0.0.1:37312 - - [16/May/2018:23:35:52] "WSDISCONNECT /websocket/" - -
127.0.0.1:37412 - - [16/May/2018:23:35:57] "WSCONNECTING /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:57] "WSDISCONNECT /websocket/" - -

The haproxy config generated by docker-flow for this service (awx_web) is:

frontend services
(...)
    acl url_awx-stack_awxweb8052_0 path_beg /
    acl domain_awx-stack_awxweb8052_0 hdr_beg(host) -i <my specific virtual host>
    use_backend awx-stack_awxweb-be8052_0 if url_awx-stack_awxweb8052_0 domain_awx-stack_awxweb8052_0
(...)
backend awx-stack_awxweb-be8052_0
    mode http
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    http-request add-header X-Forwarded-For %[src]
    http-request add-header X-Client-IP %[src]
    http-request add-header Upgrade "websocket"
    http-request add-header Connection "upgrade"
    server awx-stack_awxweb awx-stack_awxweb:8052

It is very similar to a bunch of other services in the swarm. As far as I can understand, the upstream referenced in the message above refers to daphne inside the awx_web container; that daphne instance is listening on http://127.0.0.1:8051 and is "called" by the proxy configuration of the nginx also running inside the same container. I am currently investigating how one can troubleshoot daphne. I would appreciate it if anyone could help me with some ideas or guidelines to proceed with the investigation. Thanks!
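One way to probe daphne directly, bypassing all the proxy layers (a sketch; it assumes curl is available inside the awx_web container, and the Sec-WebSocket-Key value is just an arbitrary base64 string):

    # Attempt a raw websocket handshake against daphne; a healthy daphne
    # should answer with "HTTP/1.1 101 Switching Protocols"
    curl -i -N \
      -H "Connection: Upgrade" \
      -H "Upgrade: websocket" \
      -H "Sec-WebSocket-Version: 13" \
      -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
      http://127.0.0.1:8051/websocket/

If the handshake stalls here as well, the problem is inside the container (daphne or its channel layer) rather than in the swarm proxying.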

leweafan commented 6 years ago

I'm experiencing the same issue.

ENVIRONMENT

AWX version: 1.0.6.8
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Debian 9
Web Browser: Firefox/Chrome

mkoshevoi commented 6 years ago

I have the same issue as well.

Rpera commented 6 years ago

Hi, I had the same issue and I was able to get the job output by running this command to fix the permissions:

matthew-hickok commented 6 years ago

Since most of these comments are related to proxy configurations, I should probably mention that I have the same issue but I do not have a proxy in front of mine.

SatiricFX commented 6 years ago

I'm experiencing the same issue as well. Initially it will work fine; I noticed restarting the containers/docker resolves the issue. I will monitor it to determine if the issue occurs again, which I assume it will.

cavamagie commented 6 years ago

Same error. I use nginx with a configuration similar to @anasypany's:

location / {
    proxy_pass http://127.0.0.1:8052;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

but I'm unable to see the job details.

bogdansharuk commented 6 years ago

@cavamagie Your block appears to be missing proxy_http_version 1.1, which nginx needs to proxy websocket upgrades. Compare with my setup:

ENVIRONMENT

cat awx/installer/inventory

host_port=127.0.0.1:9999

location / {
    proxy_pass http://127.0.0.1:9999/;
    proxy_http_version 1.1;
    proxy_set_header Host               $host;
    proxy_set_header X-Real-IP          $remote_addr;
    proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto  $scheme;
    proxy_set_header Upgrade            $http_upgrade;
    proxy_set_header Connection         "upgrade";
} 

It works for me

sudomateo commented 6 years ago

@cstuart1 Do you think we can chat out of band regarding SAML setup with AWX? I've been at this for hours with no success.

Edit: I commented on #1016 with details on how to configure AWX for use with SAML auth.

piroux commented 6 years ago

Same issue. @SatiricFX, I have noticed the same thing: restarting the docker containers usually helps. Moreover, I am not using any proxy or HTTPS access.

SatiricFX commented 6 years ago

@piroux That resolves it for us as well, temporarily. We haven't found a permanent fix for it. Maybe a bug.

strawgate commented 6 years ago

It appears you can edit supervisor.conf to add verbose output to daphne:

[program:daphne]
command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 awx.asgi:channel_layer -v 2

With this I am seeing the following behavior related to websockets from Daphne/nginx:

2018-06-27 03:18:59,295 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!BfsxXxiUPF to WebSocket daphne.response.XbupPxYRcS!ReBXomhGtg
RESULT 2
OKREADY
10.255.0.2 - - [27/Jun/2018:03:19:02 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:03,491 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!ReBXomhGtg
2018-06-27 03:19:21,372 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!aPmLgJGDZd to WebSocket daphne.response.XbupPxYRcS!hTzJudfDoM
10.255.0.2 - - [27/Jun/2018:03:19:24 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:25,571 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!hTzJudfDoM
2018-06-27 03:19:50,862 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!lnvEJzPynj to WebSocket daphne.response.XbupPxYRcS!XCyaFNijYM
10.255.0.2 - - [27/Jun/2018:03:19:53 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:53,999 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!XCyaFNijYM
RESULT 2
OKREADY

This eventually logs:

2018-06-27 03:34:03,939 WARNING  dropping connection to peer tcp4:127.0.0.1:34576 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
10.255.0.2 - - [27/Jun/2018:03:34:03 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018/06/27 03:34:03 [error] 32#0: *147 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "localhost:8080"
2018-06-27 03:34:03,941 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!gbrIRtuqeq

DBLaci commented 6 years ago

awx_web:1.0.6.23 here:

10.255.0.2 - - [28/Jun/2018:13:31:14 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
2018/06/28 13:31:14 [error] 25#0: *440 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "awx.prmrgt.com:80"
10.255.0.2 - - [28/Jun/2018:13:31:19 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"

etc. The websocket is simply not working. All the reverse proxy configuration was working before (with 1.0.3.29, for example). The nginx config is fine:

      location / {
        proxy_pass http://10.20.1.100:8053/;
        proxy_http_version 1.1;
        proxy_set_header   Host               $host:$server_port;
        proxy_set_header   X-Real-IP          $remote_addr;
        proxy_set_header   X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto  $scheme;
        proxy_set_header   Upgrade            $http_upgrade;
        proxy_set_header   Connection         "upgrade";
      }

I appended these lines to /etc/tower/settings.py:

USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True

I found ansible/awx_web:1.0.6.11 is the latest image working fine for me (this means the websocket reverse proxy settings are fine outside of awx_web!). I hope this helps.

Please note the settings.py changes are not needed for 1.0.6.11 to work; I don't see any impact whether I set those or not.

josemgom commented 6 years ago

I am also facing the same issue.

ENVIRONMENT

The only workaround that currently works for me is stopping everything and starting the containers again.

strawgate commented 6 years ago

This issue does not appear until a little while after redeploying AWX.

I did, however, notice that none of the job details from jobs that ran while this issue was occurring are available even after you restart. It appears as though the "stdout" response on the API is populated via the task container posting data to a websocket for that job.

I also noticed that when the issue is occurring, the task container fails with the following errors:

[2018-07-02 19:03:47,717: DEBUG/Worker-4] using channel_id: 2
2018-07-02 19:03:47,718 ERROR    awx.main.models.unified_jobs job 15 (running) failed to emit channel msg about status change
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/usr/lib/python2.7/site-packages/awx/main/consumers.py", line 70, in emit_channel_notification
    Group(group).send({"text": json.dumps(payload, cls=DjangoJSONEncoder)})
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/channels/channel.py", line 88, in send
    self.channel_layer.send_group(self.name, content)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 190, in send_group
    self.send(channel, message)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 95, in send
    self.recover()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 77, in recover
    self.tdata.consumer.revive(self.tdata.connection.channel())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/connection.py", line 255, in channel
    chan = self.transport.create_channel(self.connection)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 92, in create_channel
    return connection.channel()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/connection.py", line 282, in channel
    return self.Channel(self, channel_id)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 101, in __init__
    self._x_open()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 427, in _x_open
    self._send_method((20, 10), args)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer

This would explain why the job details from jobs that ran while the websockets were not working aren't visible even after restarting the web/task containers, and why they aren't available when hitting the stdout resource on the job endpoint.
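One way to confirm this from the API side (a sketch; substitute your own host, credentials, and job ID):

    # If the output was never persisted, this returns an empty body even
    # though the job itself shows as finished
    curl -sk -u admin:password "https://awx-host/api/v2/jobs/15/stdout/?format=txt"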

stmarier commented 6 years ago

I ran into this issue as well and resolved it by stopping both the web and task containers and rerunning the installer playbook to start them again.

jkhelil commented 6 years ago

We have the issue with 1.0.6.0, and it does not recover after deleting/recreating the pods for awx and etcd.

jijojv commented 6 years ago

Restarting web/task on one dev host where I was testing directly fixed it.

In prod I'm facing issues with websocket errors behind custom reverse proxies. Is it possible, via some header hack, to disable websockets completely, or is that a hard requirement for AWX? Some libraries have fallback options.

strawgate commented 6 years ago

I decided to take a look at the rabbitmq logs, and when websockets stop working I start seeing the following:

2018-07-07 00:56:02.000 [warning] <0.5148.0> closing AMQP connection <0.5148.0> (10.0.0.6:54140 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.5138.0> closing AMQP connection <0.5138.0> (10.0.0.6:54138 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.4690.0> closing AMQP connection <0.4690.0> (10.0.0.6:53950 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.055 [warning] <0.5182.0> closing AMQP connection <0.5182.0> (10.0.0.6:54150 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.056 [warning] <0.5172.0> closing AMQP connection <0.5172.0> (10.0.0.6:54148 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.057 [warning] <0.4731.0> closing AMQP connection <0.4731.0> (10.0.0.6:53974 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.058 [warning] <0.5192.0> closing AMQP connection <0.5192.0> (10.0.0.6:54198 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
anthonyloukinas commented 6 years ago

We're getting the following error every time we try to click on a job, both running ones and ones that have already completed.

WebSocket connection to 'wss://{redacted}/websocket/' failed: WebSocket is closed before the connection is established.

We experienced this both on the latest AWX Web version and on several older revisions. ansible/awx_web:1.0.6.11 in particular was what we tried.

It's worth noting this container sits behind a reverse nginx proxy, but we've tried narrowing this down by removing the proxy altogether and are still getting the same errors/issue. We use this very heavily in production; are there any short-term fixes? Container reboots sometimes work for a few minutes, but things typically fall back to the same errors.

Logs on AWX Web don't show anything overly useful, and likewise with the postgres and task containers. RabbitMQ does show results similar to those stated above.

2018-07-09 12:29:48.398 [warning] <0.11522.5> closing AMQP connection <0.11522.5> (10.0.5.240:40382 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.398 [warning] <0.17632.5> closing AMQP connection <0.17632.5> (10.0.5.240:46896 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.399 [warning] <0.23641.5> closing AMQP connection <0.23641.5> (10.0.5.240:53386 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
JSkier21 commented 6 years ago

Seeing this as well with AWX 1.0.6.25 and Ansible 2.6.1.

EDIT: 1.0.6.1 also seems to not work.

Any page requested like this never completely loads and is blank: https://awx/jobs/playbook/8

Playbooks do actually run (and sometimes fail) which works fine for notifications.

ghost commented 6 years ago

Same behavior, but not seeing any of the errors others have reported. Also, restarting the pod doesn't fix the issue for any amount of time. It looks like I'm just being sent back to the jobs list page.


10.32.5.17 - - [12/Jul/2018:15:50:50 +0000] "PROXY TCP4 10.32.44.94 10.32.44.94 41275 32132" 400 173 "-" "-"
[pid: 37|app: 0|req: 77/525] 10.244.8.0 () {48 vars in 3205 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/ => generated 4586 bytes in 104 msecs (HTTP/1.1 200) 8 headers in 248 bytes (1 switches on core 0)
10.244.8.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/ HTTP/1.1" 200 4586 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
10.244.6.0 - - [12/Jul/2018:15:50:51 +0000] "OPTIONS /api/v2/inventory_updates/9/ HTTP/1.1" 200 11892 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 33|app: 0|req: 238/526] 10.244.6.0 () {50 vars in 3249 bytes} [Thu Jul 12 15:50:51 2018] OPTIONS /api/v2/inventory_updates/9/ => generated 11892 bytes in 149 msecs (HTTP/1.1 200) 8 headers in 249 bytes (1 switches on core 0)
10.244.10.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 HTTP/1.1" 200 17126 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 36|app: 0|req: 123/527] 10.244.10.0 () {48 vars in 3299 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 => generated 17126 bytes in 90 msecs (HTTP/1.1 200) 9 headers in 264 bytes (1 switches on core 0)

AWX 1.0.6.17, Ansible 2.5.5, running on Kubernetes

ghost commented 6 years ago

@Borrelworst Hey friend, would you be able to paste your entire nginx.conf file? I am having the exact same issue, but adding the stanza above did not fix it.

This is mine, FWIW:

    #user awx;

    worker_processes  1;

    pid        /tmp/nginx.pid;

    events {
        worker_connections  1024;
    }

    http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;

        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';

        map $http_upgrade $connection_upgrade {
            default upgrade;
            ''      close;
        }

        sendfile        on;
        #tcp_nopush     on;
        #gzip  on;

        upstream uwsgi {
            server 127.0.0.1:8050;
            }

        upstream daphne {
            server 127.0.0.1:8051;
        }

        server {
            listen 8052 default_server;

            # If you have a domain name, this is where to add it
            server_name _;
            keepalive_timeout 65;

            # HSTS (ngx_http_headers_module is required) (15768000 seconds = 6 months)
            add_header Strict-Transport-Security max-age=15768000;

            location /nginx_status {
              stub_status on;
              access_log off;
              allow 127.0.0.1;
              deny all;
            }

            location /static/ {
                alias /var/lib/awx/public/static/;
            }

            location /favicon.ico { alias /var/lib/awx/public/static/favicon.ico; }

            location ~ ^/(websocket|network_ui/topology/) {
                # Pass request to the upstream alias
                proxy_pass http://daphne;
                # Require http version 1.1 to allow for upgrade requests
                proxy_http_version 1.1;
                # We want proxy_buffering off for proxying to websockets.
                proxy_buffering off;
                # http://en.wikipedia.org/wiki/X-Forwarded-For
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                # enable this if you use HTTPS:
                proxy_set_header X-Forwarded-Proto https;
                # pass the Host: header from the client for the sake of redirects
                proxy_set_header Host $http_host;
                # We've set the Host header, so we don't need Nginx to muddle
                # about with redirects
                proxy_redirect off;
                # Depending on the request value, set the Upgrade and
                # connection headers
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection $connection_upgrade;
            }

            location / {
                # Add trailing / if missing
                rewrite ^(.*)$http_host(.*[^/])$ $1$http_host$2/ permanent;
                uwsgi_read_timeout 120s;
                uwsgi_pass uwsgi;
                include /etc/nginx/uwsgi_params;
            }
        }
    }

anthonyloukinas commented 6 years ago

PSA: If anyone here is using Docker Swarm and having these issues, try to run the same stack using docker-compose (non-swarm v2) and see if you have the same issues.

The issues in this thread were all symptoms we were seeing whilst running in Swarm mode. Once we switched to local instances (docker-compose), we haven't had any issues running AWX behind an Nginx proxy (specifically jwilder's, with custom SSL certificates).

Just wanted to toss this tidbit out there. The Red Hat/AWX team has specifically stated that AWX is NOT supported on Swarm, but I know it makes sense for a lot of people to use Swarm.

JSkier21 commented 6 years ago

@anthonyloukinas, I'm not in swarm, I am using docker-compose, and it doesn't display job status properly at all.

Borrelworst commented 6 years ago

@hitmenow Below is my server block. I left the original configuration intact and just created a conf file in conf.d:

    server {
        ssl on;

        listen 443 ssl default_server;
        server_name <servername>;
        ssl_certificate <certfile>;
        ssl_certificate_key <keyfile>;
        proxy_set_header X-Forwarded-For $remote_addr;
        include /etc/nginx/default.d/*.conf;

        location / {
            proxy_pass http://localhost:80/;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        error_page 404 /404.html;
        location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
        }
    }

It does work for me most of the time, but occasionally I have to restart docker to fix the issue again. The fact that so many people have the same issue tells me that either the documentation is not sufficient or there is really a bug in the software.

strawgate commented 6 years ago

@anthonyloukinas I'm not sure Red Hat provides any support for AWX, so it not being supported by Red Hat isn't a huge deal. We are just hoping for some help from the team to figure out what is causing this in the scenarios where it's occurring (with and without swarm) so we can contribute an open-source fix. Nobody seems to be providing any guidance or insight, which is understandable, but in my opinion we should keep collecting more information here.

konkolorado commented 6 years ago

What I've noticed is that once websockets stop working, subsequent attempts at the websocket opening handshake never complete. Running tcpdump on the web container on port 8051 shows that web never sends out the accept-upgrade response.
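For reference, a capture along these lines shows the stall (a sketch; assumes tcpdump is available inside the web container):

    # Watch the nginx -> daphne leg of the handshake; in the broken state the
    # GET /websocket/ upgrade request goes out but no 101 response comes back
    tcpdump -i lo -A 'tcp port 8051'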

I've traced the websocket connect request path and it's kind of messy. A websocket request gets handled by web but web defers responding to the handshake. Instead what happens is web creates a message on rabbitmq that a websocket connect was received. Task then picks up this message, puts a message back on rabbitmq with the contents {"accept": True}, and once web receives this message it sends out the handshake response to the client, successfully establishing a websocket connection.

What seems to be happening is that, at some point, there is a mismatch between the channels where web and task look for and place their messages (i.e. web listens for accept messages on channel A, but task is sending those messages on channel B). Restarting the supervisor daemons on web and task at the same time (and other workarounds) seems to fix the issue, but only temporarily. I'm also not sure why web isn't handling the websocket handshake response itself.

Full disclosure: I've only been running into these problems when deploying AWX in a swarm environment where each container has no replicas. It looks like something about swarm is causing the channels used for communication between web and task to de-synchronize.
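For the record, the simultaneous restart mentioned above can be done along these lines (a sketch; it assumes the stock AWX images, where supervisord manages the services in both containers):

    # Restart the supervised services in web and task at (roughly) the same time
    docker exec awx_web supervisorctl restart all
    docker exec awx_task supervisorctl restart all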

ghost commented 6 years ago

Thank you @Borrelworst! I have a different scenario than you, I think: I have a load balancer in front of my containers which does SSL termination, and my nginx server is listening on 8052. I will do some more troubleshooting. Thanks again.

sightseeker commented 6 years ago

I resolved this by setting the endpoint_mode of RabbitMQ to dnsrr in Docker Swarm mode. The rabbitmq service in my compose file is:

  rabbitmq:
    image: rabbitmq:3
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      endpoint_mode: dnsrr
    environment:
      RABBITMQ_DEFAULT_VHOST: "awx"
    networks:
      - webnet

strawgate commented 6 years ago

Switching to dnsrr instead of VIP kind of implies that it's an issue with the VIP timing out the idle connection:

https://github.com/moby/moby/issues/37466#issuecomment-405307656 https://success.docker.com/article/ipvs-connection-timeout-issue

This would match the described behavior, where it works initially and then at some undefined later time (relatively quickly) it stops working.
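For context, the kernel default is far above that window (a quick check; 7200 seconds is the usual Linux default, versus the roughly 900-second IPVS idle timeout described in those links):

    # Inspect the current TCP keepalive interval on the Docker host
    sysctl net.ipv4.tcp_keepalive_time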

ghost commented 6 years ago

@sightseeker Is there an equivalent that you know of for Kubernetes deployments?

sightseeker commented 6 years ago

Thank you @strawgate! When I set tcp_keepalive_timeout to less than 900 secs while still using vip mode, the problem no longer occurs.

@hitmenow I haven't tried it yet with K8s.

strawgate commented 6 years ago

It would also imply that switching the containers to using tasks.rabbitmq to reach rabbitmq would fix the issue, as that bypasses the VIP too. Will test and report back.

strawgate commented 6 years ago

@hitmenow Kubernetes doesn't use VIP or swarm networking, so dnsrr is probably not related to your issue.

onitake commented 6 years ago

I'm running AWX in pure docker containers on the same machine (no swarm or k8s) and I was hitting this issue too.

Setting net.ipv4.tcp_keepalive_time=600 helped me as well, but it needs to be set before daphne runs, so it should be put into /etc/sysctl.conf on the host system or similar.
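In other words, something like this in /etc/sysctl.conf on the Docker host (values as described above):

    # Send keepalives before the ~900 s IPVS idle timeout expires
    net.ipv4.tcp_keepalive_time = 600

Apply it with sysctl -p (or a reboot), then restart the containers so daphne's connections pick up the new setting.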

josemgom commented 6 years ago

I just updated the tcp_keepalive setting in my staging and production environments. I will check if this solution helps with the issue.

dadudu81 commented 6 years ago

I have the same issue as well.

ENVIRONMENT

AWX version: 1.0.7
AWX install method: docker on linux
Ansible version: 2.5.4
Operating System: CentOS 7
Web Browser: Firefox/Chrome

grahamneville commented 6 years ago

I have this issue as well. I was on 1.0.4.50 and that was working fine. I've moved up to 1.0.7.0 and now I just see a spinning 'working' wheel when I try to see the job history. I've tried different browsers and incognito windows, but no change.

I'm running AWX on plain docker, not on k8s or OpenShift.

I was using haproxy in front for SSL offload, but I see the same if I browse to the awx_web container directly on its exposed web port (8052).

jakemcdermott commented 6 years ago

@grahamneville - do you have any container logs we can take a look at?

grahamn-gr commented 6 years ago

@jakemcdermott

I've tried a few of the things people have suggested fixed the issue, and some more, but I've had no luck.

These are the logs I see from the awx_web container; I'm not seeing anything coming through at the same time on any of the other containers.

[pid: 138|app: 0|req: 29/440] 1.1.1.1 () {50 vars in 2485 bytes} [Fri Aug 17 08:16:22 2018] OPTIONS /api/v2/jobs/744/ => generated 12949 bytes in 216 msecs (HTTP/1.1 200) 10 headers in 387 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "OPTIONS /api/v2/jobs/744/ HTTP/1.1" 200 12949 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 136|app: 0|req: 258/441] 1.1.1.1 () {48 vars in 2447 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/ => generated 9971 bytes in 237 msecs (HTTP/1.1 200) 10 headers in 386 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/ HTTP/1.1" 200 9971 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 HTTP/1.1" 200 62930 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 135|app: 0|req: 29/442] 1.1.1.1 () {48 vars in 2544 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 => generated 62930 bytes in 415 msecs (HTTP/1.1 200) 11 headers in 402 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 259/443] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:22 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:24 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 260/444] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:24 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:26 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 261/445] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:26 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:28 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 262/446] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:28 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:30 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 137|app: 0|req: 84/447] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:30 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:32 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 263/448] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:32 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:34 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 264/449] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:34 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)

It's just the job details/history view that's a problem, plus the fact that you don't get to see the job running in real time when you launch a new job; every other page loads fine. This is one of the URLs that I'm trying to get to, as seen when clicking on the job in the jobs view: https://ourawxhost/#/jobs/playbook/750?job_search=page_size%3A20%3Border_by%3A-finished%3Bnot__launch_type%3Async