I am experiencing the same issue.
What are you using for a proxy in front of AWX? Do you have your awx_web container bound to 0.0.0.0:port or 127.0.0.1:port? I was experiencing the same issue while accessing AWX behind an nginx proxy running on the Linux host and noticed that when the proxy was disabled the Job detail pages would display properly. After I set the awx_web container to listen on 127.0.0.1, I was no longer experiencing the issue. To bind the awx_web container to 127.0.0.1, specify host_port=127.0.0.1:port (instead of host_port=port) in the installer inventory file.
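For reference, a minimal sketch of that inventory line (8052 here is only a placeholder host port; substitute whatever port your proxy forwards to):
# awx/installer/inventory
host_port=127.0.0.1:8052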
I'm having the same issue where the job details will not display (also running with a proxy in front of awx). Adjusting the awx_web container to listen on 127.0.0.1 did not resolve the issue. Prior to upgrading to 1.0.6.5 this was working properly.
ENVIRONMENT
AWX version: 1.0.6.5
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Ubuntu 16.04
Web Browser: Firefox/Chrome
In developer tools I'm seeing this error:
WebSocket connection to 'wss://<...>' failed
where the <...> is my AWX host; the error is reported from "/#/jobs?job_search=page_size:20;order_by:-finished;not__launch_type:sync:1" and "/#/jobz/inventory/33:1". I am also using nginx as a front-end proxy (port 443).
Thanks for the tip @anasypany and for trying this solution @cstuart1. I also use nginx as a front-end proxy, as I need SSL on port 443. What I haven't tried yet is connecting directly to the awx_web container via an SSH tunnel; if the issue still persists then, it is in the application itself. I won't be able to test this today, but it will be the first thing I do tomorrow morning.
@cstuart1 Can you paste your nginx proxy config? (with censored environment details, of course)
@Borrelworst The solution here is to add in a block for the websocket in your Nginx config
location /websocket {
    proxy_pass http://x.x.x.x:80;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
}
@anasypany this is probably what you were going to suggest/inquire about?
@cstuart1 I was able to get the job details pages working again with this simple nginx proxy config once awx_web was bound to 127.0.0.1:
location / {
    proxy_pass http://127.0.0.1:xxxx;  # xxxx = 80 in your case
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
If you try this config, make sure to add HTTP_X_FORWARDED_FOR in your Remote Host Headers on AWX as well. Let me know if you have any luck!
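If you would rather set that outside the UI, the Remote Host Headers field corresponds to the REMOTE_HOST_HEADERS setting, so roughly the following in the settings file should be equivalent (a sketch; the exact default list varies by version):
REMOTE_HOST_HEADERS = ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'REMOTE_HOST']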
Yes, that resolved the issue for me. I had already added HTTP_X_FORWARDED_FOR to AWX as I'm using SAML for auth.
For someone else reading this thread and trying to set up SAML: I also had to alter /etc/tower/settings.py (task and web) to include the following, and restart Tower after making the change:
USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True
This is mentioned in the Tower documentation, but I thought I would post it here in case someone else reads this thread.
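For completeness, a minimal sketch of the restart step using the default installer container names (yours may differ):
docker restart awx_task awx_web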
@cstuart1: That indeed solved the issue. I have not bound awx_web explicitly to 127.0.0.1 and apparently that is not needed. The only issue I still see is that when I go to my custom inventory scripts and click on schedule inventory syncs, I just see the cog wheel, but nothing happens. This is also described in #1850.
I am also experiencing problems with job details. I deployed a stack with postgres, rabbitmq, memcache, awx_web and awx_task in a swarm (an Ansible role to check variables, create dirs, instantiate a docker-compose template, deploy, and so on). I am using vfarcic's docker-flow to provide access to all the services in the swarm and to automatically detect changes in the configuration and reflect them in the proxy configuration. Within this stack, only awx_web is exposed outside the swarm through docker-flow. All works well except that the websocket for the job listing and details works only during rare intervals, usually after repeatedly killing daphne and nginx inside the awx_web container. Debugging in the browser, I can see a bunch of websocket upgrades being tried, all of them failing with "502 Bad Gateway" after 5 to 6 seconds. At the same time, for each failing websocket attempt, a message like the one below appears in the awx_web log:
2018/05/16 23:36:18 [error] 31#0: *543 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <internal proxy ip>, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "<my specific virtual host>"
Occasionally, the following messages are also printed in the same log:
127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECTING /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECT /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:55] "WSDISCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECTING /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:56] "WSDISCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECTING /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:21] "WSDISCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECTING /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:25:05] "WSDISCONNECT /websocket/" - -
127.0.0.1:34510 - - [16/May/2018:22:42:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:43] "WSCONNECTING /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:42:57] "WSCONNECTING /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:43:02] "WSDISCONNECT /websocket/" - -
(...)
127.0.0.1:35964 - - [16/May/2018:23:35:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:52] "WSCONNECTING /websocket/" - -
127.0.0.1:37312 - - [16/May/2018:23:35:52] "WSDISCONNECT /websocket/" - -
127.0.0.1:37412 - - [16/May/2018:23:35:57] "WSCONNECTING /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:57] "WSDISCONNECT /websocket/" - -
The haproxy config generated by docker-flow for this service (awx_web) is:
frontend services
(...)
acl url_awx-stack_awxweb8052_0 path_beg /
acl domain_awx-stack_awxweb8052_0 hdr_beg(host) -i <my specific virtual host>
use_backend awx-stack_awxweb-be8052_0 if url_awx-stack_awxweb8052_0 domain_awx-stack_awxweb8052_0
(...)
backend awx-stack_awxweb-be8052_0
mode http
http-request add-header X-Forwarded-Proto https if { ssl_fc }
http-request add-header X-Forwarded-For %[src]
http-request add-header X-Client-IP %[src]
http-request add-header Upgrade "websocket"
http-request add-header Connection "upgrade"
server awx-stack_awxweb awx-stack_awxweb:8052
It is very similar to a bunch of other services in the swarm. As far as I can understand, the upstream referenced in the message above is daphne inside the awx_web container; that daphne instance listens on http://127.0.0.1:8051 and is "called" by the nginx proxy configuration also running inside the same container. I am currently investigating how to troubleshoot daphne. I would appreciate any ideas or guidelines on how to proceed with the investigation. Thanks!
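One way to check whether daphne itself answers the websocket upgrade, bypassing every proxy layer, is to issue the handshake from inside the awx_web container. A sketch, assuming curl is available in the image and daphne is on its default 127.0.0.1:8051 bind (the Sec-WebSocket-Key is just a sample value):
docker exec awx_web curl -i -N \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Version: 13" \
    -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
    http://127.0.0.1:8051/websocket/
A healthy handshake comes back as "HTTP/1.1 101 Switching Protocols"; a hang or connection reset points at daphne or the channel layer rather than the external proxy.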
I'm experiencing the same issue
ENVIRONMENT
AWX version: 1.0.6.8
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Debian 9
Web Browser: Firefox/Chrome
I have the same issue as well.
Hi, I had the same issue and I was able to get the job output back by running this command to fix the permissions:
Since most of these comments are related to proxy configurations, I should probably mention that I have the same issue but I do not have a proxy in front of mine.
I'm experiencing the same issue as well. Initially it works fine. I noticed that restarting the containers/docker resolves the issue. I will monitor it to determine whether the issue occurs again, which I assume it will.
Same error. I use nginx with a configuration similar to @anasypany's:
location / {
proxy_pass http://127.0.0.1:8052;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
but I'm unable to see the job details.
@cavamagie
ENVIRONMENT
cat awx/installer/inventory
host_port=127.0.0.1:9999
location / {
proxy_pass http://127.0.0.1:9999/;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
It works for me
@cstuart1 Do you think we can chat out of band regarding SAML setup with AWX? I've been at this for hours with no success.
Edit: I commented on #1016 with details on how to configure AWX for use with SAML auth.
Same issue. @SatiricFX I have noticed the same thing: restarting the docker containers usually helps. Moreover, I am not using any proxy or HTTPS access.
@piroux That does resolve it for us as well temporarily. Haven't found a permanent fix for it. Maybe a bug.
It appears you can swap in a modified supervisor.conf and add verbose output to daphne:
[program:daphne]
command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 awx.asgi:channel_layer -v 2
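After changing the command, something along these lines inside the awx_web container should apply it (a sketch; the program may be listed under a group name in some image versions, so check supervisorctl status first):
docker exec awx_web supervisorctl reread
docker exec awx_web supervisorctl update
docker exec awx_web supervisorctl restart daphne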
With this I am seeing the following behavior related to websockets from Daphne/nginx:
2018-06-27 03:18:59,295 DEBUG Upgraded connection daphne.response.XbupPxYRcS!BfsxXxiUPF to WebSocket daphne.response.XbupPxYRcS!ReBXomhGtg
RESULT 2
OKREADY
10.255.0.2 - - [27/Jun/2018:03:19:02 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:03,491 DEBUG WebSocket closed for daphne.response.XbupPxYRcS!ReBXomhGtg
2018-06-27 03:19:21,372 DEBUG Upgraded connection daphne.response.XbupPxYRcS!aPmLgJGDZd to WebSocket daphne.response.XbupPxYRcS!hTzJudfDoM
10.255.0.2 - - [27/Jun/2018:03:19:24 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:25,571 DEBUG WebSocket closed for daphne.response.XbupPxYRcS!hTzJudfDoM
2018-06-27 03:19:50,862 DEBUG Upgraded connection daphne.response.XbupPxYRcS!lnvEJzPynj to WebSocket daphne.response.XbupPxYRcS!XCyaFNijYM
10.255.0.2 - - [27/Jun/2018:03:19:53 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:53,999 DEBUG WebSocket closed for daphne.response.XbupPxYRcS!XCyaFNijYM
RESULT 2
OKREADY
This eventually logs:
2018-06-27 03:34:03,939 WARNING dropping connection to peer tcp4:127.0.0.1:34576 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
10.255.0.2 - - [27/Jun/2018:03:34:03 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018/06/27 03:34:03 [error] 32#0: *147 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "localhost:8080"
2018-06-27 03:34:03,941 DEBUG WebSocket closed for daphne.response.XbupPxYRcS!gbrIRtuqeq
awx_web:1.0.6.23 here:
10.255.0.2 - - [28/Jun/2018:13:31:14 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
2018/06/28 13:31:14 [error] 25#0: *440 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "awx.prmrgt.com:80"
10.255.0.2 - - [28/Jun/2018:13:31:19 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
etc. The websocket is simply not working. The same reverse proxy configuration was working before (with 1.0.3.29, for example). The nginx config is fine:
location / {
proxy_pass http://10.20.1.100:8053/;
proxy_http_version 1.1;
proxy_set_header Host $host:$server_port;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
I appended these lines to /etc/tower/settings.py:
USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True
I found that ansible/awx_web:1.0.6.11 is the latest image working fine for me (which means the websocket reverse proxy settings are fine outside awx_web!). I hope this helps.
Please note the settings.py changes are not needed for 1.0.6.11 to work; I don't see any impact whether I set those or not.
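If you want to pin that image through the installer rather than by hand, the installer inventory of that era exposed image variables roughly like the following (the variable names are an assumption from memory of the old installer, so check your checkout):
# awx/installer/inventory
dockerhub_base=ansible
dockerhub_version=1.0.6.11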
I am also facing the same issue.
ENVIRONMENT
The only workaround that is currently working for me is stopping everything and starting the containers again.
This issue does not appear to occur for a little while after redeploying AWX.
I did however notice that none of the job details from while this issue is occurring are available even after a restart. It appears as though the "stdout" response on the API is populated via the task container posting data to a websocket for that job.
I also noticed that when the issue is occurring, the task container fails with the following errors:
[2018-07-02 19:03:47,717: DEBUG/Worker-4] using channel_id: 2
2018-07-02 19:03:47,718 ERROR awx.main.models.unified_jobs job 15 (running) failed to emit channel msg about status change
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status
emit_channel_notification('jobs-status_changed', status_data)
File "/usr/lib/python2.7/site-packages/awx/main/consumers.py", line 70, in emit_channel_notification
Group(group).send({"text": json.dumps(payload, cls=DjangoJSONEncoder)})
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/channels/channel.py", line 88, in send
self.channel_layer.send_group(self.name, content)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 190, in send_group
self.send(channel, message)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 95, in send
self.recover()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 77, in recover
self.tdata.consumer.revive(self.tdata.connection.channel())
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/connection.py", line 255, in channel
chan = self.transport.create_channel(self.connection)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 92, in create_channel
return connection.channel()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/connection.py", line 282, in channel
return self.Channel(self, channel_id)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 101, in __init__
self._x_open()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 427, in _x_open
self._send_method((20, 10), args)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
self.channel_id, method_sig, args, content,
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
write_frame(1, channel, payload)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
frame_type, channel, size, payload, 0xce,
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
This would explain why the job details from jobs that ran while the websockets were not working aren't visible even after restarting the web/task containers, and why they aren't available when hitting the stdout resource on the job endpoint.
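For anyone who wants to confirm this from the API side, the job stdout resource can be queried directly (a sketch; the host, credentials and the job id from the traceback above are placeholders):
curl -u admin:password 'https://your-awx-host/api/v2/jobs/15/stdout/?format=txt'
If that comes back empty for jobs that ran while websockets were broken, it matches the theory above.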
I ran into this issue as well and resolved it by stopping both the web and task containers and rerunning the installer playbook to start them again.
We have the issue with 1.0.6.0 and it is not recovering after deleting/recreating the pods for awx and etcd.
Restarting web/task on one dev host where I was testing directly fixed it.
In prod I'm facing websocket errors behind custom reverse proxies. Is it possible, via some header hack, to disable websockets completely, or is that a hard requirement for AWX? Some libraries have fallback options.
Decided to take a look at the rabbitmq logs, and when websockets stop working I start seeing the following:
2018-07-07 00:56:02.000 [warning] <0.5148.0> closing AMQP connection <0.5148.0> (10.0.0.6:54140 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.5138.0> closing AMQP connection <0.5138.0> (10.0.0.6:54138 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.001 [warning] <0.4690.0> closing AMQP connection <0.4690.0> (10.0.0.6:53950 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.055 [warning] <0.5182.0> closing AMQP connection <0.5182.0> (10.0.0.6:54150 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.056 [warning] <0.5172.0> closing AMQP connection <0.5172.0> (10.0.0.6:54148 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.057 [warning] <0.4731.0> closing AMQP connection <0.4731.0> (10.0.0.6:53974 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-07 00:56:02.058 [warning] <0.5192.0> closing AMQP connection <0.5192.0> (10.0.0.6:54198 -> 10.0.0.12:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
We're getting the following error every time we try to click on a job, both running jobs and ones that have already completed.
WebSocket connection to 'wss://{redacted}/websocket/' failed: WebSocket is closed before the connection is established.
We experienced this both on the latest AWX Web version and on several older revisions; ansible/awx_web:1.0.6.11 in particular was one we tried.
It's worth noting this container sits behind a reverse nginx proxy, but we've tried narrowing this down by removing the proxy altogether and still get the same errors/issue. We use this very heavily in production; are there any short-term fixes? Container reboots sometimes work for a few minutes, but typically fall back to the same errors.
Logs on AWX Web don't show anything overly useful, and likewise with postgres and task containers. RabbitMQ does show similar results as stated above.
2018-07-09 12:29:48.398 [warning] <0.11522.5> closing AMQP connection <0.11522.5> (10.0.5.240:40382 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.398 [warning] <0.17632.5> closing AMQP connection <0.17632.5> (10.0.5.240:46896 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
2018-07-09 12:29:48.399 [warning] <0.23641.5> closing AMQP connection <0.23641.5> (10.0.5.240:53386 -> 10.0.5.234:5672, vhost: 'awx', user: 'guest'):
client unexpectedly closed TCP connection
Seeing this as well with AWX 1.0.6.25 and Ansible 2.6.1.
EDIT: 1.0.6.1 also seems to not work.
Any page requested like this never completely loads and is blank: https://awx/jobs/playbook/8
Playbooks do actually run (and sometimes fail) which works fine for notifications.
Same behavior, but not seeing any of the errors others have reported. Also, restarting the pod doesn't fix the issue for any amount of time. It looks like I'm just being sent back to the jobs list page.
10.32.5.17 - - [12/Jul/2018:15:50:50 +0000] "PROXY TCP4 10.32.44.94 10.32.44.94 41275 32132" 400 173 "-" "-"
[pid: 37|app: 0|req: 77/525] 10.244.8.0 () {48 vars in 3205 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/ => generated 4586 bytes in 104 msecs (HTTP/1.1 200) 8 headers in 248 bytes (1 switches on core 0)
10.244.8.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/ HTTP/1.1" 200 4586 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
10.244.6.0 - - [12/Jul/2018:15:50:51 +0000] "OPTIONS /api/v2/inventory_updates/9/ HTTP/1.1" 200 11892 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 33|app: 0|req: 238/526] 10.244.6.0 () {50 vars in 3249 bytes} [Thu Jul 12 15:50:51 2018] OPTIONS /api/v2/inventory_updates/9/ => generated 11892 bytes in 149 msecs (HTTP/1.1 200) 8 headers in 249 bytes (1 switches on core 0)
10.244.10.0 - - [12/Jul/2018:15:50:51 +0000] "GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 HTTP/1.1" 200 17126 "https://awx/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
[pid: 36|app: 0|req: 123/527] 10.244.10.0 () {48 vars in 3299 bytes} [Thu Jul 12 15:50:51 2018] GET /api/v2/inventory_updates/9/events/?order_by=start_line&page=1&page_size=50 => generated 17126 bytes in 90 msecs (HTTP/1.1 200) 9 headers in 264 bytes (1 switches on core 0)
AWX 1.0.6.17 Ansible 2.5.5 running on Kubernetes
@Borrelworst Hey friend, would you be able to paste your entire nginx.conf file? I am having the exact same issue but adding the stanza above did not fix my issue.
This is mine, FWIW:
#user awx;
worker_processes 1;
pid /tmp/nginx.pid;
events {
worker_connections 1024;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
sendfile on;
#tcp_nopush on;
#gzip on;
upstream uwsgi {
server 127.0.0.1:8050;
}
upstream daphne {
server 127.0.0.1:8051;
}
server {
listen 8052 default_server;
# If you have a domain name, this is where to add it
server_name _;
keepalive_timeout 65;
# HSTS (ngx_http_headers_module is required) (15768000 seconds = 6 months)
add_header Strict-Transport-Security max-age=15768000;
location /nginx_status {
stub_status on;
access_log off;
allow 127.0.0.1;
deny all;
}
location /static/ {
alias /var/lib/awx/public/static/;
}
location /favicon.ico { alias /var/lib/awx/public/static/favicon.ico; }
location ~ ^/(websocket|network_ui/topology/) {
# Pass request to the upstream alias
proxy_pass http://daphne;
# Require http version 1.1 to allow for upgrade requests
proxy_http_version 1.1;
# We want proxy_buffering off for proxying to websockets.
proxy_buffering off;
# http://en.wikipedia.org/wiki/X-Forwarded-For
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# enable this if you use HTTPS:
proxy_set_header X-Forwarded-Proto https;
# pass the Host: header from the client for the sake of redirects
proxy_set_header Host $http_host;
# We've set the Host header, so we don't need Nginx to muddle
# about with redirects
proxy_redirect off;
# Depending on the request value, set the Upgrade and
# connection headers
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
}
location / {
# Add trailing / if missing
rewrite ^(.*)$http_host(.*[^/])$ $1$http_host$2/ permanent;
uwsgi_read_timeout 120s;
uwsgi_pass uwsgi;
include /etc/nginx/uwsgi_params;
}
}
}
PSA: If anyone here is using Docker SWARM and having these issues, try to run the same stack just using docker-compose (non-swarm v2), and see if you have the same issues.
The issues in this thread were all symptoms we were seeing whilst running in Swarm mode. Once we switched to local instances (docker-compose), we haven't had any issues running AWX behind an Nginx Proxy (specifically Jwilder's with custom SSL Certificates).
Just wanted to toss this tidbit out there. RedHat/AWX team has specifically stated AWX is NOT swarm supported, but I know it makes sense for a lot of people to use Swarm.
@anthonyloukinas, I'm not in swarm; I'm using docker-compose and it doesn't display job status properly at all.
@hitmenow Below is my server block. I left the original configuration intact and just created a conf file in conf.d:
server {
ssl on;
listen 443 ssl default_server;
server_name <servername>;
ssl_certificate <certfile>;
ssl_certificate_key <keyfile>;
proxy_set_header X-Forwarded-For $remote_addr;
include /etc/nginx/default.d/*.conf;
location / {
proxy_pass http://localhost:80/;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
error_page 404 /404.html;
location = /40x.html {
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
}
}
It does work for me most of the time, but occasionally I have to restart docker to fix the issue again. The fact that so many people have the same issue tells me that either the documentation is not sufficient or there is really a bug in the software causing the issue.
@anthonyloukinas I'm not sure RedHat provides any support for AWX, so it not being supported by RedHat isn't a huge deal. We are just hoping for some help from the team to figure out what is causing this in the scenarios it's occurring in (with and without swarm) so we can contribute an open-source fix. Nobody seems to be providing any guidance or insight, which is understandable, but in my opinion we should keep collecting more information here.
What I've noticed is once websockets stop working, subsequent attempts at the websocket opening handshake never complete. Running tcpdump on the web container on port 8051 shows web never sends out the accept-upgrade response.
I've traced the websocket connect request path and it's kind of messy. A websocket request gets handled by web but web defers responding to the handshake. Instead what happens is web creates a message on rabbitmq that a websocket connect was received. Task then picks up this message, puts a message back on rabbitmq with the contents {"accept": True}, and once web receives this message it sends out the handshake response to the client, successfully establishing a websocket connection.
What seems to be happening is that, at some point, there is a mismatch between the channels where web and task look for and place their messages (i.e. web listens for accept messages on channel A but task is sending those messages on channel B). Restarting the supervisor daemons on web and task at the same time (and other workarounds) seems to fix the issue, but only temporarily. I'm also not sure why web isn't handling the websocket handshake response itself.
Full disclosure: I've only been running into these problems when deploying AWX in a swarm environment where each container has no replicas. It looks like something about swarm is causing the channels used for communication between web and task to de-synchronize.
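For reference, the temporary workaround mentioned above (restarting the supervisor daemons on web and task at roughly the same time) can be sketched like this, assuming the default container names from the installer:
docker exec awx_web supervisorctl restart all &
docker exec awx_task supervisorctl restart all &
wait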
Thank you @Borrelworst! I have a different scenario than you I think. I have a load balancer in front of my containers which has SSL termination. And my nginx server is listening on 8052. Will do some more troubleshooting. Thanks again
I resolved it by setting the endpoint_mode of RabbitMQ to dnsrr in Docker Swarm mode. The rabbitmq stack in the compose file is:
rabbitmq:
image: rabbitmq:3
deploy:
replicas: 1
restart_policy:
condition: on-failure
endpoint_mode: dnsrr
environment:
RABBITMQ_DEFAULT_VHOST: "awx"
networks:
- webnet
Switching to dnsrr instead of VIP kind of implies that it's an issue with the VIP timing out the idle connection --
https://github.com/moby/moby/issues/37466#issuecomment-405307656 https://success.docker.com/article/ipvs-connection-timeout-issue
This would match with the described behavior where it works initially and then at some undefined later time (relatively quickly) it stops working.
@sightseeker Is there an equivalent that you know of for Kubernetes deployments?
Thank you @strawgate!
When I set tcp_keepalive_timeout to less than 900 seconds while using VIP mode, the problem no longer occurs.
@hitmenow I haven't tried yet with K8s.
It would also imply that switching the containers to using tasks.rabbitmq to hit rabbitmq would fix the issue as that bypasses the VIP too. Will test and report back
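If anyone else wants to try that, a sketch of the idea in the stack file for the web and task services, assuming they pick up the broker host from a RABBITMQ_HOST environment variable (that variable name is an assumption based on the installer compose templates of the time, so check your generated file):
  environment:
    RABBITMQ_HOST: tasks.rabbitmq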
@hitmenow Kubernetes doesn't use VIP or swarm networking, so dnsrr is probably not related to your issue.
I'm running AWX in pure docker containers on the same machine (no swarm or k8s) and I was hitting this issue too.
Setting net.ipv4.tcp_keepalive_time=600 helped me as well, but it needs to be set before daphne runs, so it should be put into /etc/sysctl.conf on the host system or similar.
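A minimal sketch of making that persistent on the Docker host, with the value suggested above:
# /etc/sysctl.conf on the host
net.ipv4.tcp_keepalive_time = 600
# apply immediately, then restart the containers so daphne starts with it
sysctl -p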
I just updated the tcp keepalive in my staging and production environments. I will check whether this solution helps with the issue.
I have the same issue as well.
ENVIRONMENT
AWX version: 1.0.7
AWX install method: docker on linux
Ansible version: 2.5.4
Operating System: CentOS 7
Web Browser: Firefox/Chrome
I have this issue as well. I was on 1.0.4.50 and that was working fine. I've moved up to 1.0.7.0 and now I just see a spinning 'working' wheel when I try to see job history. I've tried different browsers and incognito windows, but no change.
I'm running AWX just on normal docker. Not on k8s or openshift.
I was using haproxy in front for SSL offload, but I still see the same behavior if I browse to the awx_web container on its exposed web port (8052).
grahamneville - do you have any container logs we can take a look at?
@jakemcdermott
I've tried a few things, listed below, that people have suggested fixed the issue, and some more, but I've had no luck:
- host_port=127.0.0.1:port in the inventory file for exposing the port in awx_web
- /etc/tower/settings to have USE_X_FORWARDED_PORT = True and USE_X_FORWARDED_HOST = True, which I baked into a new build
- net.ipv4.tcp_keepalive_time set to net.ipv4.tcp_keepalive_time=600, then restarted the docker service on the host and restarted all containers
- chmod 744 -R /opt/awx/embedded (/opt/awx/embedded doesn't exist on the containers)
- 2d4fbffb919884a8f9fb6ba690756cefd61929c7
These are the logs I see from the awx_web container, I'm not seeing anything coming through at the same time on any of the other containers.
[pid: 138|app: 0|req: 29/440] 1.1.1.1 () {50 vars in 2485 bytes} [Fri Aug 17 08:16:22 2018] OPTIONS /api/v2/jobs/744/ => generated 12949 bytes in 216 msecs (HTTP/1.1 200) 10 headers in 387 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "OPTIONS /api/v2/jobs/744/ HTTP/1.1" 200 12949 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 136|app: 0|req: 258/441] 1.1.1.1 () {48 vars in 2447 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/ => generated 9971 bytes in 237 msecs (HTTP/1.1 200) 10 headers in 386 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/ HTTP/1.1" 200 9971 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 HTTP/1.1" 200 62930 "https://ourawxhost/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36" "2.2.2.2"
[pid: 135|app: 0|req: 29/442] 1.1.1.1 () {48 vars in 2544 bytes} [Fri Aug 17 08:16:22 2018] GET /api/v2/jobs/744/job_events/?order_by=-counter&page=1&page_size=50 => generated 62930 bytes in 415 msecs (HTTP/1.1 200) 11 headers in 402 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:22 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 259/443] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:22 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:24 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 260/444] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:24 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:26 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 261/445] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:26 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:28 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 262/446] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:28 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:30 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 137|app: 0|req: 84/447] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:30 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:32 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 263/448] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:32 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
1.1.1.1 - - [17/Aug/2018:08:16:34 +0000] "HEAD / HTTP/1.1" 200 0 "-" "-" "-"
[pid: 136|app: 0|req: 264/449] 1.1.1.1 () {28 vars in 291 bytes} [Fri Aug 17 08:16:34 2018] HEAD / => generated 11339 bytes in 24 msecs (HTTP/1.1 200) 5 headers in 161 bytes (1 switches on core 0)
It's just the job details/history view that's a problem, plus the fact that you don't get to see the job running in real time when you launch a new job; every other page loads fine. This is one of the URLs I'm trying to get to, as seen when clicking on the job in the jobs view: https://ourawxhost/#/jobs/playbook/750?job_search=page_size%3A20%3Border_by%3A-finished%3Bnot__launch_type%3Async
ISSUE TYPE
COMPONENT NAME
SUMMARY
Job details and Job view not working properly
ENVIRONMENT
STEPS TO REPRODUCE
Run any playbook; failed and succeeded jobs are present but do not show any details.
EXPECTED RESULTS
Details from jobs
ACTUAL RESULTS
Nothing is showing, no errors, no timeouts, just nothing
ADDITIONAL INFORMATION
For example, I have a failed job. When clicking on details, I can see the URL changing to: https://awx-url/#/jobz/project/
However nothing happens. When using the right mouse button and opening in a new tab/page, I only get the navigation pane and a blank page.
The same happens when I click on the job itself.
Additionally, adding inventory sources works fine, but when navigating to 'Schedule inventory sync' I can see the gear wheel spinning and nothing happens. I did a fresh installation today (9th May).