Getting a "connection refused" error in the browser after running the DevOps playbook on a Red Hat VM.

kaydanzie commented 7 years ago

Original story: https://trello.com/c/N01ALkSt

This is the error I'm seeing in Chrome: [screenshot, 2017-10-23 at 4:08:21 PM]

When I enter the URL as https://localhost:8080 or https://localhost:8443, it redirects to the Signage homepage URL https://localhost/users/sign_in, but I get the error shown above.

No error messages found in these logs: /var/www/signage/shared/log/puma.access.log, /var/www/signage/shared/log/puma.error.log, /var/www/signage/shared/log/virtualbox.log, /var/log/nginx/access.log, /var/log/nginx/error.log.

Inside issue relating to web server config: https://github.com/chapmanu/inside/issues/851

kaydanzie commented 7 years ago

Some common errors I came across while running the new Signage playbook that builds a web server and database server.

Error:

TASK [ontic.account : Account | Configure users.]
fatal: [virtualbox]: FAILED! => {"failed": true, "msg": "[{u'files': [{u'path': u'.ssh', u'state': u'directory', u'mode': u'0700'}], u'group': u'{{ app_user }}', u'name': u'{{ app_user }}', u'createhome': True, u'sudoer': True, u'password': u\"{{ lookup('password', '{{ app_user_pw_path }} length=16 encrypt=md5_crypt') }}\"}]: An unhandled exception occurred while running the lookup plugin 'password'. Error was a <class 'ansible.errors.AnsibleError'>, original message: passlib must be installed to encrypt vars_prompt values"}

Solution: Had to run sudo pip install passlib on my machine.
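
To confirm the install took effect, it can help to check whether passlib is importable. A minimal sketch, assuming the script is run with the same Python interpreter that Ansible uses on the control machine (if Ansible runs under a different interpreter, the result may not apply); `has_module` is a hypothetical helper name, not part of Ansible:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The password lookup in the error above needs passlib for encrypt=md5_crypt.
    status = "installed" if has_module("passlib") else "missing"
    print(f"passlib is {status}")
```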

Error:

TASK [Gathering Facts]
fatal: [virtualbox]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).\r\n", "unreachable": true}

Solution: Every time you restore the clean VM snapshot (wimops-ready) to rerun the playbook, you have to run ssh-copy-id -p 2222 wimops@localhost again.

Error:

TASK [Gathering Facts]
fatal: [virtualbox]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection closed\r\n", "unreachable": true}

Solution: This was after I ran the playbook once successfully, deleted the tmp/ansible-signage-deploy-virtualbox-pw.txt file locally, then tried to run the playbook again. I had to restore the snapshot and run the playbook from scratch to fix it. It appears like that role will only create the randomly-generated password file the first time around but every time you run the playbook after that it just does a lookup.
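
The create-once-then-look-up behavior described above can be modeled in a few lines. This is only a simplified sketch of how Ansible's password lookup is documented to behave, not its actual implementation; `lookup_password` and the file name are hypothetical. It suggests why deleting the local file matters: the next call generates a fresh password, which would presumably no longer match the one already set on the VM.

```python
import secrets
import string
import tempfile
from pathlib import Path


def lookup_password(path: Path, length: int = 16) -> str:
    """Return the password stored at `path`, generating and saving it on first use."""
    if path.exists():
        # Later runs: reuse what is on disk; nothing new is generated.
        return path.read_text().strip()
    # First run (or file deleted): generate a random password and persist it.
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(length))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(password + "\n")
    return password


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pw_file = Path(tmp) / "deploy-pw.txt"  # hypothetical file name
        first = lookup_password(pw_file)   # generated and written to disk
        second = lookup_password(pw_file)  # read back from disk
        print("second run reused the file:", first == second)
        pw_file.unlink()                   # simulate deleting the local file
        third = lookup_password(pw_file)   # a new password is generated
        print("password regenerated after delete:", pw_file.exists())
```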

Error (in browser): 502 Bad Gateway

Possible solution: Puma isn't running on the VM. Start it with bundle exec puma.

One time I tried to do ssh -p 2222 deploy@localhost and I got a really generic error message: ssh_exchange_identification: Connection closed by remote host. That was because right before, I had run chown -R deploy /var while logged in as root. Problem was that the /var/empty directory has to be owned by root.
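
A quick ownership check can catch this before SSH stops working. The following is a rough sketch to run on the VM itself; `owner_uid` and `is_root_owned` are hypothetical helper names, and the path is the one mentioned above:

```python
import os


def owner_uid(path: str) -> int:
    """Return the numeric user ID that owns `path`."""
    return os.stat(path).st_uid


def is_root_owned(path: str) -> bool:
    """Return True if `path` is owned by root (UID 0)."""
    return owner_uid(path) == 0


if __name__ == "__main__":
    path = "/var/empty"  # directory named in the comment above
    if not os.path.exists(path):
        print(f"{path} does not exist on this machine")
    elif is_root_owned(path):
        print(f"{path} is owned by root")
    else:
        print(f"{path} is owned by UID {owner_uid(path)}, expected root")
```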

kaydanzie commented 7 years ago

Tried with curl to see if I could get any more details about the error.

$ curl -v "https://localhost:80"
* Rebuilt URL to: https://localhost:80/
*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 80 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connection failed
* connect to 127.0.0.1 port 80 failed: Connection refused
* Failed to connect to localhost port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 80: Connection refused

kaydanzie commented 7 years ago

We tested the theory that it was a firewall issue, so I reset the VM and ran the playbook from scratch without the firewall ansible role.

After deploying the app, this is what I get from curl:

$ curl -v "https://127.0.0.1:8443"
* Rebuilt URL to: https://127.0.0.1:8443/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8443 (#0)
* WARNING: using IP address, SNI is being disabled by the OS.
* Server aborted the SSL handshake
* Closing connection 0
curl: (35) Server aborted the SSL handshake

In the browser, it just takes a really long time to load and then shows this: [screenshot, 2017-10-26 at 3:24:38 PM]

I feel like this is a step backwards from where I was yesterday just because now the browser doesn't redirect to /users/sign_in like it was yesterday. So my plan now is to add back the firewall ansible role that I removed and find a different way to approach the issue I started with yesterday.

kaydanzie commented 7 years ago

The difference in the curl output after running the playbook with the firewall role versus without it is what makes me think this is some kind of port forwarding/firewall issue.

With the firewall role, the host can never connect at all to 127.0.0.1 at port 80:

* Connection failed
* connect to 127.0.0.1 port 80 failed: Connection refused

But it did when I removed the firewall role:

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8443 (#0)
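
The same comparison can be repeated across all the forwarded ports at once. This is a minimal sketch that only tests whether the TCP connection is accepted, which corresponds to the "Connected to" versus "Connection refused" lines in the curl output; it says nothing about the SSL handshake. `probe` is a hypothetical helper, and the port list is taken from the URLs tried in this thread:

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening, or the connection was actively rejected.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    for port in (80, 8080, 8443):
        print(port, probe("127.0.0.1", port))
```

Running it once with the firewall role applied and once without should show whether the difference is at the TCP level or further up the stack.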

tatwell commented 7 years ago

To extend Kayla's comment above, some additional playbook errors I encountered and how I resolved them:

Enable RHEL 7 Server Optional RPMs Ansible task failed

Needed to re-register the server. For details, see the wiki.

Import Nodesource RPM key (CentOS 7+)

I hit this a couple times:

TASK [geerlingguy.nodejs : Import Nodesource RPM key (CentOS 7+)..] ****************************************************
fatal: [virtualbox]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to validate the SSL certificate for rpm.nodesource.com:443. Make sure your managed systems have a valid CA certificate installed. You can use validate_certs=False if you do not need to confirm the servers identity but this is unsafe and not recommended. Paths checked for this platform: /etc/ssl/certs, /etc/pki/ca-trust/extracted/pem, /etc/pki/tls/certs, /usr/share/ca-certificates/cacert.org, /etc/ansible. The exception msg was: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)."}

If I simply re-ran the playbook, the task would pass without an error. For a possible explanation why, see this Trello card comment.

It is also possible that it depends on another step in the playbook that installs the server, and that it's a sequencing issue.

VM webserver can't be reached / Firewall Issue

This again appears to be an intermittent issue that was resolved by re-running the playbook. For details, see this Trello comment.

tatwell commented 6 years ago

@kaylaziegler Can we close this ticket? We did mark the associated Trello card as complete:

https://trello.com/c/N01ALkSt/560-deploy-signage-to-red-hat-image-on-virtualbox

kaydanzie commented 6 years ago

@tatwell That's fine, you took over this story after me I think so if you're good with closing it then I am too.

tatwell commented 6 years ago

@kaylaziegler Ok, I'll close it. I'm assuming I meant to but overlooked it.
