alan-turing-institute / azure-sensible

A sensible starting point for deploying and configuring virtual machines on Azure
MIT License

Intermittent SSH issues #17

Open sgibson91 opened 3 years ago

sgibson91 commented 3 years ago

@DavidBeavan and I have seen intermittent issues with SSH into VMs. We hope we've fixed this by upgrading to the devsec.hardening collection, but we're leaving this issue open to track error messages if it does recur, so we can work out what's happening.

sgibson91 commented 3 years ago

After running Ansible with the new devsec hardening collection:

Received disconnect from xx.xxx.xxx.xx port 22:2: Too many authentication failures
Disconnected from xx.xxx.xxx.xx port 22

Then followed by

$ ansible-playbook -i inventory.yaml playbook.yaml

PLAY [configure vm] *********************************************************************

TASK [Gathering Facts] ******************************************************************
fatal: [vm]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Received disconnect from xx.xxx.xxx.xx port 22:2: Too many authentication failures\r\nDisconnected from xx.xxx.xxx.xx port 22", "unreachable": true}

PLAY RECAP ******************************************************************************
vm                         : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0 

This error came from a new VM that had only the Ansible Admin key and my public SSH key on it.

sgibson91 commented 3 years ago

New error on a clean VM:

$ ssh -i ~/.ssh/id_rsa.pub dev@xx.xxx.xx.xx
dev@xx.xxx.xx.xx: Permission denied (publickey).

Followed by same Ansible error as above.

JimMadge commented 3 years ago

> New error on a clean VM:
>
> $ ssh -i ~/.ssh/id_rsa.pub dev@xx.xxx.xx.xx
> dev@xx.xxx.xx.xx: Permission denied (publickey).
>
> Followed by same Ansible error as above.

This is a mistake in the README 🤦: the argument of -i should be a valid private key, not the public key (id_rsa.pub). This is fixed in #18.
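A quick way to catch this kind of mix-up before connecting is to check what sort of file is being passed to -i. The sketch below uses a hypothetical helper, looks_like_private_key, which is not part of this repository. It assumes the key is in OpenSSH or PEM-style format, where the first line of a private key ends in "PRIVATE KEY-----", whereas a .pub file is a single line starting with the key type (for example ssh-rsa). Other key formats would not be recognised by this check.

```shell
# Hypothetical helper (not part of azure-sensible): succeeds if the first
# line of the file looks like a private key header, for example
#   -----BEGIN OPENSSH PRIVATE KEY-----
#   -----BEGIN RSA PRIVATE KEY-----
looks_like_private_key() {
  head -n 1 "$1" | grep -q 'PRIVATE KEY-----$'
}

# Example usage (the paths are assumptions):
#   looks_like_private_key ~/.ssh/id_rsa     && echo "fine to pass to -i"
#   looks_like_private_key ~/.ssh/id_rsa.pub || echo "this is a public key"
```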

JimMadge commented 3 years ago

Too many authentication failures may be related to sshd's MaxAuthTries parameter.

Related Stack Exchange question: https://unix.stackexchange.com/questions/418582/in-sshd-config-maxauthtries-limits-the-number-of-auth-failures-per-connection
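One common way to hit this limit, though not confirmed as the cause here, is a client that offers several keys in turn: each key offered from ssh-agent counts as an authentication attempt, so the server can disconnect before the right key is tried. A minimal sketch of a client-side workaround is to set IdentitiesOnly so that only the named key is offered. The file name ssh_config.example and the host alias azure-vm are assumptions for illustration; the address is the placeholder used in this thread.

```shell
# Write an example client config to a scratch file. With IdentitiesOnly
# set to yes, ssh offers only the IdentityFile below rather than every
# key held by ssh-agent.
cat > ssh_config.example <<'EOF'
Host azure-vm
    HostName xx.xxx.xxx.xx
    User dev
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
EOF

# To list the keys the agent would otherwise offer:
#   ssh-add -l
# To connect using this config:
#   ssh -F ssh_config.example azure-vm
```

The same effect is available for a one-off connection with ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa dev@xx.xxx.xxx.xx.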

sgibson91 commented 3 years ago

So this is what I've tried this morning. Note: this is running Ansible without any ssh-hardening role.

  1. Create new VM with terraform
  2. Try to run Ansible as normal; blocked with an "Operation timed out" error.
  3. Destroy the VM using terraform
  4. Create my own VM manually through the Azure portal with user sgibson and my id_rsa.pub ssh key
  5. Run ssh -i ~/.ssh/id_rsa sgibson@VM_IP. Successfully access machine.
  6. Generate an inventory.yaml for Ansible with VM_IP, sgibson and path to my private ssh key
  7. Run Ansible as usual, everything is working so far.
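
Step 6 can be sketched roughly as follows. This assumes a standard Ansible YAML inventory; the layout actually used by this repository may differ. The host alias vm matches the name shown in the Ansible output earlier in this thread, and VM_IP is left as the placeholder from the steps above.

```shell
# Placeholders from the steps above; substitute real values.
VM_IP="VM_IP"
VM_USER="sgibson"
KEY_PATH="$HOME/.ssh/id_rsa"   # the private key, not id_rsa.pub

# Write a minimal YAML inventory with a single host called "vm".
cat > inventory.yaml <<EOF
all:
  hosts:
    vm:
      ansible_host: ${VM_IP}
      ansible_user: ${VM_USER}
      ansible_ssh_private_key_file: ${KEY_PATH}
EOF

# Then run the playbook as before:
#   ansible-playbook -i inventory.yaml playbook.yaml
```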

I don't know much about PEM keys, so I wonder if this is something terraform is doing that's causing the issue? Alternatively, is there something I could run that would "purge" my ssh client so I can start from scratch?

Update:

  1. Ansible successfully completed and I can now ssh into the manually created VM as both sgibson and dev with no issue, using my private ssh key.

So I wonder if the problem is terraform?
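
On the question above about "purging" the ssh client: there is no single reset command, but the two pieces of client-side state most likely to carry over between VMs are the keys loaded in ssh-agent and the cached host keys in known_hosts. The sketch below wraps the two standard commands in a hypothetical helper, reset_ssh_client, which is not part of this repository. Nothing in this thread confirms that stale client state is the cause, so treat this as a way to rule it out.

```shell
# Hypothetical helper (not part of azure-sensible).
# Usage: reset_ssh_client VM_IP
reset_ssh_client() {
  vm_ip="$1"
  # Remove every key currently loaded in the running ssh-agent.
  # Key files on disk are not touched.
  ssh-add -D
  # Remove any cached host key for this address from ~/.ssh/known_hosts,
  # which matters when a destroyed and recreated VM reuses the same IP.
  ssh-keygen -R "$vm_ip"
}
```

After running it, connecting with ssh -i ~/.ssh/id_rsa sgibson@VM_IP starts from a clean agent and prompts to accept the host key again.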

JimMadge commented 3 years ago

@sgibson91 I had a chance to try the code you sent me.

I have found that consistently,

There are two things I have fixed which aren't incorporated into your code.

The first is #24: the NSG rules were not actually being applied to the NIC, which would sometimes (but not always 😕?) lead to problems connecting (entirely my fault for not spotting this before 😓).

Second, looking at your code made me look into how the variables being passed to the ssh hardening role were being processed. In short, it looks like all the vars within the role block are actually global, so even though you weren't using totp, removing the totp-focused ssh hardening block actually meant you were using different variables to me. I've fixed this in #27.

As you are not interested in using TOTP, the changes in #27 are probably too verbose for your configuration. Could you try changing the ssh hardening vars to simply

sftp_enabled: true

That one helps Ansible work as it uses sftp to copy files by default. The rest were either cosmetic or just the defaults. I think that ssh_use_pam: false was causing the problem, but because of the way variables are handled (like I've said above) I never actually deployed a machine using that option!

sgibson91 commented 3 years ago

Thanks @JimMadge! I suspect this is a 2021 task for me now, but thank you for looking into it!