Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent
http://azure.microsoft.com/
Apache License 2.0

[BUG] Unable to connect with SSH with VMSS after RDMA installation #1942

Open Villux opened 4 years ago

Villux commented 4 years ago

I had an image without RDMA, and creating a VMSS from it worked normally. I then installed RDMA following these instructions https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/enable-infiniband#manually-install-mellanox-ofed and created a new image. Since then I can no longer access the VMs created from the new image: SSH just hangs and eventually times out. I have tested this a couple of times, and the RDMA installation seems to be the step that breaks access to the VMs.

I tried to enable boot diagnostics but couldn't get it to work. I have it turned on for the VMSS, but when I inspect the boot diagnostics settings for the individual VMs, it tells me to enable it for the VMSS. I tried this a couple of times and restarted the VMs.

waagent --version
WALinuxAgent-2.2.45 running on ubuntu 18.04
Python: 3.6.9
Goal state agent: 2.2.49.2

What could I do to fix this? I need RDMA baked into the image because I don't want to run the installation on each machine. I might use close to a hundred VMs, and the installation takes long enough that even automating it is not a good solution.

pgombar commented 4 years ago

Hi @Villux, it will be hard to debug this without any logs. Could you enable boot diagnostics before installing RDMA?

Villux commented 4 years ago

> Hi @Villux, it will be hard to debug this without any logs. Could you enable boot diagnostics before installing RDMA?

I understand. I have enabled it, but for some reason I don't get any logs. I have a ticket open about this and will post the logs as soon as I get them. Boot diagnostics work for normal VMs, but for some reason not for the VMSS. I have enabled it for the VMSS, yet when I try to add it to the individual VMs in the scale set it says "enable it for VMSS", which I have already done.

Villux commented 4 years ago

Update: I have chatted with one of the engineering teams at Microsoft, and they were interested in trying my steps to see what is failing. I haven't heard back from them yet, but I will post any new information here.

After a node upgrade I was able to see the diagnostics, and it seems that networking can't start on the nodes. The reason is a conflicting MAC address between eth1 and ib0. Most likely this has nothing to do with WALinuxAgent.
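
For anyone who wants to check for the same conflict on a node they can still reach, here is a minimal Python sketch, not an official tool. It assumes a Linux host where each interface exposes its hardware address under /sys/class/net/<iface>/address. The function names are made up for illustration, and it only compares what the kernel reports in sysfs, so it may not match the exact check that fails during boot.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_macs(iface_to_mac):
    """Return {mac: sorted interface names} for every MAC shared by 2+ interfaces."""
    by_mac = defaultdict(list)
    for iface, mac in iface_to_mac.items():
        # Normalize so that "AA:BB" and "aa:bb" count as the same address.
        by_mac[mac.strip().lower()].append(iface)
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}


def read_interface_macs(sysfs_root="/sys/class/net"):
    """Read each interface's hardware address from sysfs (assumes Linux)."""
    macs = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return macs
    for entry in root.iterdir():
        try:
            macs[entry.name] = (entry / "address").read_text().strip()
        except OSError:
            # Some entries (e.g. bonding_masters) have no address file.
            continue
    return macs


if __name__ == "__main__":
    for mac, names in find_duplicate_macs(read_interface_macs()).items():
        print(f"duplicate MAC {mac}: {', '.join(names)}")
```

If eth1 and ib0 report the same address, the script prints one line naming both interfaces. If it prints nothing, no two interfaces share an address in sysfs.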

nervermore2 commented 4 years ago

@Villux Hi, were you able to find a solution for this? I'm facing the same issue. It has taken me a whole day and I still haven't been able to figure it out.

Villux commented 4 years ago

Not yet. I have an ongoing discussion with one of the engineering teams about this, but I haven't heard from them in many weeks, so I'm not sure how actively they are working on it.

My current workaround is Ansible playbooks. I just use the plain DSVM image and install everything on the instances with Ansible. It adds extra setup time, but that is better than debugging this.

nervermore2 commented 4 years ago

Thanks. Have you tried another Ubuntu version, like 16.04 LTS? Also, is this issue specific to the VM size? I'm using HC44rs.

Villux commented 4 years ago

I have only used Ubuntu 18. I have tried different NC sizes, but I don't think the issue is related to the VM size in any way.

nervermore2 commented 4 years ago

I tried Ubuntu 16.04 LTS and it's working properly. I think the problem is specific to Ubuntu 18.04 LTS.

Villux commented 4 years ago

Thanks for the tip! I will have to try it as well.

Villux commented 4 years ago

@nervermore2 btw, here's a workaround for the issue if you use Ubuntu 18: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/hb-hc-known-issues#duplicate-mac-with-cloud-init-with-ubuntu-on-h-series-and-n-series-vms

nervermore2 commented 4 years ago

@Villux @pgombar I don't think the solution works. I followed the instructions and still wasn't able to launch the image.