Open Villux opened 4 years ago
Hi @Villux, it will be hard to debug this without any logs. Could you enable boot diagnostics before installing RDMA?
I understand. I have enabled it, but for some reason I don't get any logs. I have a ticket open on this and I will post the logs as soon as I get them. Boot diagnostics work for normal VMs, but not for the VMSS. I have enabled it for the VMSS, yet when I try to add it to the individual VMs under the scale set it says "enable it for VMSS", which I have already done.
Update: I have chatted with one of the engineering teams in Microsoft and they were interested to try my steps to see what is failing. Haven't heard from them yet, but I will update all the info here.
After a node upgrade I was able to see the diagnostics, and it seems that the network can't start on the nodes. The reason is a conflicting MAC address between eth1 and ib0. Most likely this has nothing to do with WALinuxAgent.
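For anyone who wants to check a node for this kind of conflict, here is a minimal sketch that groups interfaces by the hardware address Linux exposes under /sys/class/net. The helper names `read_interface_macs` and `find_duplicate_macs` are made up for illustration and are not part of WALinuxAgent or cloud-init. It also assumes the raw sysfs addresses are comparable; InfiniBand interfaces can report a longer hardware address there, so this may not reproduce exactly what the boot diagnostics reported.

```python
from collections import defaultdict
from pathlib import Path


def read_interface_macs(sysfs_root="/sys/class/net"):
    """Return {interface_name: address} as reported by Linux sysfs.

    Hypothetical helper for illustration; returns {} if the path is missing.
    """
    macs = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return macs
    for iface in sorted(root.iterdir()):
        try:
            macs[iface.name] = (iface / "address").read_text().strip().lower()
        except OSError:
            # Entry has no readable address file (not a network interface).
            continue
    return macs


def find_duplicate_macs(macs):
    """Return {address: [interfaces]} for addresses used by 2+ interfaces."""
    by_mac = defaultdict(list)
    for name, mac in macs.items():
        # Skip empty and all-zero addresses (e.g. the loopback device).
        if mac and mac != "00:00:00:00:00:00":
            by_mac[mac.lower()].append(name)
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}


if __name__ == "__main__":
    for mac, names in find_duplicate_macs(read_interface_macs()).items():
        print(f"{mac} is shared by: {', '.join(names)}")
```

Running it on an affected node (for example over the serial console, if that is available) prints one line per shared address and nothing if every interface has a unique one.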
@Villux Hi, were you able to find a solution for this? I'm facing the same issue. It took me a whole day and I still haven't been able to figure it out.
Not yet. I have an ongoing discussion with one of the engineering teams on this. I haven't heard from them in many weeks, so I'm not sure how actively they are working on it.
My current workaround is Ansible playbooks. I just use a plain DSVM image and install everything on the instances with Ansible. It adds extra time to the setup, but it's better than debugging this.
Thanks. Have you tried another Ubuntu version, like 16.04 LTS? Also, is this issue specific to the VM size? I'm using HC44rs.
I have only used Ubuntu 18. I have tried different NC sizes, but I don't think the issue is related to the VM size in any way.
I tried Ubuntu 16.04 LTS and it's working properly. I think the problem is specific to Ubuntu 18.04 LTS.
Thanks for the tip! I will have to try it as well.
@nervermore2 btw, here's a workaround for the issue if you use Ubuntu 18: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/hb-hc-known-issues#duplicate-mac-with-cloud-init-with-ubuntu-on-h-series-and-n-series-vms
@Villux @pgombar I don't think the solution works. I followed the instructions and still wasn't able to launch the image.
I had an image without RDMA. Creating a VMSS with that worked normally. I installed RDMA with these instructions https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/enable-infiniband#manually-install-mellanox-ofed and created a new image. After that I'm no longer able to access the VMs created with the new image. SSH just hangs and eventually times out. I have tested this a couple of times, and the RDMA steps seem to be what makes accessing the VMs fail.
I tried to enable boot diagnostics but couldn't get that to work. I have it on for the VMSS, but when I inspect the boot diagnostics setup for the VMs it tells me to enable it for the VMSS. I tried this a couple of times and restarted the VMs.
What could I do to fix this? I need RDMA installed in the image since I don't want to run the installation on each machine. I might potentially use close to a hundred VMs. The installation takes some time, so even automating it is not a good solution.