WukLab / LegoOS

Disseminated, Distributed OS for Hardware Resource Disaggregation. USENIX OSDI 2018 Best Paper.
http://LegoOS.io
GNU General Public License v2.0

Kernel panics saying: "not syncing - no RoCE" while another cluster infinitely "wait for enough IB MAD (number 7)" #8

Closed fyc1007261 closed 5 years ago

fyc1007261 commented 5 years ago

Thanks very much for your quick reply on my last issue!

I have been trying to build a 1P-1M kernel, but it panics with: "not syncing - no RoCE".

I am quite sure that I am using the identical Infiniband NIC, and Infiniband itself works well: the two machines can ping each other over it.

There are also some problems during make install, which says "Your kernel headers for kernel 4.0.0-lego+ cannot be found at /lib/modules/4.0.0-lego+/build or /lib/modules/4.0.0-lego+/source." Is this normal, or is it really an error? (I notice that modules are disabled in LegoOS, so should these directories exist at all?)

If that is not the cause, where do you think the problem might be?

By the way, I also tried installing LegoOS on two other machines with a different Infiniband NIC but exactly the same software configuration. That NIC is not identical to the recommended one. These two kernels do not panic, but they get stuck at "Please wait for enough IB MAD (number 7)" and never continue. Can a different Infiniband NIC cause this infinite wait?

I would really appreciate it if you can help!! Thanks again!

fyc1007261 commented 5 years ago

Hi @lastweek ! I have looked into the source code and found that LegoOS checks all the ports of the IB NIC. If any one of them is using Ethernet, it panics. However, the IB-capable NIC on CloudLab has two ports, one IB and one Ethernet; this is configured permanently and cannot be changed by software. Is it possible to use only the IB port for RDMA? If so, could you please give some instructions on modifying the LegoOS source code? Thanks a lot!

lastweek commented 5 years ago

Hi @fyc1007261,

Sorry for the inconvenience. The current driver does not support RoCE, so once RoCE is detected, it simply panics. (To be precise, I'm not sure whether it can work on RoCE. I forget whether I omitted some code regarding RoCE in mlx4.)

The error message from make install is fine, as long as the kernel image is installed in /boot and you can find it in the grub menu.

"Please wait for enough IB MAD (number 7)" means both machines are waiting for the MAD control messages from the Infiniband switch. For the 1P-1M configuration, you need an IB switch, and both machines must be connected to it. What configuration are you using?

fyc1007261 commented 5 years ago

Hi @lastweek , Thanks for your reply! I looked into the source code of LegoOS and found that if the driver finds that any of the ports of Infiniband-supporting NIC is using Ethernet, then it panics. However, the NIC on CloudLab has 2 ports with one using IB and the other using Ethernet. I may try to modify some source code of LegoOS to let it think there is only one port and use that port only. Is there anything that I should pay attention to?

For the IB MAD problem, I am using another IB-supporting NIC that is not Mellanox but Qlogic QLE instead. I suspect that it might not be supported by the driver...?

Thanks so much for your help!

lastweek commented 5 years ago

Hi @fyc1007261,

It's a LegoOS bug indeed. You should try the approach you proposed. Pay attention to the port number and make sure you are using the IB port.
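One way to check which port number is the IB one, before booting into LegoOS, is to read the `link_layer` attribute that Linux exposes in sysfs for each RDMA port. The following is a minimal sketch in Python, assuming it runs on the stock Linux image with the Mellanox driver loaded; the function names `find_rdma_ports` and `infiniband_only` are hypothetical helpers, not part of LegoOS.

```python
from pathlib import Path


def find_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return (device, port_number, link_layer) for every RDMA port found.

    link_layer is the string Linux reports, e.g. "InfiniBand" or "Ethernet".
    An empty list is returned when the sysfs directory does not exist.
    """
    ports = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return ports
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir(), key=lambda p: p.name):
            link_file = port / "link_layer"
            # Port directories are named by number ("1", "2", ...).
            if port.name.isdigit() and link_file.is_file():
                layer = link_file.read_text().strip()
                ports.append((dev.name, int(port.name), layer))
    return ports


def infiniband_only(ports):
    """Keep only the (device, port_number) pairs whose link layer is InfiniBand."""
    return [(dev, num) for dev, num, layer in ports if layer == "InfiniBand"]


if __name__ == "__main__":
    for dev, num, layer in find_rdma_ports():
        print(f"{dev} port {num}: {layer}")
```

On a dual-port NIC like the one described above, this should list one port as InfiniBand and the other as Ethernet. The InfiniBand port number is the one the modified driver needs to end up using.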

About the Qlogic QLE machine, are you running LegoOS on top of that? I don't think the mlx driver can run with that. Anyhow, can you tell me more about your hardware setup? Thanks.

fyc1007261 commented 5 years ago

Hi @lastweek ,

It is true that the driver does not support QLE NICs. I am now trying the 1P-1M setting with a Mellanox MX354A NIC and SX6036G/U1 IB switches. (The Mellanox IS5035 is not provided on CloudLab.)

fyc1007261 commented 5 years ago

Hi @lastweek , I finally succeeded in deploying the 1P-1M configuration by hard-coding all the num_ports variables to 1. Thanks a lot for your help!

lastweek commented 5 years ago

Cool!!! Would you mind sharing your solution with us? Being able to run on CloudLab is a big deal!!

fyc1007261 commented 5 years ago

Hi @lastweek ,

For the 1P-1M setting, I used the r320 hardware in Apt Utah with the CentOS 7 image and simply connected 3 raw PCs together (though only 2 are used currently). Then I modified the code in the drivers/ directory to hard-code all num_ports variables to 1 (because one of the r320 NIC's ports uses Ethernet). After this, I just followed the instructions you provided on GitHub.

As for the storage node, there might be some problem with the CentOS image on CloudLab, because I have not been able to install Linux 3.11.1 on it so far. I will keep trying.

lastweek commented 5 years ago

Cool!! Let me know if you have issues installing 3.11.1. A very concise instruction is: 1) Download 3.11.1 from kernel.org. 2) copy /boot/config-3.10.xxx (the default config) into linux-3.11.1/, 3) make oldconfig, 4) make modules_install && make install, 5) reboot into 3.11.1

Let me know how it goes!
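The steps above can be written down as an explicit command sequence. The following is a rough sketch in Python, not an official LegoOS script: the helper names `kernel_build_steps` and `run_steps` and the `jobs` parameter are assumptions made for illustration, and the sketch lists a plain `make` build step before `make modules_install`. It defaults to a dry run that only prints the commands. The two install steps normally need root.

```python
import shlex
import subprocess


def kernel_build_steps(base_config, jobs=1):
    """Return the build-and-install commands as a list of argv lists.

    base_config is the path of the distribution's default kernel config,
    which is copied to .config inside the kernel source tree.
    """
    return [
        ["cp", base_config, ".config"],
        ["make", "oldconfig"],        # may prompt for options new in this kernel
        ["make", f"-j{jobs}"],        # build the kernel image and modules
        ["make", "modules_install"],
        ["make", "install"],
    ]


def run_steps(steps, src_dir, dry_run=True):
    """Print each command; execute it inside src_dir only when dry_run is False."""
    for argv in steps:
        print("+", shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, cwd=src_dir, check=True)
```

Calling `run_steps(kernel_build_steps(path_to_default_config), "linux-3.11.1")` prints the plan without touching the system. Passing `dry_run=False` runs it, after which a reboot into the new kernel is still needed.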

fyc1007261 commented 5 years ago

Hi @lastweek ,

There is still something wrong with my CentOS or my 3.11.1 kernel, so I failed to install 3.11.1.

Is it possible to use a higher stable version such as 3.16.70? I found some differences in the kernel code that linux-modules uses. I plan to modify some implementations in linux-modules to fit 3.16.70. Did the newer kernel only change some interfaces, or did it change important internals in a way that may break LegoOS's storage node? In other words, is my plan likely to work?

Thanks a lot for your help!

lastweek commented 5 years ago

Hi @fyc1007261,

That might work. I've done similar things (porting some old RDMA code to a 4.x kernel). That time I changed some protection domain and some other stuff. However, this might be time-consuming and error-prone. Before you proceed, can you share more details on the installation failure, e.g., panic messages?

fyc1007261 commented 5 years ago

Hi @lastweek ,

I used these steps: cp /path/to/oldconfig .config -> make oldconfig (accepting the defaults for new options) -> make -> make modules_install -> make install. The 3.11.1 kernel showed nothing after I pressed Enter to select it in the boot loader. I also tried the same steps in VMware and got the same result. The VMware monitor says that the CPU of the guest OS has been disabled, and I cannot figure out where the problem is.

Have you ever run into such a problem, or could you please give some suggestions? Thanks!

lastweek commented 5 years ago

About your /path/to/oldconfig, which kernel version is it?

fyc1007261 commented 5 years ago

It is 3.10.0-957.12.2.el7.x86_64, which is the default version for CentOS 7 on CloudLab

lastweek commented 5 years ago

3.11.1config.txt

Hi @fyc1007261, I uploaded an old config file from our machine. Though the machine is different, do you wanna give it a try?

fyc1007261 commented 5 years ago

Thanks so much! I will try it soon and report to you later.

fyc1007261 commented 5 years ago

Hi @lastweek ,

Unfortunately, your config still won't work :( I may try QEMU to find out what's wrong inside the kernel.

By the way, I tried running the storage node with 3.16.70, but the processor monitor panicked with a fatal exception, saying BUG: unable to handle kernel paging request at ffff880439d28b10. Might that be an error caused by the differences between the two kernel versions?

Thanks a lot for your support!

fyc1007261 commented 5 years ago

Hi @lastweek ,

Thanks to your previous help, I have now succeeded in deploying 1P-1M-1S on Ubuntu 14.04 with the 3.11.1 kernel! I am able to run some simple Python scripts. It works quite well for printing messages, using Python's built-in modules (like time, copy, etc.) and using local modules (another Python script in the same folder). But when it comes to importing external modules (I tried numpy), the processor monitor panics, saying unable to handle kernel paging request at <some address>. I wonder whether LegoOS supports external modules. I am using Python 2.7 with pip 19.2.1 and numpy 1.16.4. Numpy was installed via pip. Could you please give some suggestions?

Thanks again for your patience!