Closed fyc1007261 closed 5 years ago
Hi @lastweek! I have looked into the source code and found that LegoOS checks all the ports of the IB NIC. If any of them is using Ethernet, it panics. However, the IB-capable NIC on CloudLab has two ports, one IB and one Ethernet. This is configured permanently and cannot be changed in software. Is it possible to use only the IB port to do the RDMA? If so, could you please give some instructions on modifying the Lego source code? Thanks a lot!
Hi @fyc1007261,
Sorry for the inconvenience. The current driver does not support RoCE, so once RoCE is detected it simply panics. (To be precise, I'm not sure whether it can work on RoCE. I forget whether I omitted some RoCE-related code in mlx4.)
The error message from make install is fine, as long as the kernel image is installed in /boot and you can find it in the GRUB menu.
"Please wait for enough IB MAD (number 7)" means both machines are waiting for the MAD control messages from the InfiniBand switch. For the 1P-1M configuration, you need an IB switch, and both machines must be connected to it. What configuration are you using?
Hi @lastweek , Thanks for your reply! I looked into the source code of LegoOS and found that if the driver finds that any of the ports of Infiniband-supporting NIC is using Ethernet, then it panics. However, the NIC on CloudLab has 2 ports with one using IB and the other using Ethernet. I may try to modify some source code of LegoOS to let it think there is only one port and use that port only. Is there anything that I should pay attention to?
For the IB MAD problem, I am using another IB-supporting NIC that is not Mellanox but Qlogic QLE instead. I suspect that it might not be supported by the driver...?
Thanks so much for your help!
Hi @fyc1007261,
It's a LegoOS bug indeed. You should try the approach you proposed. Pay attention to the port number and make sure you are using the IB port.
About the Qlogic QLE machine, are you running LegoOS on top of that? I don't think the mlx driver can run on that. Anyhow, can you tell me more about your hardware setup? Thanks.
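One way to confirm which port is the IB port before touching the driver is to ask a stock Linux install on the same machine (not LegoOS itself). The sketch below assumes the standard Linux sysfs layout, /sys/class/infiniband/<device>/ports/<n>/link_layer, where each file reads "InfiniBand" or "Ethernet". The function names are made up for illustration and are not part of LegoOS.

```python
import os


def ib_port_link_layers(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port_number): link_layer} for every RDMA port found."""
    layers = {}
    if not os.path.isdir(sysfs_root):
        # No RDMA devices, or the RDMA core modules are not loaded.
        return layers
    for device in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, device, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            path = os.path.join(ports_dir, port, "link_layer")
            if not port.isdigit() or not os.path.isfile(path):
                continue
            with open(path) as f:
                layers[(device, int(port))] = f.read().strip()
    return layers


def infiniband_ports(layers):
    """Keep only the ports whose link layer is InfiniBand (not Ethernet)."""
    return sorted(key for key, layer in layers.items() if layer == "InfiniBand")


if __name__ == "__main__":
    found = ib_port_link_layers()
    for (device, port), layer in sorted(found.items()):
        print("%s port %d: %s" % (device, port, layer))
    print("InfiniBand ports: %s" % infiniband_ports(found))
```

If the IB port turns out not to be port 1, hard-coding the port count to 1 may select the Ethernet port instead, so the port number is worth checking first.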
Hi @lastweek ,
It is true that the driver does not support QLE NICs. I am now trying the 1P-1M setting with Mellanox MX354A NICs and SX6036G/U1 IB switches. (The Mellanox IS5035 is not provided on CloudLab.)
Hi @lastweek ,
I finally succeeded in deploying the 1P-1M configuration by hard-coding all the num_ports variables to 1. Thanks a lot for your help!
Cool!!! Would you mind sharing your solution with us? Being able to run on CloudLab is a big deal!!
Hi @lastweek ,
For the 1P-1M setting, I used the r320 hardware in Apt Utah with the CentOS 7 image and simply connected 3 raw PCs together (though only 2 are used currently). Then I modified the code in the drivers/ directory to hard-code all num_ports variables to 1 (because one port of the r320 NIC uses Ethernet). After this, I just followed the instructions you provided on GitHub.
As for the Storage node, there might be some problem with the CentOS image on CloudLab, as I have not been able to install Linux 3.11.1 on it so far. I will keep trying.
Cool!! Let me know if you have issues installing 3.11.1. A very concise set of instructions: 1) download 3.11.1 from kernel.org, 2) copy /boot/config-3.10.xxx (the default config) into linux-3.11.1/, 3) make oldconfig, 4) make modules_install && make install, 5) reboot into 3.11.1.
Let me know how it goes!
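The build and install steps above can be written down as a small dry-run script, which makes it easier to compare exactly what was run on each machine. This is only a sketch. The function names are hypothetical, the explicit make build step is an addition (modules_install needs built modules), and the install steps need root when run for real.

```python
import subprocess


def kernel_build_plan(old_config, jobs=4):
    """Return the build/install commands, in order, as argv lists."""
    return [
        ["cp", old_config, ".config"],  # reuse the distro config as a base
        ["make", "oldconfig"],          # prompts for options new in this kernel
        ["make", "-j%d" % jobs],        # build the kernel image and modules
        ["make", "modules_install"],
        ["make", "install"],
    ]


def run_plan(plan, source_dir, dry_run=True):
    """Print each command; run it inside source_dir unless dry_run is set."""
    for argv in plan:
        print("+ " + " ".join(argv))
        if not dry_run:
            subprocess.check_call(argv, cwd=source_dir)


if __name__ == "__main__":
    # Both paths are placeholders; substitute the real config and source tree.
    run_plan(kernel_build_plan("/boot/config-3.10.xxx"), "linux-3.11.1")
```

With dry_run left at its default, the script only prints the commands, so it is safe to run anywhere. Rebooting into the new kernel is still a manual step.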
Hi @lastweek ,
There is still something wrong with my CentOS or my 3.11.1 kernel, so I failed to install 3.11.1.
Is it possible to use a higher stable version such as 3.16.70? I found some differences in the kernel code that linux-modules uses. I plan to modify some implementations in linux-modules to fit version 3.16.70. Did the newer kernel just modify some interfaces, or has it changed important internal code that may cause LegoOS's storage node to fail? In other words, is it possible that my plan will work?
Thanks a lot for your help!
Hi @fyc1007261,
That might work. I've done similar things (ported some old RDMA code to a 4.x kernel). That time I changed some protection domain code and some other stuff. However, this might be time-consuming and error-prone. Before you proceed, can you share more details on the installation failure, e.g., panic messages?
Hi @lastweek ,
I used the following steps: cp /path/to/oldconfig .config -> make oldconfig (accepting the defaults for new options) -> make -> make modules_install -> make install. The 3.11.1 kernel just didn't show anything after I pressed Enter to select 3.11.1 at the boot loader. I also tried the same steps in VMware and got the same result. The VMware monitor says that the CPU of the guest OS has been disabled, and I cannot figure out where the problem is.
Have you ever met such problems or could you please give some suggestions? Thanks!
About your /path/to/oldconfig, which kernel version is it?
It is 3.10.0-957.12.2.el7.x86_64, which is the default version for CentOS 7 on CloudLab
Hi @fyc1007261, I uploaded an old config file from our machine. Though the machine is different, do you wanna give it a try?
Thanks so much! I will try it soon and report to you later.
Hi @lastweek ,
Unfortunately, your config still won't work :( I may try QEMU to find out what's wrong inside the kernel.
By the way, I tried running the storage node with 3.16.70, but the processor monitor panicked with a fatal exception, saying BUG: unable to handle kernel paging request at ffff880439d28b10. Might that be an error caused by the difference between the two kernel versions?
Thanks a lot for your support!
Hi @lastweek ,
Thanks to your previous help, I have now succeeded in deploying 1P-1M-1S on Ubuntu 14.04 with the 3.11.1 kernel! I am able to run some simple Python scripts. It works quite well for printing messages, using Python's standard modules (like time, copy, etc.) and using local modules (another Python script in the same folder). But when it comes to importing external modules (I tried numpy), the processor monitor panics, saying unable to handle kernel paging request at <some address>. I wonder whether LegoOS supports external modules. I am using Python 2.7 with pip 19.2.1 and numpy 1.16.4. Numpy was installed via pip. Could you please give some suggestions?
Thanks again for your patience!
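A small probe script can help narrow down a report like this one. It prints and flushes a line before each import, so the last line that reaches the console names the import that was in progress when the monitor panicked. This is only a reproduction aid and does not diagnose the panic itself. The helper name probe_imports is made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import importlib
import sys


def probe_imports(names):
    """Import each module in order, announcing it first; return (name, ok) pairs."""
    results = []
    for name in names:
        print("importing %s ..." % name)
        sys.stdout.flush()  # make sure the line is out before a possible crash
        try:
            importlib.import_module(name)
            ok = True
        except ImportError as exc:
            print("  import failed: %s" % exc)
            ok = False
        results.append((name, ok))
    return results


if __name__ == "__main__":
    # Standard-library modules first, then the external module under suspicion.
    for name, ok in probe_imports(["time", "copy", "numpy"]):
        print("%s: %s" % (name, "ok" if ok else "not importable"))
```

Adding more module names to the list, ordered from the simplest to the most complex, gives a finer-grained picture of where the failure starts.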
Thanks very much for your quick reply on my last issue!
I have been repeatedly trying to build a 1P-1M kernel, but it panics saying: "not syncing - no RoCE".
I am quite sure that I am using the identical InfiniBand NIC, and InfiniBand itself works well: the two machines are able to ping each other over it.
There are also some problems when running make install, which says "Your kernel headers for kernel 4.0.0-lego+ cannot be found at /lib/modules/4.0.0-lego+/build or /lib/modules/4.0.0-lego+/source." Is this normal, or is it really an error? (I notice that modules are disabled in LegoOS, so should these directories exist at all?) If it is an error, where do you think the problem might be?
By the way, I also tried installing LegoOS on two other machines with a different InfiniBand NIC but exactly the same software configuration. The NIC is not identical to the recommended one. These two kernels do not panic, but they get stuck at "Please wait for enough IB MAD (number 7)" and never continue. Can a different InfiniBand NIC cause this infinite wait?
I would really appreciate it if you can help!! Thanks again!