Xilinx / RecoNIC

RecoNIC is a software/hardware shell used to enable network-attached processing within an RDMA-featured SmartNIC for scale-out computing.
MIT License

The IP version looks too old to compile on Alveo U45N/Alveo SN1022? #2

Closed PriceHuang closed 1 year ago

PriceHuang commented 1 year ago

Hi zhguanw, I am trying to run RecoNIC on the Alveo U45N/SN1022, but RecoNIC's build Tcl scripts use older IP versions: P4 uses v1.0 and ERNIC uses v3.1. ERNIC v3.1 does not work on the U45N's part (xcu26-vsva1365-2VL-e), which causes problems. FYI, I have already put OpenNIC on the U45N, and it works well.

zhguanw-amd commented 1 year ago

Hi @PriceHuang,

Thanks for your interest.

May I know which Vivado version you are using for the U45N and SN1022?

Upgrading VitisNetP4 should be very simple. Upgrading ERNIC from v3.1 to v4.0 requires some effort, as its register space has changed. We don't have any plans to migrate ERNIC from v3.1 to v4.0 this year.

If you're willing to migrate ERNIC from v3.1 to v4.0, we are happy to guide you. Otherwise, please stay tuned.

Best regards, Guanwen

PriceHuang commented 1 year ago

Thanks for your reply! I was using Vivado 2023.1 and have now switched to Vivado 2021.2, which solved the IP version problems. The problem now is that the driver does not work with the project: after loading the driver with insmod, I still cannot find the device with ifconfig. I see the error "onic_pci_probe : onic_enable_cmac () failed with -16".

Best regards, PriceHuang

zhguanw-amd commented 1 year ago

Hi @PriceHuang,

> The problem now is that the driver does not work with the project: after loading the driver with insmod, I still cannot find the device with ifconfig. I see the error "onic_pci_probe : onic_enable_cmac () failed with -16".

"onic_enable_cmac() failed with -16" means the CMAC component was not reset properly. Are you using the same Ubuntu and Linux kernel versions mentioned in the repo? We have only tested on Ubuntu 20.04 with Linux kernel version 5.4.0-125-generic.

BTW, in your current project, are you using the U45N FPGA board instead of the U250?

Best regards, Guanwen
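
For readers who hit the same probe failure, the two checks raised in the reply above can be sketched in a few lines of Python. This is only an illustration and not part of RecoNIC or the onic driver: the helper names `describe_probe_error` and `kernel_matches_tested` are hypothetical, and the sketch assumes it runs on a Linux host, where errno 16 is EBUSY. The tested kernel string is the one quoted in the reply.

```python
import errno
import os

# Kernel release the maintainer reports testing on (quoted in the thread).
TESTED_KERNEL = "5.4.0-125-generic"


def describe_probe_error(code: int) -> str:
    """Translate a negative kernel return code (e.g. -16) into its errno name."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


def kernel_matches_tested(release: str, tested: str = TESTED_KERNEL) -> bool:
    """Compare only the major.minor part of two release strings (hypothetical policy)."""
    def major_minor(rel: str) -> tuple:
        return tuple(int(part) for part in rel.split("-")[0].split(".")[:2])

    return major_minor(release) == major_minor(tested)


if __name__ == "__main__":
    print(describe_probe_error(-16))
    # On the affected host, os.uname().release would be passed here instead.
    print(kernel_matches_tested("5.15.0-84-generic"))
```

Comparing only major.minor is an arbitrary choice for the sketch; the thread does not say which kernel differences matter to the driver.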

PriceHuang commented 1 year ago

Yes, I am running Ubuntu 20.04, but the Linux kernel is 5.15.0-84-generic, and the board I am using is the U45N. I solved the problem by commenting out lines 1084 to 1086 of "onic_main.c". Now I have hit a new problem: in the rdma_test test case, QP2 goes into the FATAL state when I test a 128 KB payload in the SEND_RECV test.

Best regards, PriceHuang

zhguanw-amd commented 1 year ago

Hi @PriceHuang,

Good to hear that you solved the problem.

> Now I have hit a new problem: in the rdma_test test case, QP2 goes into the FATAL state when I test a 128 KB payload in the SEND_RECV test.

The issue is caused by "RQE_SIZE" (https://github.com/Xilinx/RecoNIC/blob/main/lib/rdma_api.h#L24), which gives 256*256 = 64 KB per RQ. You can simply set it to 512 to bypass the issue. I'll push an update soon to make rqe_size configurable.

Thanks, Guanwen
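
The arithmetic in the reply above can be checked with a short sketch. It assumes that one factor of 256 is RQE_SIZE and the other is the number of entries in the receive queue; the thread does not spell this out, so `rq_depth` and both function names are hypothetical and not taken from rdma_api.h.

```python
KIB = 1024


def rq_capacity_bytes(rqe_size: int, rq_depth: int = 256) -> int:
    """Total bytes one RQ can hold, assuming capacity = rqe_size * rq_depth."""
    return rqe_size * rq_depth


def payload_fits(payload_bytes: int, rqe_size: int, rq_depth: int = 256) -> bool:
    """True if a single payload fits in the RQ buffer under the same assumption."""
    return payload_bytes <= rq_capacity_bytes(rqe_size, rq_depth)


if __name__ == "__main__":
    payload = 128 * KIB
    for rqe_size in (256, 512):
        capacity_kib = rq_capacity_bytes(rqe_size) // KIB
        print(rqe_size, capacity_kib, payload_fits(payload, rqe_size))
```

Under that assumption, the default of 256 gives 64 KB, which is too small for a 128 KB payload, while 512 gives exactly 128 KB, so the payload just fits.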

PriceHuang commented 1 year ago

Hi @zhguanw-amd, later that night I rebooted the host, which solved the problem. I found that the RecoNIC driver runs in userspace, so maybe the problem is caused by a memory overflow? I can find "memory allocate" operations but no "memory free" operations.

Thanks, PriceHuang

zhguanw-amd commented 1 year ago

Hi @PriceHuang,

In send_recv.c, there are some buffers that we forgot to free, but that has nothing to do with the QP fatal error. If you check ./lib/*, you will find the free operations. QP fatal means that some ERNIC registers hold wrong values. In your case, you are trying to send a payload larger than an RQ buffer can accommodate.

We will update the send_recv example later. BTW, we'll also add hardware optimizations gradually. Please stay tuned.

Thanks, Guanwen