Xilinx-CNS / onload

OpenOnload high performance user-level network stack

ERROR: user/driver cplane interface mismatch #103

Closed wrfeewrtgaqwfrwaq closed 1 year ago

wrfeewrtgaqwfrwaq commented 1 year ago

I'm trying to use onload with AF_XDP on an AWS bare-metal instance. I installed CentOS, updated the kernel to 5.4, updated the ena driver to 2.8.0, then downloaded and built onload. I used a simple ping-pong C++ application to test the connection between two machines. Without onload everything works as expected. When I add LD_PRELOAD I get:

oo:duplex[25737]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-2)
See kernel messages in dmesg or /var/log/syslog for more details of this failure
oo:duplex[25737]: __citp_netif_alloc: failed to construct netif (2)
oo:duplex[25737]: citp_netif_alloc_and_init: failed to create netif (2)

in dmesg I see:

[67800.192999] [onload] ERROR: user/driver cplane interface mismatch
[67800.193413] [onload]   user-interface: fddb61183a3957c1aafa7d5242077adf
[67800.193921] [onload]   driver-interface: cd7cb505818242843e1df4672241e3a0
[67800.196608] [onload] [7]: RX timestamping not supported on given interface (eth0)

What might be the cause of this issue?
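For anyone reproducing this, the two hashes can be pulled out of the kernel log with a short filter. This is only a sketch: the function name cplane_hashes is made up for illustration, and it assumes the log lines keep the "user-interface:" / "driver-interface:" layout shown above.

```shell
# Hypothetical helper, not part of onload.
# Reads kernel log text on stdin and prints "user=<hash> driver=<hash>",
# taking the last field of each matching line as the hash.
cplane_hashes() {
    awk '
        /user-interface:/   { user = $NF }
        /driver-interface:/ { driver = $NF }
        END { printf "user=%s driver=%s\n", user, driver }
    '
}
```

It would be fed from the log, e.g. `dmesg | cplane_hashes`. If several mismatch reports are present, only the most recent pair is printed.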

rhughes-xilinx commented 1 year ago

This happens when the userspace components and the kernel components were built from different source trees. My first guess would be that you changed some code (e.g. git pull) and then rebuilt everything but forgot to unload and reload the newly-rebuilt drivers.

wrfeewrtgaqwfrwaq commented 1 year ago

Just to be sure that's not the case, I pulled the newest version, compiled as per the development instructions, restarted the machines and reloaded:

Solarflare driverlink driver v5.3.13.1001 API v33.0
Solarflare NET driver v5.3.13.1001

From the log it seems that both versions are the same now.

When writing to /sys/module/sfc_resource/afxdp/register I got cp_set_hwport_xdp_prog_id: failed to notify about XDP program change, ifindex=3 rc=-1, but only on one host. Running the test application on this host doesn't work: network packets never arrive at the destination.

On the other machine, which initialized successfully, I still get the same version-mismatch error when running the application (despite a fresh build).

rhughes-xilinx commented 1 year ago

Let's ignore the failed to notify about XDP program change for now - it's not a critical feature (it's probably due to some permissions thing we didn't think of).

Can you post the commands you're using to do the rebuild? I expect you to be using the stuff under Building directly from repository from DEVELOPING.md

wrfeewrtgaqwfrwaq commented 1 year ago

I'm doing git pull followed by the 5 commands from Building directly from repository under DEVELOPING.md.

I'm compiling as root with gcc 9.2.0, if it matters.

rhughes-xilinx commented 1 year ago

Hmm, I'm stuck. The version check is done by the rule at the bottom of mk/site/cplane.mk, i.e. it's just MD5ing a bunch of header files. To debug this you need to check the two generated files in build/gnu_x86_64/cp_intf_ver.h (userspace) and build/x86_64_linux-$KVER/cp_intf_ver.h (kernelspace) and I guess compare them to manually doing the cat $(CP_INTF_HDRS) | md5sum. Which doesn't match the others?
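The comparison described above can be sketched as a pair of shell helpers. This is only an illustration: the function names are made up, and the list of headers that cplane.mk hashes (CP_INTF_HDRS) has to be read from the makefile itself; it is not reproduced here.

```shell
# Hypothetical helpers, not part of onload.

# Print the MD5 of the concatenation of the given files, in argument order.
md5_of_files() {
    cat -- "$@" | md5sum | cut -d' ' -f1
}

# Report whether two generated cp_intf_ver.h files are byte-identical.
compare_ver_headers() {
    if cmp -s -- "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, `compare_ver_headers build/gnu_x86_64/cp_intf_ver.h "build/x86_64_linux-$KVER/cp_intf_ver.h"` checks the two generated files, and `md5_of_files` can be run on the header list to reproduce the hash by hand. The headers must be passed in the same order the makefile uses, since the hash covers the concatenated stream.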

wrfeewrtgaqwfrwaq commented 1 year ago

So I diffed the files and they're exactly the same.

rhughes-xilinx commented 1 year ago

In that case I'm back to my 'are you absolutely sure you reloaded the drivers?' statement. If you run strings (or even just grep) on build/gnu_x86_64/lib/transport/unix/libcitransport0.so and build/x86_64_linux-$KVER/driver/linux/onload.ko then they should contain the literal string of the MD5 (as hex chars) that they came from. There's unfortunately no such easy way to check precisely what's currently loaded into the kernel, so just do a manual rmmod onload;insmod build/x86_64_linux-$KVER/driver/linux/onload.ko to be sure you've got what you think you've got.

wrfeewrtgaqwfrwaq commented 1 year ago

Every file contains multiple MD5-like strings, which complicates things a bit, but here are my results:

strings build/gnu_x86_64/lib/transport/unix/libcitransport0.so | egrep '^[0-9a-f]{8}*$'
f7e645a13529efc1f9435b4ce35b061f
60fc9a2c9ff868b5a8048e3a9ed72b10
1518b4f7ec6834a578c7a807736097ce
cd7cb505818242843e1df4672241e3a0
strings build/x86_64_linux-5.4.214-1.el7.elrepo.x86_64/driver/linux/onload.ko | egrep '^[0-9a-f]{8}*$'
f7e645a13529efc1f9435b4ce35b061f
f7e645a13529efc1f9435b4ce35b061f
cd7cb505818242843e1df4672241e3a0
60fc9a2c9ff868b5a8048e3a9ed72b10
cd7cb505818242843e1df4672241e3a0
dmesg | grep ERROR -A 3
[ 6924.169963] [onload] ERROR: user/driver cplane interface mismatch
[ 6924.170387] [onload]   user-interface: fddb61183a3957c1aafa7d5242077adf
[ 6924.170831] [onload]   driver-interface: cd7cb505818242843e1df4672241e3a0

So it seems as if I'm somehow using the user interface from an old compilation?
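Since each file holds several MD5-like strings, it can be easier to search for one specific hash than to eyeball the lists. A minimal sketch, assuming GNU grep; the function name files_with_hash is made up for illustration:

```shell
# Hypothetical helper, not part of onload.
# Print the names of the given files that contain the hash string.
# grep -a treats binaries as text, -F matches the hash literally,
# -l prints only the names of matching files.
files_with_hash() {
    hash=$1
    shift
    grep -a -l -F -- "$hash" "$@" 2>/dev/null
    return 0   # finding no match is a valid result, not an error
}
```

Running it with the user-interface hash from dmesg against libcitransport0.so and onload.ko prints nothing if neither contains it, which would suggest that some other userspace binary reported that hash.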

wrfeewrtgaqwfrwaq commented 1 year ago

But my LD_PRELOAD is exactly the file I'm testing: LD_PRELOAD=/home/centos/onload/build/gnu_x86_64/lib/transport/unix/libcitransport0.so

rhughes-xilinx commented 1 year ago

Ah, the app you're accelerating is probably not the thing that's complaining - it's likely to be onload_cp_server. Kill that process to ensure it gets restarted, and check the value of /sys/module/onload/parameters/cplane_server_path to see what it's going to run. You might have a weird copy lying around, especially if you loaded the drivers in some manual way rather than using load.sh.
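A quick way to run that check is sketched below. The function name is made up, and the only onload-specific detail it relies on is the sysfs parameter path named above; the optional argument exists so the helper can be tried without the module loaded.

```shell
# Hypothetical helper, not part of onload.
# Report what the onload module will launch as its control plane server.
# $1 (optional): file holding the path; defaults to the sysfs parameter.
check_cplane_server_path() {
    param=${1:-/sys/module/onload/parameters/cplane_server_path}
    if [ ! -r "$param" ]; then
        echo "cannot read $param"
        return 1
    fi
    path=$(cat -- "$param")
    if [ -x "$path" ]; then
        echo "ok: $path"
    else
        echo "missing or not executable: $path"
        return 1
    fi
}
```

An "ok" result only shows that the configured binary exists; a running onload_cp_server that was started from an older build still has to be killed so that it is restarted from that path.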

wrfeewrtgaqwfrwaq commented 1 year ago

So I think I found the problem.

/sys/module/onload/parameters/cplane_server_path

points to a broken file path:

/home/centos/onload/build/gnu_/tools/cplane/onload_cp_server

I've seen it before when trying to:

LD_PRELOAD="$(mmaketool --toppath)/build/$(mmaketool --userbuild)/lib/transport/unix/libcitransport0.so"

but I thought it was a mistake and corrected the path by hand. Is there a way to fix mmaketool to generate the correct path?

rhughes-xilinx commented 1 year ago

Looks like that's coming from line 93: gcc -dumpmachine | sed s/-.*$// | sed s/powerpc/ppc/. Can you see any reason that wouldn't output x86_64 like it does on my box?
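The effect of that pipeline can be reproduced without gcc by feeding it a target triplet directly. This is a sketch: the function names are made up, and the "gnu_" prefix is inferred from the build directory names in this thread rather than taken from the mmaketool source.

```shell
# Hypothetical helpers, not part of onload.

# Apply the same two sed steps to a target triplet passed as $1.
arch_from_triplet() {
    printf '%s\n' "$1" | sed 's/-.*$//' | sed 's/powerpc/ppc/'
}

# Build directory name as it appears in this thread (assumed "gnu_" + arch).
userbuild_name() {
    printf 'gnu_%s\n' "$(arch_from_triplet "$1")"
}
```

With a triplet such as x86_64-redhat-linux, `userbuild_name` prints gnu_x86_64. With an empty string, which is what the pipeline receives when gcc produces no output, it prints gnu_, matching the broken path above.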

wrfeewrtgaqwfrwaq commented 1 year ago

Lack of gcc in PATH for root. My mistake. After adding it to PATH, reloading, and killing onload_cp_server, it no longer generates the error.

Unfortunately, networking still doesn't go through.

Spawned daemon process 4032
[13310.928610] onload_cp_server[4032]: Onload Control Plane server 96fe67b 2022-09-23 master  started: id 1, pid 4032
[13310.929507] onload_cp_server[4032]: Accelerating eth0: RX 1 TX 1
[13310.929944] onload_cp_server[4032]: cp_set_hwport_xdp_prog_id: failed to notify about XDP program change, ifindex=3 rc=-1
[13310.930706] onload_cp_server[4032]: cp_set_hwport_xdp_prog_id: failed to notify about XDP program change, ifindex=3 rc=-1
[13310.933387] [sfc efrm] efrm_vi_rm_delayed_free: 000000006543e609
[13310.934532] [onload] [1]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[13311.058425] ena 0000:7e:00.0 eth0: Command parameter 46 is not supported
[13311.058881] [sfc efhw] rxclass_get_dev_info: rxclass: Cannot get RX class rule count
[13311.059400] [onload] oof_local_port_addr_fixup_wild: 1:2047 ERROR: FILTER TCP 172.31.0.132:45907 0.0.0.0:0 failed (-95)

It seems there's some problem with registering XDP rules now.
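For reading the numeric codes in these logs, a small lookup can help. It assumes the values are negated Linux errno numbers, which is the usual kernel convention but is not confirmed anywhere in this thread; the function name is made up and only the three codes seen here are covered.

```shell
# Hypothetical helper, not part of onload.
# Name a few Linux errno values; accepts the number with or without a minus.
errno_name() {
    n=${1#-}   # drop a leading minus sign if present
    case "$n" in
        1)  echo "EPERM (Operation not permitted)" ;;
        2)  echo "ENOENT (No such file or directory)" ;;
        95) echo "EOPNOTSUPP (Operation not supported)" ;;
        *)  echo "unknown" ;;
    esac
}
```

Under that assumption, the filter failure (-95) reads as an operation the device or driver does not support, rather than a configuration mistake.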

rhughes-xilinx commented 1 year ago

At that point you leave the area of my knowledge - it'll be something NIC-specific (typically something that's not supported by the NIC you're using) and I've never played with the ena NICs.

wrfeewrtgaqwfrwaq commented 1 year ago

Thanks for your help. I'll fight with the NIC now then.