Xilinx-CNS / onload

OpenOnload high performance user-level network stack

onload 8.1.3 segfaults with xilinx_efct 1.6.3.0 #237

Closed by osresearch 3 months ago

osresearch commented 3 months ago

There might be version skew between onload 8.1.3 and the xilinx_efct 1.6.3.0 driver for our X3 NICs. Builds from tags v8.1.1 and v8.1.2 work fine, but building eflatency from the tag v8.1.3 and running a simple pong segfaults in efct_rx_next_header():

Starting program: eflatency-8.1.3 -s 0:1024:128 -c 1500 pong ens1f0np0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# ef_vi_version_str: 81c7484f 2024-05-28 v8.1.3 ef_vi
# NIC(s) 10 -1
# udp payload len: 0:1024:128
# iterations: 100000
# warmups: 10000
# frame len: 42
# mode: X3 CTPIO

Program received signal SIGSEGV, Segmentation fault.
efct_rx_next_header (vi=0x55555557c0c0 <rx_vi>, vi=0x55555557c0c0 <rx_vi>,
    next=196609) at /home/thudson/onload/src/lib/ciul/efct_vi.c:249
249     /home/thudson/onload/src/lib/ciul/efct_vi.c: No such file or directory.
(gdb) where
#0  efct_rx_next_header (vi=0x55555557c0c0 <rx_vi>, vi=0x55555557c0c0 <rx_vi>,
    next=196609) at /home/thudson/onload/src/lib/ciul/efct_vi.c:249
#1  efct_poll_rx (vi=vi@entry=0x55555557c0c0 <rx_vi>, qid=<optimized out>,
    evs=evs@entry=0x55555557c538 <rx_vi+1144>, evs_len=<optimized out>,
    evs_len@entry=2) at /home/thudson/onload/src/lib/ciul/efct_vi.c:899
#2  0x0000555555566f68 in efct_ef_eventq_poll (vi=0x55555557c0c0 <rx_vi>,
    evs=0x55555557c538 <rx_vi+1144>, evs_len=2)
    at /home/thudson/onload/src/lib/ciul/efct_vi.c:1075
...

I believe that changing the number of superbufs from 512 to 2048 in commit https://github.com/Xilinx-CNS/onload/commit/38130d26274ec04020d2ab08586b3225da489199 caused the problem. There was a comment that "With current data structures, the value should be left at 512", so are there some structures in the kernel driver that also need to be updated? If I revert this patch and update the header hash, onload 8.1.3 works with our X3 cards:

diff --git a/src/include/etherfabric/internal/efct_uk_api.h b/src/include/etherfabric/internal/efct_uk_api.h
index c82526b2..1d5b3c93 100644
--- a/src/include/etherfabric/internal/efct_uk_api.h
+++ b/src/include/etherfabric/internal/efct_uk_api.h
@@ -15,7 +15,7 @@

 /* Max superbufs permitted to be assigned to a single rxq, across the whole
  * system. */
-#define CI_EFCT_MAX_SUPERBUFS   2048
+#define CI_EFCT_MAX_SUPERBUFS   512

 /* As defined by the CPU architecture */
 #define CI_HUGEPAGE_SIZE   2097152
diff --git a/src/lib/ciul/efct_vi.c b/src/lib/ciul/efct_vi.c
index 1921438e..6b27639c 100644
--- a/src/lib/ciul/efct_vi.c
+++ b/src/lib/ciul/efct_vi.c
@@ -48,7 +48,7 @@ struct efct_rx_descriptor
  */

 #define PKT_ID_PKT_BITS  16
-#define PKT_ID_SBUF_BITS 11
+#define PKT_ID_SBUF_BITS  9
 #define PKT_ID_RXQ_BITS   3
 #define PKT_ID_TOTAL_BITS (PKT_ID_PKT_BITS + PKT_ID_SBUF_BITS + PKT_ID_RXQ_BITS)

diff --git a/src/lib/ciul/pt_endpoint.c b/src/lib/ciul/pt_endpoint.c
index 8b724c4b..c43ec03c 100644
--- a/src/lib/ciul/pt_endpoint.c
+++ b/src/lib/ciul/pt_endpoint.c
@@ -370,7 +370,7 @@ void ef_vi_set_intf_ver(char* intf_ver, size_t len)
    * It'd also be possible to enhance the checksum computation to be smarter
    * (e.g. by ignoring comments, etc.).
    */
-  if( strcmp(EFCH_INTF_VER, "0e632659c5a27e20c837adb613029099") ) {
+  if( strcmp(EFCH_INTF_VER, "f9c599f84c3be363a27114f34c18c638") ) {
     fprintf(stderr, "ef_vi: ERROR: char interface has changed\n");
     abort();
   }
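Presumably the reason these two values have to move in lockstep is that the superbuf field of the packet id must be wide enough to index every superbuf the kernel may hand out; the bit widths bear this out (2^11 = 2048, 2^9 = 512). A minimal sketch of that invariant using the reverted constants from the diff above (the static assertion is plain C11, not an onload macro):

/* Illustrative only: constants copied from the revert above; the assertion
 * is a generic C11 static_assert rather than onload's own build assert. */
#include <assert.h>
#include <stdio.h>

#define CI_EFCT_MAX_SUPERBUFS  512   /* reverted from 2048 */

#define PKT_ID_PKT_BITS  16
#define PKT_ID_SBUF_BITS  9          /* reverted from 11 */
#define PKT_ID_RXQ_BITS   3

/* The sbuf field of a packet id must be able to name every superbuf the
 * kernel may assign, so 1 << PKT_ID_SBUF_BITS must cover CI_EFCT_MAX_SUPERBUFS. */
static_assert((1u << PKT_ID_SBUF_BITS) >= CI_EFCT_MAX_SUPERBUFS,
              "packet-id sbuf field too narrow for CI_EFCT_MAX_SUPERBUFS");

int main(void)
{
  printf("superbufs addressable by the packet id: %u\n",
         1u << PKT_ID_SBUF_BITS);
  return 0;
}
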
jfeather-amd commented 3 months ago

Hi @osresearch, thanks for raising this issue.

My understanding from your description is that you installed an older version of onload (e.g. v8.1.2), then built eflatency from the tag v8.1.3 and ran it against that installation; is that correct? These symptoms do indeed look like a mismatch between kernel and userspace, but I expect the xilinx_efct driver isn't the cause; more likely it is an older onload kernel module (one without the mentioned patch that bumps CI_EFCT_MAX_SUPERBUFS). Would you be able to verify that the onload user and kernel versions are in sync? Running ./scripts/onload by itself prints a header with this information, for example:

$ ./scripts/onload
Onload 8.1.3
Copyright (c) 2002-2024 Advanced Micro Devices, Inc.
Built: Apr  8 2015 16:20:35 (debug)
Build profile header: <ci/internal/transport_config_opt_cloud.h>
Kernel module: 8.1.3
[-snip usage-]
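
On the userspace side, the ef_vi library can also report both its own build string and the char-interface hash it was compiled against, which makes this kind of skew easier to spot from an application (eflatency already prints the former, as in your log). A minimal sketch, assuming ef_vi_version_str() and ef_vi_driver_interface_str() from etherfabric/ef_vi.h are available in your build:

/* Print the ef_vi build string and the char interface hash this binary was
 * compiled against; link against the ciul library from the same onload
 * build tree (exact library name may vary between builds). */
#include <stdio.h>
#include <etherfabric/ef_vi.h>

int main(void)
{
  printf("ef_vi_version_str:          %s\n", ef_vi_version_str());
  printf("ef_vi_driver_interface_str: %s\n", ef_vi_driver_interface_str());
  return 0;
}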

You have correctly identified the comment suggesting that the data structures before that patch could not cope with more than 512 superbufs, but surrounding work meant that was no longer the case by the time the limit was raised (cc @ligallag-amd, who did quite a bit of work in this area).

If this isn't the case, would you be able to provide some more information about how to reproduce this issue? Thanks!

osresearch commented 3 months ago

The wrong onload kernel module probably explains the issue. I had rebuilt the user-space portions with the v8.1.3 tag and loaded the latest xilinx_efct module, but not built the updated onload kernel module (due to my own kernel version skew between the build server and the test rack). I'll track down the right kernel headers so that I can build the kernel module and retest with the matching version.

ligallag-amd commented 3 months ago

Just to add, @osresearch: src/include/etherfabric/doxygen/040_using.dox:34 gives some information on our user/kernel API.

osresearch commented 3 months ago

Thanks for the suggestion -- building the rest of the kernel modules from v8.1.3 made everything work together again. I need to re-run some tests to see whether this also fixed the CTPIO fallback poison issue I was seeing.

It's a little surprising that a backwards-incompatible change would be made in a patch-level release. I also wonder whether there is a way for the kernel to expose a hash of the supported headers, so that mismatched user-space libraries refuse to talk to the wrong kernel modules rather than crashing.
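
Something along the lines of the existing EFCH_INTF_VER check, but comparing against a value that the running kernel module exports, could look roughly like the sketch below. This is purely illustrative: the sysfs node is invented, and the hash is simply the one from the diff above.

/* Hypothetical illustration only: compare a hash the kernel module could
 * export at runtime (the sysfs path below is invented, not a real onload
 * interface) against the hash baked into the user library, and refuse to
 * run on a mismatch instead of crashing later. */
#include <stdio.h>
#include <string.h>

#define EFCH_INTF_VER "0e632659c5a27e20c837adb613029099"  /* compiled-in hash */

static int intf_ver_matches(void)
{
  char kernel_ver[64] = "";
  FILE* f = fopen("/sys/module/onload/parameters/intf_ver", "r"); /* hypothetical */
  if( f == NULL )
    return -1;                /* module not loaded, or too old to report a hash */
  if( fgets(kernel_ver, sizeof(kernel_ver), f) == NULL )
    kernel_ver[0] = '\0';
  fclose(f);
  kernel_ver[strcspn(kernel_ver, "\n")] = '\0';
  return strcmp(kernel_ver, EFCH_INTF_VER) == 0;
}

int main(void)
{
  if( intf_ver_matches() != 1 ) {
    fprintf(stderr, "ef_vi: user/kernel interface mismatch, refusing to run\n");
    return 1;
  }
  printf("user/kernel interface hashes match\n");
  return 0;
}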