Closed: osresearch closed this issue 3 months ago
Hi @osresearch, thanks for raising this issue.
My understanding from reading your description is that you installed an older version of onload (e.g., v8.1.2), then built eflatency from the tag v8.1.3 and tried to run it under those conditions. Is this correct? These symptoms do indeed look like a mismatch between kernel and userspace, but I expect the xilinx_efct driver isn't the cause; it is more likely due to an older onload kernel module (one without the mentioned patch that bumps CI_EFCT_MAX_SUPERBUFS). Would you be able to verify that the onload user version and kernel version are in sync? Running ./scripts/onload by itself prints a header with this information, for example:
$ ./scripts/onload
Onload 8.1.3
Copyright (c) 2002-2024 Advanced Micro Devices, Inc.
Built: Apr 8 2015 16:20:35 (debug)
Build profile header: <ci/internal/transport_config_opt_cloud.h>
Kernel module: 8.1.3
[-snip usage-]
You have correctly identified that there was a comment suggesting that data structures before that patch were unable to cope with more than 512 superbufs, but surrounding work meant that was no longer the case (cc. @ligallag-amd who did quite a bit of work in this area).
If this isn't the case, would you be able to provide some more information about how to reproduce this issue? Thanks!
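As a rough illustration of the check described above, the following Python sketch parses a header in the format shown in the example output and compares the userspace version with the kernel module version. The function names are hypothetical, and it assumes the header lines always begin with "Onload " and "Kernel module: " as in the sample. It is not part of onload.

```python
import re


def parse_onload_versions(header_text):
    """Return (user_version, kernel_version); either is None if not found."""
    user = re.search(r"^Onload\s+(\S+)", header_text, re.MULTILINE)
    kernel = re.search(r"^Kernel module:\s*(\S+)", header_text, re.MULTILINE)
    return (
        user.group(1) if user else None,
        kernel.group(1) if kernel else None,
    )


def versions_in_sync(header_text):
    """True only when both versions were found and are identical."""
    user, kernel = parse_onload_versions(header_text)
    return user is not None and user == kernel


# Header copied from the example output above (usage text omitted).
SAMPLE = """Onload 8.1.3
Copyright (c) 2002-2024 Advanced Micro Devices, Inc.
Built: Apr 8 2015 16:20:35 (debug)
Build profile header: <ci/internal/transport_config_opt_cloud.h>
Kernel module: 8.1.3
"""

if __name__ == "__main__":
    user, kernel = parse_onload_versions(SAMPLE)
    print(f"user={user} kernel={kernel} in_sync={versions_in_sync(SAMPLE)}")
```

To check a live system, the text could be captured with subprocess.run(["./scripts/onload"], capture_output=True, text=True) and the stdout and stderr passed to the same functions. Which stream the header is written to is an assumption here.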
The wrong onload kernel module probably explains the issue. I had rebuilt the user-space portions with the v8.1.3 tag and loaded the latest xilinx_efct module, but not built the updated onload kernel module (due to my own kernel version skew between the build server and the test rack). I'll track down the right kernel headers so that I can build the kernel module and retest with the matching version.
Just to add, @osresearch: src/include/etherfabric/doxygen/040_using.dox:34 gives some information on our user/kernel API.
Thanks for the suggestion -- building the rest of the kernel modules with v8.1.3 made everything work together again. I need to re-run some tests to see if this also fixed the CTPIO fallback poison issue that I was seeing.
It's a little surprising that a backwards incompatible change would be made in a patch-level release and I wonder if there is a way for the kernel to expose a hash of supported headers so that the wrong user space libraries won't try to talk to the wrong kernel modules.
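To make the idea of a header hash concrete, here is a minimal Python sketch of the general technique: hash the contents of the shared interface headers on each side and refuse to proceed when the digests differ. It is only an illustration of the concept and is not how onload implements its own checks. The header file name, the function names, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib


def interface_hash(headers):
    """Hash a dict of {header_name: bytes_content} in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(headers):
        # Separators keep name/content boundaries unambiguous.
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(headers[name])
        digest.update(b"\0")
    return digest.hexdigest()


def check_compatible(user_hash, kernel_hash):
    """Raise if userspace and kernel were built from different headers."""
    if user_hash != kernel_hash:
        raise RuntimeError(
            f"interface mismatch: user {user_hash[:12]} != kernel {kernel_hash[:12]}"
        )


# Hypothetical header contents standing in for the two builds in this thread.
KERNEL_HEADERS = {"efct_limits.h": b"#define CI_EFCT_MAX_SUPERBUFS 512\n"}
USER_HEADERS = {"efct_limits.h": b"#define CI_EFCT_MAX_SUPERBUFS 2048\n"}

if __name__ == "__main__":
    try:
        check_compatible(interface_hash(USER_HEADERS), interface_hash(KERNEL_HEADERS))
    except RuntimeError as err:
        print(f"refusing to attach: {err}")
```

With a check like this, a constant that changes in a shared header changes the digest, so a mismatched pair fails early with a clear error instead of crashing later.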
There might be version skew between onload 8.1.3 and the xilinx_efct 1.6.3.0 driver for our X3 NICs. Using tags v8.1.1 and v8.1.2 works fine, but building eflatency from the tag v8.1.3 and running a simple pong segfaults in efct_rx_next_header():
I believe that changing the number of superbufs from 512 to 2048 in patch https://github.com/Xilinx-CNS/onload/commit/38130d26274ec04020d2ab08586b3225da489199 caused the problem. There was a comment that "With current data structures, the value should be left at 512", so are there some structures in the kernel driver that also need to be updated? If I revert this patch and update the header hash, onload 8.1.3 works with our X3 cards: