Multiple errors when compiling l2fwd-nv

I get multiple errors when trying to compile the l2fwd-nv example.

These are my NVIDIA driver and CUDA versions:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

The errors I have encountered are detailed below:

First, the link to download Mellanox OFED 5.4 (http://www.mellanox.com/page/products_dyn?product_family=26) is broken, so I could not install it.

Second, after following all the steps described in the README (except installing MOFED), I got the following warning when I ran the cmake .. command:

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "l2fwdnv".
This warning is for project developers.  Use -Wno-dev to suppress it.

Assuming that it did not affect the compilation, I continued with the README commands, and after running the make -j$(nproc --all) command I got the following error:

[ 33%] Building CUDA object CMakeFiles/l2fwdnv.dir/src/kernel.cu.o
/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_common.h(879): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_log.h(291): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_log.h(320): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_debug.h(69): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h(41): error: identifier "__builtin_ia32_serialize" is undefined

1 error detected in the compilation of "/home/user/Documentos/l2fwd-nv/src/kernel.cu".
make[2]: *** [CMakeFiles/l2fwdnv.dir/build.make:76: CMakeFiles/l2fwdnv.dir/src/kernel.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:139: CMakeFiles/l2fwdnv.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Assuming that I could comment out that source line, I edited /usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h and commented out the call to the __builtin_ia32_serialize function on line 41. Despite all the previous errors, I compiled again and it "worked".

Third, when I ran ./l2fwdnv -h, I got output entirely different from the one shown in the GitHub README (in fact, the arguments documented in this repository do not work):

************ L2FWD-NV ************

EAL: Detected CPU lcores: 24
EAL: Detected NUMA nodes: 1

Usage: ./l2fwdnv [options]

EAL common options:
  -c COREMASK         Hexadecimal bitmask of cores to run on
  -l CORELIST         List of cores to run on
                      The argument format is <c1>[-c2][,c3[-c4],...]
                      where c1, c2, etc are core indexes between 0 and 128
  --lcores COREMAP    Map lcore set to physical cpu set
                      The argument format is
                            '<lcores[@cpus]>[<,lcores[@cpus]>...]'
                      lcores and cpus list are grouped by '(' and ')'
                      Within the group, '-' is used for range separator,
                      ',' is used for single number separator.
                      '( )' can be omitted for single element group,
                      '@' can be omitted if cpus and lcores have the same value
  -s SERVICE COREMASK Hexadecimal bitmask of cores to be used as service cores
  --main-lcore ID     Core ID that is used as main
  --mbuf-pool-ops-name Pool ops name for mbuf to use
  -n CHANNELS         Number of memory channels
  -m MB               Memory to allocate (see also --socket-mem)
  -r RANKS            Force number of memory ranks (don't detect)
  -b, --block         Add a device to the blocked list.
                      Prevent EAL from using this device. The argument
                      format for PCI devices is <domain:bus:devid.func>.
  -a, --allow         Add a device to the allow list.
                      Only use the specified devices. The argument format
                      for PCI devices is <[domain:]bus:devid.func>.
                      This option can be present several times.
                      [NOTE: allow cannot be used with block option]
  --vdev              Add a virtual device.
                      The argument format is <driver><id>[,key=val,...]
                      (ex: --vdev=net_pcap0,iface=eth2).
  --iova-mode   Set IOVA mode. 'pa' for IOVA_PA
                      'va' for IOVA_VA
  -d LIB.so|DIR       Add a driver or driver directory
                      (can be used multiple times)
  --vmware-tsc-map    Use VMware TSC map instead of native RDTSC
  --proc-type         Type of this process (primary|secondary|auto)
  --syslog            Set syslog facility
  --log-level=<level> Set global log level
  --log-level=<type-match>:<level>
                      Set specific log level
  --log-level=help    Show log types and levels
  --trace=<regex-match>
                      Enable trace based on regular expression trace name.
                      By default, the trace is disabled.
                      User must specify this option to enable trace.
  --trace-dir=<directory path>
                      Specify trace directory for trace output.
                      By default, trace output will created at
                      $HOME directory and parameter must be
                      specified once only.
  --trace-bufsz=<int>
                      Specify maximum size of allocated memory
                      for trace output for each thread. Valid
                      unit can be either 'B|K|M' for 'Bytes',
                      'KBytes' and 'MBytes' respectively.
                      Default is 1MB and parameter must be
                      specified once only.
  --trace-mode=<o[verwrite] | d[iscard]>
                      Specify the mode of update of trace
                      output file. Either update on a file can
                      be wrapped or discarded when file size
                      reaches its maximum limit.
                      Default mode is 'overwrite' and parameter
                      must be specified once only.
  -v                  Display version information on startup
  -h, --help          This help
  --in-memory   Operate entirely in memory. This will
                      disable secondary process support
  --base-virtaddr     Base virtual address
  --telemetry   Enable telemetry support (on by default)
  --no-telemetry   Disable telemetry support
  --force-max-simd-bitwidth Force the max SIMD bitwidth

EAL options for DEBUG use only:
  --huge-unlink[=existing|always|never]
                      When to unlink files in hugetlbfs
                      ('existing' by default, no value means 'always')
  --no-huge           Use malloc instead of hugetlbfs
  --no-pci            Disable PCI
  --no-hpet           Disable HPET
  --no-shconf         No shared config (mmap'd files)

EAL Linux options:
  --socket-mem        Memory to allocate on sockets (comma separated values)
  --socket-limit      Limit memory allocation on sockets (comma separated values)
  --huge-dir          Directory where hugetlbfs is mounted
  --file-prefix       Prefix for hugepage filenames
  --create-uio-dev    Create /dev/uioX (usually done by hotplug)
  --vfio-intr         Interrupt mode for VFIO (legacy|msi|msix)
  --vfio-vf-token     VF token (UUID) shared between SR-IOV PF and VFs
  --legacy-mem        Legacy memory mode (no dynamic allocation, contiguous segments)
  --single-file-segments Put all hugepage memory in single files
  --match-allocations Free hugepages exactly as allocated

The expected output, however, is:

./build/l2fwdnv [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W
 -b BURST SIZE: how many pkts x burst to RX
 -d DATA ROOM SIZE: mbuf payload size
 -g GPU DEVICE: GPU device ID
 -m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory
 -n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp
 -p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use
 -s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU
 -t PACKET TIME: force workload time (nanoseconds) per packet
 -v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C
 -w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets, 3: GPU with a persistent CUDA kernel, 4: GPU with CUDA Graphs
 -z WARMUP PKTS: wait this amount of packets before starting to measure performance

I checked the utils.cpp file in the src folder, and it contains the correct getopt options:

void l2fwdnv_usage(const char *prgname)
{
        printf("\n\n%s [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W\n"
               " -b BURST SIZE: how many pkts x burst to RX\n"
               " -d DATA ROOM SIZE: mbuf payload size\n"
               " -g GPU DEVICE: GPU device ID\n"
               " -m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory\n"
               " -n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp\n"
               " -p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use\n"
               " -s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU\n"
               " -t PACKET TIME: force exec time (nanoseconds) per packet\n"
               " -v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C\n"
               " -w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets,>
               " -z WARMUP PKTS: wait this amount of packets before starting to measure performance\n",
               prgname);
}

Therefore, I am compiling the correct source code, but for some reason the resulting binary does not behave correctly.
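For reference, the usage string above follows the usual DPDK convention ./l2fwdnv [EAL options] -- [application options]. In typical DPDK applications, rte_eal_init() parses the arguments before -- and returns how many it consumed, and the application then parses only what follows. The sketch below illustrates that split in plain C++ with POSIX getopt, assuming l2fwd-nv follows this convention. It does not call DPDK, and the names AppOptions, find_separator, parse_app_args and run, as well as the subset of options handled (-b, -g, -w), are hypothetical and not taken from l2fwd-nv's utils.cpp.

```cpp
// Minimal sketch (hypothetical names): split argv at "--" and parse only the
// application options with POSIX getopt. EAL parsing is NOT performed here.
#include <unistd.h>  // getopt, optarg, optind
#include <cassert>
#include <cstdlib>   // std::atoi
#include <cstring>   // std::strcmp
#include <string>
#include <vector>

// Hypothetical holder for three of the options listed in l2fwdnv_usage().
struct AppOptions {
    int burst_size = 0;     // -b BURST SIZE
    int gpu_device = 0;     // -g GPU DEVICE
    int workload_type = 0;  // -w WORKLOAD TYPE
};

// Returns the index of the first "--" in argv, or argc if there is none.
static int find_separator(int argc, char **argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--") == 0) return i;
    }
    return argc;
}

// Parses only the arguments after "--"; everything before it is left for EAL.
// Returns false if there is no "--" or an unknown option is found.
static bool parse_app_args(int argc, char **argv, AppOptions &opts) {
    const int sep = find_separator(argc, argv);
    if (sep == argc) return false;  // nothing reserved for the application

    // New argv: program name followed by the application arguments only.
    std::vector<char *> app_argv{argv[0]};
    for (int i = sep + 1; i < argc; ++i) app_argv.push_back(argv[i]);
    const int app_argc = static_cast<int>(app_argv.size());
    app_argv.push_back(nullptr);  // argv arrays are null-terminated

    optind = 1;  // POSIX: restart getopt scanning at argv[1]
    int c;
    while ((c = getopt(app_argc, app_argv.data(), "b:g:w:")) != -1) {
        switch (c) {
            case 'b': opts.burst_size = std::atoi(optarg); break;
            case 'g': opts.gpu_device = std::atoi(optarg); break;
            case 'w': opts.workload_type = std::atoi(optarg); break;
            default: return false;  // unknown option or missing argument
        }
    }
    return true;
}

// Convenience wrapper: builds a mutable argv from strings and parses it.
bool run(std::vector<std::string> args, AppOptions &opts) {
    assert(!args.empty());  // argv[0] (program name) must be present
    std::vector<char *> argv;
    for (auto &s : args) argv.push_back(s.data());  // non-const data() needs C++17
    argv.push_back(nullptr);
    return parse_app_args(static_cast<int>(args.size()), argv.data(), opts);
}
```

With this sketch, run({"./l2fwdnv", "-h"}, opts) returns false because nothing follows --, while run({"./l2fwdnv", "-l", "0-3", "--", "-b", "64", "-g", "0", "-w", "2"}, opts) returns true with burst_size 64, gpu_device 0 and workload_type 2.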

So, the questions are the following:

How can I download Mellanox OFED 5.4?
Are the warnings I got when running cmake important, and do they need to be solved? If so, how can I fix them?
Why am I getting an error on the __builtin_ia32_serialize function, and how can I solve it?
Why am I getting a totally different help message, and why do the arguments documented in this repository (and in the source file) not work?

Regarding the last question, I suspect that I am running (or compiling) a different example, or a different version of this example, than the one shown in this repository. However, I followed all the steps in the README, so perhaps it is missing a key step needed to compile and run the correct example.

Thank you.
