Improved DMA adjacency descriptor prefetch function in performance mode

AmarisEx commented 8 months ago

Question In the performance monitoring mode (i.e., when running the perform_hwcount.sh script), the data transmission rate measured through experiments is significantly lower than what is demonstrated in the official Xilinx example (see video link: https://www.youtube.com/watch?v=WcEvAvtXL94). This discrepancy is particularly evident when transmitting data packets ranging from 0 to 4K in size.
Debugging During debugging, the issue was traced back to the driver's lack of DMA descriptor prefetch functionality. Specifically, within the libxdma.c file's xdma_performance_submit function, the filling of the adjacent descriptor count field was not implemented. This results in XDMA operating inefficiently by fetching descriptors one by one instead of retrieving multiple adjacent descriptors at once.

Solution The relevant adjacent descriptor counting logic code has been added to the xdma_performance_submit function in libxdma.c, as follows:

  /* fill in adjacent numbers */
  transfer->desc_adjacent = transfer->desc_num - 1;

  for (i = 0; i < transfer->desc_num; i++) {
      u32 next_adj = xdma_get_next_adj(transfer->desc_num - i - 1,
                      (transfer->desc_virt + i)->next_lo);

      xdma_desc_adjacent(transfer->desc_virt + i, next_adj);
      dbg_desc("prefetch: dma desc addr: %x_%x, control:%x", transfer->desc_virt[i].next_hi, transfer->desc_virt[i].next_lo, transfer->desc_virt[i].control);
  }

Result

MischaBaars commented 8 months ago

I was about the reproduce your results and maybe add a little to it, but it seems the performance.c Mr. Jason Lawley is using differs from the performance.c included with the XDMA drivers. His performance.c shows the Gbytes/s in green, this performance.c only shows '*****'. The link Jason provides in his video is outdated.

Here is the replacement link and some useful stuff: 68049 - DMA Subsystem for PCI Express (Vivado 2016.3) - Performance Numbers Theoretical maximum read transaction throughout is too low Understanding Performance of PCI Express Systems (WP350)

What performance.c did you use to plot the graph? How do you compute the data rate?

AmarisEx commented 8 months ago

I was about the reproduce your results and maybe add a little to it, but it seems the performance.c Mr. Jason Lawley is using differs from the performance.c included with the XDMA drivers. His performance.c shows the Gbytes/s in green, this performance.c only shows '*****'. The link Jason provides in his video is outdated.

Here is the replacement link and some useful stuff: 68049 - DMA Subsystem for PCI Express (Vivado 2016.3) - Performance Numbers Theoretical maximum read transaction throughout is too low Understanding Performance of PCI Express Systems (WP350)

What performance.c did you use to plot the graph? How do you compute the data rate?

I am extremely grateful for the links you provided, as they have been instrumental in helping me understand the calculation of rates in PCI Express (PCIe). I have also encountered the issue of the existing performance.c file only displaying '*****' instead of specific performance data when I attempt to use it.

Below is an example of a performance.c script that I have written myself. It includes the method I use to calculate the data transfer rate. I hope this example will be useful to you. If you find any errors or areas for improvement, please do not hesitate to point them out. Thank you very much.

/*
 * This file is part of the Xilinx DMA IP Core driver tools for Linux
 *
 * Copyright (c) 2016-present,  Xilinx, Inc.
 * All rights reserved.
 *
 * This source code is licensed under BSD-style license (found in the
 * LICENSE file in the root directory of this source tree)
 */

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <limits.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>

#include "../xdma/cdev_sgdma.h"

struct xdma_performance_ioctl perf;

static struct option const long_opts[] =
{
  {"device", required_argument, NULL, 'd'},
  {"count", required_argument, NULL, 'c'},
  {"size", required_argument, NULL, 's'},
  {"freq", required_argument, NULL, 'f'},
  {"incremental", no_argument, NULL, 'i'},
  {"non-incremental", no_argument, NULL, 'n'},
  {"verbose", no_argument, NULL, 'v'},
  {"help", no_argument, NULL, 'h'},
  {0, 0, 0, 0}
};

static void usage(const char* name)
{
  int i = 0;
  printf("%s\n\n", name);
  printf("usage: %s [OPTIONS]\n\n", name);
  printf("Performance test for XDMA SGDMA engine.\n\n");

  printf("  -%c (--%s) device\n", long_opts[i].val, long_opts[i].name); i++;
  printf("  -%c (--%s) incremental\n", long_opts[i].val, long_opts[i].name); i++;
  printf("  -%c (--%s) non-incremental\n", long_opts[i].val, long_opts[i].name); i++;
  printf("  -%c (--%s) xdma axi clock frequency\n", long_opts[i].val, long_opts[i].name); i++;
  printf("  -%c (--%s) be more verbose during test\n", long_opts[i].val, long_opts[i].name); i++;
  printf("  -%c (--%s) print usage help and exit\n", long_opts[i].val, long_opts[i].name); i++;
}

static uint32_t getopt_integer(char *optarg)
{
  int rc;
  uint32_t value;
  rc = sscanf(optarg, "0x%x", &value);
  if (rc <= 0)
    rc = sscanf(optarg, "%ul", &value);
  return value;
}

int test_dma(char *device_name, int size, int count, int freq);

static int verbosity = 0;

int main(int argc, char *argv[])
{
  int cmd_opt;
  char *device = "/dev/xdma/card0/h2c0";
  uint32_t size = 32768;
  uint32_t count = 1;
  uint32_t freq = 250;  // AXI Clock Frequency MHz
  char *filename = NULL;

  while ((cmd_opt = getopt_long(argc, argv, "vhic:d:s:f:", long_opts, NULL)) != -1)
  {
    switch (cmd_opt)
    {
      case 0:
        /* long option */
        break;
      case 'v':
        verbosity++;
        break;
      /* device node name */
      case 'd':
        printf("'%s'\n", optarg);
        device = strdup(optarg);
        break;
      /* transfer size in bytes */
      case 's':
        size = getopt_integer(optarg);
        break;
      /* count */
      case 'c':
        count = getopt_integer(optarg);
        break;
      /* xdma axi freq (maybe 250MHz) */
      case 'f':
        freq = getopt_integer(optarg);
        break;
      /* print usage help and exit */
      case 'h':
      default:
        usage(argv[0]);
        exit(0);
        break;
    }
  }
  printf("\tdevice = %s, size = 0x%08x, count = %u\n", device, size, count);
  test_dma(device, size, count, freq);

}

int test_dma(char *device_name, int size, int count, int freq)
{
  int rc = 0;
  int fd = open(device_name, O_RDWR);
  if (fd < 0) {
      printf("\tFAILURE: Could not open %s. Make sure xdma device driver is loaded and you have access rights (maybe use sudo?).\n", device_name);
      exit(1);
  }

  unsigned char status = 1;

  perf.version = IOCTL_XDMA_PERF_V1;
  perf.transfer_size = size;
  rc = ioctl(fd, IOCTL_XDMA_PERF_START, &perf);
#if 1
  while (count--) {

    sleep(2);
    rc = ioctl(fd, IOCTL_XDMA_PERF_GET, &perf);
  }
#endif
  rc = ioctl(fd, IOCTL_XDMA_PERF_STOP, &perf);
  long long data_transferred = (long long)perf.transfer_size * (long long)perf.iterations;
  long double data_rate = 0;
  long long data_duty_cycle = 0;

  if ((long long)perf.clock_cycle_count != 0) {
    data_rate = (long double) data_transferred * (long double) freq / (long double)perf.clock_cycle_count / 1000;
    data_duty_cycle = (long long)perf.data_cycle_count * 100 / (long long)perf.clock_cycle_count;
  }
  printf("\t \033[;32m Date Rate = %Lf GBytes/s\033[0m, data duty cycle = %lld%%\n\n", data_rate, data_duty_cycle);

  close(fd);
}

MischaBaars commented 8 months ago

Great! Thank you so much for the starting point. Let me get back to you! I have a lot of work to do :)

One thing: When I'm using Lane Width X8, Maximum Link Speed 8.0 GT/s, and I don't touch the number of channels, I get a data rate of 7.39 GBytes/s at 256K while I see only number below 6.00 GBytes/s in your graph. What settings did you use for the IP Example Design's PCIe Interface?

AmarisEx commented 8 months ago

Great! Thank you so much for the starting point. Let me get back to you! I have a lot of work to do :)

One thing: When I'm using Lane Width X8, Maximum Link Speed 8.0 GT/s, and I don't touch the number of channels, I get a data rate of 7.39 GBytes/s at 256K while I see only number below 6.00 GBytes/s in your graph. What settings did you use for the IP Example Design's PCIe Interface?

This is also my confusion. Instead of using the IP Example Design, I built an XDMA-DDR3 transmission path for testing The main configurations of my project are as follows:

VC709 Board
X8, 100MHz PCIe ref clock
AXI Memory Mapped, 250MHz
Four channels but only one used in test
MSI-X

MischaBaars commented 8 months ago

Ok. Here's what I think:

When looking at the performance.c source code, you'll find a sleep(2) in the loop that makes the script fail when it is removed. Making code depend on sleep() is never a good idea. God knows what the kernel driver is doing in the meantime.

A more realistic scenario is to measure the PCIe throughput while actually writing and reading data. The throughput will be dependent on the both the PCIe settings and the (B)RAM specifications, and is probably for that reason significantly lower than the 'thoughput' as seen from the perform_hwcount.sh script.

I did not want to disappoint you, but after benchmarking the bitstreams for all possible different PCIe settings, I have to say that I do not see an apparent difference between the original and the patched kernel driver's performance. The lighter colors are from the patched kernel driver, the darker colors are from the original kernel driver.

dma_ip_drivers_commit_264

Maybe it's worth noticing is that the benchmark of my Kingston Fury Renegade PCIe 4.0 x4 SSD this morning, showed a maximum read speed of 17.07 Gbytes/s and and maximum write speed of 2.28 Gbytes/s. According to the specifications that should have been 7.3 Gbytes/s read speed and 7.3 Gbytes/s write speed. But, if I'm not mistaken PCIe 4.0 x4 should perform exactly the same as PCIe 3.0 x8. Well, it doesn't, as you can see from the graph :) BRAM and DDR are essential to the performance of the design as a whole, and in my case the bottleneck if I'm not mistaken.

MischaBaars commented 8 months ago

I almost forgot.

So the question is, where did you learn about descriptor prefetching?

AmarisEx commented 8 months ago

Ok. Here's what I think:

When looking at the performance.c source code, you'll find a sleep(2) in the loop that makes the script fail when it is removed. Making code depend on sleep() is never a good idea. God knows what the kernel driver is doing in the meantime.

A more realistic scenario is to measure the PCIe throughput while actually writing and reading data. The throughput will be dependent on the both the PCIe settings and the (B)RAM specifications, and is probably for that reason significantly lower than the 'thoughput' as seen from the perform_hwcount.sh script.

I did not want to disappoint you, but after benchmarking the bitstreams for all possible different PCIe settings, I have to say that I do not see an apparent difference between the original and the patched kernel driver's performance. The lighter colors are from the patched kernel driver, the darker colors are from the original kernel driver.

Maybe it's worth noticing is that the benchmark of my Kingston Fury Renegade PCIe 4.0 x4 SSD this morning, showed a maximum read speed of 17.07 Gbytes/s and and maximum write speed of 2.28 Gbytes/s. According to the specifications that should have been 7.3 Gbytes/s read speed and 7.3 Gbytes/s write speed. But, if I'm not mistaken PCIe 4.0 x4 should perform exactly the same as PCIe 3.0 x8. Well, it doesn't, as you can see from the graph :) BRAM and DDR are essential to the performance of the design as a whole, and in my case the bottleneck if I'm not mistaken.

I forgot one important point: whether you checked the Extended Tag Field of the XDMA configuration. In my experiments, if this option is unchecked, then patched performance.c does not improve the rate.

AmarisEx commented 8 months ago

I almost forgot.

So the question is, where did you learn about descriptor prefetching?

The so-called DMA prefetch means that when the DMA engine fetches descriptors from the host buffer, it not only retrieves the current descriptor, but also retrieves more descriptors based on the adjacent descriptor count field.

The concept didn't come from anywhere, it was just summarized for the sake of describing the matter.

MischaBaars commented 8 months ago

Let me work on it a little.

I don't trust that sleep(2). Engine registers perf_pnd_hi and perf_pnd_lo are undocumented for example.

I'll get back to you.

MischaBaars commented 8 months ago

And yes, the Extended Tag Field was already checked.

MischaBaars commented 7 months ago

I believe that descriptor prefetching was already implemented for DMA transfers initiated through sgdma_fops.write.

Besides from xdma_performance_submit, xdma_get_next_adj was already being called from engine_start, transfer_init and xdma_xfer_aperture.

I have been testing against DMA transfers that do not make use of your modified xdma_performance_submit , but I have not been testing against an engine that did not already support descriptor prefetching, am I correct?

Xilinx / dma_ip_drivers

Improved DMA adjacency descriptor prefetch function in performance mode #264