NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

Inconsistent gdr read latency #243

Open anaanimous opened 1 year ago

anaanimous commented 1 year ago

I have a memory allocator based on gdr that initially allocates a large chunk of gdr memory (e.g. 16 MB) and then hands out pieces of this chunk to subsequent memory requests. During performance benchmarking, I noticed that the read latency for the same memory size fluctuates significantly, and I can't understand why. For example, if I allocate 3 KB of memory, read it 100 times, and then repeat the whole procedure again and again, the average read time alternates between 4.5 us and 70 us (i.e. 4.5 -> 70 -> 4.5 -> 70 ...), even though the same piece of memory is allocated for every batch of 100 reads.
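
For readers following along, the allocation pattern and the per-batch measurement described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the reporter's code: `BumpAllocator`, `batch_means`, and the host `bytearray` standing in for the mapped GPU memory are all hypothetical, so the timings it prints say nothing about gdrcopy itself.

```python
import time

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB chunk, as in the report


class BumpAllocator:
    """Hands out consecutive slices of one pre-allocated chunk (hypothetical)."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        # Assumption: plain host memory stands in for the real GPU mapping.
        self.chunk = memoryview(bytearray(chunk_size))
        self.offset = 0

    def alloc(self, size):
        if self.offset + size > len(self.chunk):
            raise MemoryError("chunk exhausted")
        start = self.offset
        self.offset += size
        return start, self.chunk[start:start + size]

    def reset(self):
        # Release everything so the next alloc reuses the same region.
        self.offset = 0


def batch_means(read_fn, batches=10, reads_per_batch=100):
    """Return the mean latency in microseconds of each batch of reads."""
    means = []
    for _ in range(batches):
        t0 = time.perf_counter_ns()
        for _ in range(reads_per_batch):
            read_fn()
        elapsed_ns = time.perf_counter_ns() - t0
        means.append(elapsed_ns / reads_per_batch / 1000.0)
    return means


if __name__ == "__main__":
    allocator = BumpAllocator()
    dst = bytearray(3 * 1024)
    for run in range(4):
        allocator.reset()
        offset, region = allocator.alloc(3 * 1024)  # same 3 KB region every run

        def read():
            dst[:] = region  # copy the region into a host buffer

        means = batch_means(read, batches=1)
        print(f"run {run}: offset={offset} mean={means[0]:.2f} us")
```

Reporting one mean per batch, rather than a single overall average, is what makes an alternating pattern such as 4.5 -> 70 -> 4.5 visible in the first place.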

Here are some details regarding my settings:

pakmarkthub commented 1 year ago

Hi @anaanimous,

CPU and GPU clocks are usually (but not always) the main cause of performance fluctuation. Can you try the items below and rerun your test?

  1. Fix the CPU clock, or at least set your CPU frequency governor to "performance": sudo cpupower frequency-set -g performance
  2. Please also set the GPU clocks to max.
    
    # To view the max clock values of GPU 0
    $ nvidia-smi -i 0 -q
    ...
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
    ...

    # To set the clocks of GPU 0 to max (-ac takes <memory,graphics> in MHz)
    $ sudo nvidia-smi -i 0 -ac 1593,1410
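
To confirm that the governor setting from item 1 actually took effect, one option is to read it back from the standard Linux cpufreq sysfs files (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`). The helper below is a hedged sketch: `read_governors` is a hypothetical name, and on some virtualized instances these files may not be exposed at all, in which case it returns an empty mapping.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; may be absent on systems
# without cpufreq support (for example, some virtualized instances).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_governors(root=CPUFREQ_ROOT):
    """Map each CPU name (e.g. 'cpu0') to its current scaling governor."""
    root = Path(root)
    if not root.is_dir():
        return {}
    governors = {}
    for path in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            governors[path.parent.parent.name] = path.read_text().strip()
        except OSError:
            continue  # CPU offline or file unreadable
    return governors


if __name__ == "__main__":
    governors = read_governors()
    if not governors:
        print("no cpufreq information exposed on this system")
    for cpu, governor in governors.items():
        flag = "" if governor == "performance" else "  <-- not 'performance'"
        print(f"{cpu}: {governor}{flag}")
```

If the mapping comes back empty, the guest probably has no control over CPU frequency scaling, which is itself useful to know when comparing a cloud instance against a local server.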

anaanimous commented 1 year ago

Thank you for the quick response.

I have been running the benchmark on an AWS instance. However, when I ran the same program on a local server today, the performance was stable. I don't know what's causing the fluctuation on AWS. It may be due to frequency scaling, but I doubt it. Here is why:

I have a benchmark program that measures the read latency for different sizes (1, 2, 4, ..., 1MB), similar to the copylat program, except that I use my allocator and its APIs to allocate memory and perform the reads. If I run this program on a local server, the performance closely matches that of copylat. But on AWS, the read latency for sizes above 512 bytes suddenly increases significantly (e.g. the latency of reading 512 bytes goes from 1.5 us to 12 us). Strangely enough, if I skip only the one-byte read (i.e. if I run the test for 2, 4, ..., 1MB instead of 1, 2, ..., 1MB), the numbers match the copylat output.
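
The comparison described above (a full sweep versus the same sweep without the one-byte read) can be laid out roughly as below. This is an illustrative Python sketch only: `sweep_sizes` and `run_sweep` are hypothetical names, and a host-to-host copy stands in for the actual device read, so it shows the structure of the experiment rather than reproducing the reported latencies.

```python
import time


def sweep_sizes(max_size=1 << 20, skip_first=False):
    """Powers of two from 1 byte to max_size, optionally omitting 1 byte."""
    sizes = []
    size = 1
    while size <= max_size:
        sizes.append(size)
        size *= 2
    return sizes[1:] if skip_first else sizes


def run_sweep(src, sizes, iterations=100):
    """Return {size: mean read latency in microseconds} for each size."""
    dst = bytearray(max(sizes))
    view = memoryview(src)
    results = {}
    for size in sizes:
        t0 = time.perf_counter_ns()
        for _ in range(iterations):
            dst[:size] = view[:size]  # stand-in for the device-to-host read
        elapsed_ns = time.perf_counter_ns() - t0
        results[size] = elapsed_ns / iterations / 1000.0
    return results


if __name__ == "__main__":
    src = bytearray(1 << 20)  # assumption: host stand-in for the mapped GPU region
    full = run_sweep(src, sweep_sizes())
    skipped = run_sweep(src, sweep_sizes(skip_first=True))
    print(f"{'size':>8} {'full (us)':>10} {'skip-1B (us)':>13}")
    for size, latency in skipped.items():
        print(f"{size:>8} {full[size]:>10.3f} {latency:>13.3f}")
```

Running both sweeps in the same process and printing them side by side makes it easier to see at which size, if any, the two columns start to diverge.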

pakmarkthub commented 1 year ago

Let's split this into two topics. The first one is the performance fluctuation, which seems to be resolved now. Depending on how your instance is allocated, my guess is that you might be sharing the host with other instances. I cannot say much about performance predictability if you are not in full control of the entire system. There are many external factors that can affect performance.

The second topic is the read latency jumping to 12 us when reading 512 bytes. Can you share the code? I will try to reproduce this behavior on our system.