NVIDIA / nccl-tests

NCCL Tests

Got different results on same devices and same tests #104

Closed: HaoKang-Timmy closed this issue 2 years ago

HaoKang-Timmy commented 2 years ago

I want to measure the bandwidth of send and recv. First I ran

 ./build/sendrecv_perf -b 8 -e 196M -g 2 -d uint8 -i 1M

in my terminal. The result for the 37748744-byte size is:

#                                               out-of-place                       in-place          
#       size         count      type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)  
    37748744      37748744     uint8    11732    3.22    3.22  0e+00    11698    3.23    3.23  3e+02

If I instead run

./build/sendrecv_perf -b 36M -e 36M -g 2 -d uint8

the result is:

# nThread 1 nGpus 2 minBytes 37748736 maxBytes 37748736 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   5897 on         x0 device  0 [0x05] NVIDIA GeForce GTX 1080
#   Rank  1 Pid   5897 on         x0 device  1 [0x06] NVIDIA GeForce GTX 1080
#
#                                               out-of-place                       in-place          
#       size         count      type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    37748736      37748736     uint8   6485.7    5.82    5.82  0e+00   6496.1    5.81    5.81  3e+02
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.81563 

Why are there two different bandwidths for the same input size?
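For reference, the two rows can be checked with a little arithmetic. The sketch below assumes the sweep visits sizes starting at `-b` and growing by the `-i` step (which matches the "step: 1048576(bytes)" header above), and that algbw is simply bytes divided by elapsed time with 1 GB = 1e9 bytes. The helper names are hypothetical and are not part of nccl-tests.

```python
MIB = 1 << 20  # 1048576 bytes, the step size used in both runs


def swept_sizes(begin, end, step):
    """Sizes visited when stepping from begin to end (inclusive) by a fixed increment."""
    return range(begin, end + 1, step)


def algbw_gbps(size_bytes, time_us):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / (time_us * 1e-6) / 1e9


# First run: -b 8 -e 196M -i 1M
first_run = swept_sizes(8, 196 * MIB, MIB)

print(37748744 in first_run)   # the size reported in the first table
print(36 * MIB in first_run)   # the exact 36M size used in the second run
print(round(algbw_gbps(37748744, 11732), 2))    # first run, out-of-place
print(round(algbw_gbps(37748736, 6485.7), 2))   # second run, out-of-place
```

Under these assumptions, the first run never tests exactly 36M: its closest size is 36M plus the initial 8 bytes. The reported times are consistent with the reported bandwidths, so the gap comes from the measured time, not from how the bandwidth column is computed.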

sjeaugey commented 2 years ago

That's a good question; in theory, they should not differ.

I see two possible reasons:

  • Your performance is unstable. That should be easy to check in the first run (does the performance increase progressively or not?).
  • The NCCL perf tests increment the offset in the buffer for each test, so since the first test is 8B, all the subsequent tests are misaligned and performance is reduced. If that's the case, replacing -b 8 with -b 1M should solve the issue.

HaoKang-Timmy commented 2 years ago

Thank you. When I change 8 to 1M, the result becomes more reasonable.
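
The misalignment explanation given in this thread can be illustrated with a rough model. The sketch below is an assumption for illustration, not the actual nccl-tests code: it supposes each test starts at the buffer offset where the previous one ended, and it uses a hypothetical 128-byte alignment boundary. The function name and the `ALIGN` constant are made up for this example.

```python
MIB = 1 << 20
ALIGN = 128  # hypothetical alignment boundary, chosen only for illustration


def count_misaligned_offsets(begin, end, step, align=ALIGN):
    """Count tests whose starting offset is not a multiple of `align`.

    Simplified model (an assumption, not the real nccl-tests logic):
    the offset grows by the size of every test that has already run.
    """
    offset = 0
    misaligned = 0
    for size in range(begin, end + 1, step):
        if offset % align != 0:
            misaligned += 1
        offset += size  # next test starts where this one ended
    return misaligned


# Sweep starting at 8 bytes (-b 8 -e 196M -i 1M)
print(count_misaligned_offsets(8, 196 * MIB, MIB))

# Sweep starting at 1M (-b 1M -e 196M -i 1M)
print(count_misaligned_offsets(MIB, 196 * MIB, MIB))
```

In this model, the leading 8-byte test shifts most later offsets off the boundary, while a sweep that starts at 1M keeps every offset a multiple of 1M. The real alignment requirement and offset logic may differ, but the effect is in line with the improvement reported after switching to `-b 1M`.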