NVlabs / tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

Benchmarking #22

Open rmbrualla opened 2 years ago

rmbrualla commented 2 years ago

Hi! First, thanks for sharing this! It's super impressive.

I'm trying to benchmark tiny-cuda-nn with clang-cuda, and I'd like to compare it with the numbers in the graph in the README.md. What were the parameters used to generate that graph? Is it just running both benchmarks on 'data/config.json' and changing the number of neurons from 128 to 64?

Thanks!

Tom94 commented 2 years ago

Yes, that's correct! The command line was

tiny-cuda-nn> .\build\bench_image_ours.exe .\data\images\albert.exr .\data\config.json

with n_neurons: 128 and n_neurons: 64, respectively.
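In case it's useful, here's a small sketch (not part of the repo) for writing out the two config variants from data/config.json. The network / n_neurons key path is an assumption based on the sample config, so double-check it against your copy:

# Hypothetical helper: derive the 64- and 128-neuron benchmark configs
# from data/config.json. The "network"/"n_neurons" key path is assumed;
# adjust it to match the actual layout of your config file.
import json
from pathlib import Path

base = json.loads(Path("data/config.json").read_text())

for width in (64, 128):
    cfg = dict(base)
    cfg["network"] = dict(base.get("network", {}))
    cfg["network"]["n_neurons"] = width
    out = Path(f"data/config_{width}.json")
    out.write_text(json.dumps(cfg, indent=4))
    print(f"wrote {out}")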

The benchmark was run on Windows / MSVC 2019 / CUDA 11.3. The fan speed and power envelope of the GPU were also cranked to 100% and 114%, respectively, to minimize the impact of dynamic clocking. Unfortunately, the artificial 10-second pauses in between the measurements aren't quite enough to work around this in all cases. It's best to monitor GPU clock and temperature (e.g. using MSI Afterburner) to confirm.
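On Linux (or if you'd rather script it than use Afterburner), a minimal sketch using NVIDIA's NVML Python bindings can log clock and temperature alongside the benchmark; this assumes the nvidia-ml-py package is installed and isn't something the repo provides:

# Log the SM clock and GPU temperature once per second while the benchmark
# runs, to confirm that dynamic clocking isn't skewing the measurements.
import time
import pynvml  # provided by the nvidia-ml-py package (an assumption)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
try:
    while True:
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"SM clock: {sm_clock} MHz, temperature: {temp} C")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()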

rmbrualla commented 2 years ago

Thanks for clarifying!

I'm getting somewhat confusing results, though. I had issues building the project in my environment: it is linked against CUTLASS 2.3, and some loop unrolling failed. Unfortunately, it's hard to pinpoint which loop unroll failed.

In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

Also, what is the extent of the modifications to CUTLASS with respect to the latest version available on GitHub? I saw the PreReLU options in GemmShape, but those are only used for the ResNet, so I ignored them.

Tom94 commented 2 years ago

> In any case, I observe lower performance than yours, except for the case of neurons=128, where I get 2x throughput, which is actually faster than the case of neurons=64 (close to 1e9 elements per second). Maybe there is a bug in my patches, I haven't checked for correctness. I also haven't looked into the profiler carefully -- I'm guessing some of the kernels are spilling. I am benchmarking on a 3090 without any power/fan tricks.

It might be worth verifying the correctness of the results (are the output images trained correctly?) to see whether something wrong under the hood is affecting the performance numbers. As you say, 2x higher throughput sounds too good to be true. :)
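As a rough correctness check, something along these lines could compare whatever image the benchmark writes out against the reference EXR; this is just a sketch, and the learned_output.exr file name is a placeholder -- check what bench_image_ours actually writes to disk:

# Compare the learned image against the reference EXR and report the MSE.
# The output file name is a placeholder; the env var below must be set
# before importing cv2 to enable its OpenEXR codec in some builds.
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"

import cv2
import numpy as np

flags = cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH
reference = cv2.imread("data/images/albert.exr", flags)
learned = cv2.imread("learned_output.exr", flags)  # placeholder name

mse = float(np.mean((reference.astype(np.float32) - learned.astype(np.float32)) ** 2))
print(f"MSE vs. reference: {mse:.6f}")  # should be small if training converged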

Also, if your assessment is based on the console output rather than the emitted .json files, it's worth double-checking the ordering: the program first benchmarks CutlassMLP, which is expected to be slower than the numbers in the README graph, before benchmarking FullyFusedMLP. It also interleaves training and inference. In pseudocode, the ordering is:

for network in ["CutlassMLP", "FullyFusedMLP"]:
    for batch_size in [2**i for i in range(14, 21)]:  # batch sizes 2^14 through 2^20
        bench_training_speed(network, batch_size)
        bench_inference_speed(network, batch_size)
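If you do go by the emitted .json files instead, a throwaway script along these lines can dump whatever numeric fields they contain so the CutlassMLP and FullyFusedMLP runs don't get mixed up; the glob pattern and field handling below are generic assumptions rather than the exact output schema:

# Dump numeric fields from any benchmark result .json files in the current
# directory. The file locations and schema are assumptions -- adjust them
# to match what bench_image_ours actually emits.
import json
from pathlib import Path

for path in sorted(Path(".").glob("*.json")):
    data = json.loads(path.read_text())
    print(path.name)
    for key, value in data.items():
        if isinstance(value, (int, float)):
            print(f"  {key}: {value}")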
Tom94 commented 2 years ago

> Also, what is the extent of the modifications to CUTLASS with respect to the latest version available on GitHub? I saw the PreReLU options in GemmShape, but those are only used for the ResNet, so I ignored them.

I haven't actually followed CUTLASS development for a while, but the PreReLU option is indeed the only change I remember making at the time.