NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

failed on WeightOnly test on V100 #1155

Open EatenBagpipe opened 6 months ago

EatenBagpipe commented 6 months ago

System Info

Who can help?

@Tracin

Information

Tasks

Reproduction

# prepare: build and start the development container
make -C docker build
make -C docker run
# compile: "70-real" targets the Volta architecture (SM70, e.g. V100)
python3 ./scripts/build_wheel.py --cuda_architectures "70-real" --trt_root /usr/local/tensorrt -j 100
# build and run the C++ kernel unit test
cd cpp/build
make -j
./tests/weightOnlyKernelTest

Expected behavior

It should pass all perChannel tests and raise an exception when running the fine-grained quantization tests.

Actual behavior

[----------] 1 test from Kernel
[ RUN      ] Kernel.WeightOnly
benchmark mnk (1, 512, 512) FP16 Activation Int8b PerChannel Weight Only
max diff 3538.000000 (diff threshold 29.460938), avg diff 520.324951, diff cnt 512/512
cuda kernel cost time 0.003994, cutlass kernel cost time 0.013175, cuda speedup 3.299
/workspace/TensorRT-LLM/cpp/tests/kernels/weightOnly/weightOnlyKernelTest.cpp:455: Failure
Value of: pass
  Actual: false
Expected: true
benchmark mnk (1, 512, 512) FP16 Activation Int4b PerChannel Weight Only
max diff 171.562500 (diff threshold 20.800781), avg diff 30.527262, diff cnt 512/512
cuda kernel cost time 0.003857, cutlass kernel cost time 0.012971, cuda speedup 3.363
/workspace/TensorRT-LLM/cpp/tests/kernels/weightOnly/weightOnlyKernelTest.cpp:458: Failure
Value of: pass
  Actual: false
Expected: true
benchmark mnk (1, 512, 512) BF16 Activation Int8b PerChannel Weight Only
max diff 1720.000000 (diff threshold 40.312500), avg diff 338.732117, diff cnt 512/512
cuda kernel cost time 0.002423, cutlass kernel cost time 0.015189, cuda speedup 6.268
/workspace/TensorRT-LLM/cpp/tests/kernels/weightOnly/weightOnlyKernelTest.cpp:462: Failure
Value of: pass
  Actual: false
Expected: true
benchmark mnk (1, 512, 512) BF16 Activation Int4b PerChannel Weight Only
max diff 162.000000 (diff threshold 60.750000), avg diff 23.341923, diff cnt 512/512
cuda kernel cost time 0.002458, cutlass kernel cost time 0.015292, cuda speedup 6.222
/workspace/TensorRT-LLM/cpp/tests/kernels/weightOnly/weightOnlyKernelTest.cpp:465: Failure
Value of: pass
  Actual: false
Expected: true
benchmark mnk (1, 512, 512) FP16 Activation Int8b GroupWise64 Weight Only
unknown file: Failure
C++ exception with description "[TensorRT-LLm Error][filter_and_run_mixed_gemm] Cutlass fpA_intB gemm not implemented for arch 70 with finegraind weight-only quantization." thrown in the test body.
[  FAILED  ] Kernel.WeightOnly (2447 ms)
[----------] 1 test from Kernel (2447 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2447 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Kernel.WeightOnly

Additional notes

I compiled TensorRT-LLM from source on a V100 server and ran the WeightOnly unit test, but it failed. Although I am aware that the Volta architecture does not support fine-grained weight-only quantization, I expected the perChannel tests to pass.
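For context, a minimal sketch of the difference between the two granularities (illustrative code, not TensorRT-LLM APIs): per-channel quantization keeps one scale per output channel, while fine-grained (group-wise) quantization keeps one scale per group of groupSize weights along the k dimension; only the latter is unsupported on Volta.

#include <vector>

// Number of quantization scales for a k x n weight matrix under each mode.
std::vector<float> makeScales(int k, int n, int groupSize)
{
    if (groupSize == 0)
    {
        // per-channel: one scale per output channel
        return std::vector<float>(n);
    }
    // fine-grained / group-wise: one scale per groupSize-element group along k
    return std::vector<float>((k / groupSize) * n);
}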

Tracin commented 6 months ago

You can restrict gss to zero only: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tests/kernels/weightOnly/weightOnlyKernelTest.cpp#L444
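A hedged sketch of that suggestion, assuming the test loops over a list of group sizes at the linked line (the actual code there may differ; the variable and helper names are illustrative):

#include <vector>

// Hypothetical stand-in for the per-case routine in weightOnlyKernelTest.cpp.
void benchmarkAndVerify(int m, int n, int k, int gs);

void runPerChannelOnly(int m, int n, int k)
{
    // keep only gss == 0 (per-channel); drop fine-grained sizes such as 64/128
    std::vector<int> gss{0};
    for (int gs : gss)
    {
        benchmarkAndVerify(m, n, k, gs);
    }
}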

EatenBagpipe commented 6 months ago

@Tracin Thanks for your reply. In the first batch of test cases shown in the actual behavior section, the condition gss == 0 is already satisfied, yet there is still a significant difference between the results of CudaKernel and CutlassKernel.

I found that the CudaKernel implementation assumes by default that the weight layout is ColumnMajorTileInterleave, whereas the CutlassKernel used on sm70 devices uses a ColumnMajor layout. This difference leads to the inconsistent results.

Maybe support for the ColumnMajor layout could be added to weightOnlyKernelTest, or the test could simply be skipped on sm70 devices; a sketch of the skip option follows.
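A minimal sketch of the skip option, assuming the test stays GoogleTest-based (as the output above shows); the guard, test name, and message are illustrative, not the actual fix:

#include <cuda_runtime.h>
#include <gtest/gtest.h>

namespace
{
int getSmVersion()
{
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.major * 10 + prop.minor; // 70 on V100
}
} // namespace

TEST(Kernel, WeightOnlySkipExample)
{
    if (getSmVersion() == 70)
    {
        // The CUDA kernel assumes a ColumnMajorTileInterleave weight layout,
        // but the CUTLASS path on SM70 uses plain ColumnMajor, so the two
        // results cannot be compared directly.
        GTEST_SKIP() << "weight-only layout mismatch on SM70";
    }
    // ... existing test body would run here on supported architectures ...
}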

Tracin commented 6 months ago

@Barry-Delaney Hi Barry, Can you help with this? Thanks!

Barry-Delaney commented 6 months ago

Hi @EatenBagpipe, thanks for the feedback! SM70 support for the weight-only CUDA kernels is in progress and will land in the next few code updates.