NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[BUG]: cuda::ptx takes long to compile #2933

Closed ahendriksen closed 5 hours ago

ahendriksen commented 3 days ago

Is this a duplicate?

Type of Bug

Performance

Component

libcu++

Describe the bug

Including <cuda/ptx> adds ~800 ms to the compile time on my workstation.

How to Reproduce

Compare the compile time of an empty file, a file including cuda/ptx, and a file including cuda/std/__type_traits/integral_constant.h (which is included from cuda/ptx):

$ echo "" > empty.cu
$ echo "#include <cuda/ptx>" > cuda_ptx.cu
$ echo "#include <cuda/std/__type_traits/integral_constant.h>" > cuda_std_integral_constant.cu
$ hyperfine --warmup 1 'nvcc -arch sm_90a -x cu -c empty.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o'

Benchmark 1: nvcc -arch sm_90a -x cu -c empty.cu -o test.o
  Time (mean ± σ):      1.434 s ±  0.012 s    [User: 1.070 s, System: 0.368 s]
  Range (min … max):    1.414 s …  1.455 s    10 runs

Benchmark 2: nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o
  Time (mean ± σ):      2.299 s ±  0.022 s    [User: 1.861 s, System: 0.442 s]
  Range (min … max):    2.275 s …  2.339 s    10 runs

Benchmark 3: nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o
  Time (mean ± σ):      2.131 s ±  0.023 s    [User: 1.709 s, System: 0.426 s]
  Range (min … max):    2.098 s …  2.167 s    10 runs
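Subtracting the empty-file baseline from the mean times above makes the per-header overhead explicit. The following is only a sketch of that arithmetic; the constant and function names are made up for illustration, and the numbers are the means reported by hyperfine, so they carry the ± noise shown above.

```cpp
// Mean wall-clock times from the hyperfine output above, in milliseconds.
constexpr int empty_ms             = 1434;  // empty.cu
constexpr int cuda_ptx_ms          = 2299;  // cuda_ptx.cu
constexpr int integral_constant_ms = 2131;  // cuda_std_integral_constant.cu

// Overhead of a translation unit relative to the empty-file baseline.
constexpr int overhead_ms(int measured_ms) { return measured_ms - empty_ms; }

// Share of the <cuda/ptx> overhead that is already paid when including
// only integral_constant.h, in percent (integer division, rounded down).
constexpr int shared_percent() {
  return overhead_ms(integral_constant_ms) * 100 / overhead_ms(cuda_ptx_ms);
}

static_assert(overhead_ms(cuda_ptx_ms) == 865, "cuda/ptx adds 865 ms");
static_assert(overhead_ms(integral_constant_ms) == 697, "integral_constant.h adds 697 ms");
static_assert(shared_percent() == 80, "roughly 80% of the overhead is shared");
```

Under these numbers, roughly 80% of the extra time is already incurred by including integral_constant.h alone, which suggests that most of the cost is not specific to cuda/ptx itself.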

Expected behavior

This should not be a heavy header.

Reproduction link

No response

Operating System

Ubuntu Linux 22.04

nvidia-smi output

NA

NVCC version

The benchmark was performed using a prerelease version of nvcc, but it should be reproducible with any recent version.

ahendriksen commented 3 days ago

I have attached a trace of the compile time. It can be viewed in perfetto.dev.

Turns out that a large portion of the time is spent preprocessing the CUDA fp16 and bf16 headers. They are transitively included as follows:

cuda_ptx.json

bernhardmgruber commented 3 days ago

Yep, looks like the extended FP type headers are quite expensive, but since they are included as part of the CCCL config, they will affect each translation unit. @miscco could we consider only defining _CCCL_HAS_NVFP16 and _CCCL_HAS_NVBF16 in the CCCL config headers and leaving it up to downstream libraries and users to include the corresponding headers themselves?
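A rough sketch of what this split could look like, not the actual CCCL implementation: the config side only records whether the fp16 header is available, and the translation unit that needs it includes it explicitly. All DEMO_* names are hypothetical stand-ins (the real macro discussed above is _CCCL_HAS_NVFP16), and detection via __has_include is an assumption made for the sketch.

```cpp
// --- Config side (hypothetical): detect availability, do NOT include ---
#if defined(__has_include)
#  if __has_include(<cuda_fp16.h>)
#    define DEMO_HAS_NVFP16 1  // stand-in for _CCCL_HAS_NVFP16
#  endif
#endif

// --- User side: opt in only in translation units that need fp16 ---
// DEMO_USE_FP16 is a hypothetical switch the user defines before this point.
#if defined(DEMO_HAS_NVFP16) && defined(DEMO_USE_FP16)
#  include <cuda_fp16.h>  // the expensive header is paid for only here
#  define DEMO_FP16_INCLUDED 1
#else
#  define DEMO_FP16_INCLUDED 0
#endif

// Reports whether this translation unit pulled in the fp16 header.
constexpr bool fp16_header_included() { return DEMO_FP16_INCLUDED != 0; }
```

With this layout, a translation unit that never defines DEMO_USE_FP16 does not preprocess the fp16 header at all, while code that needs __half opts in explicitly. The same pattern would apply to the bf16 header.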

miscco commented 3 days ago

yeah that would definitely be better