flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Faster compile/ci #305

Closed Qubitium closed 2 weeks ago

Qubitium commented 2 weeks ago

The nvcc compilation profile has changed drastically now that gqa_group_size is a runtime input argument and no longer a template parameter.
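For context on why this shifts the compile profile: with a templated group size, nvcc has to build a separate copy of each kernel for every supported value (and per target arch), whereas a runtime argument needs only a single instantiation. The sketch below is purely illustrative, not the actual FlashInfer kernel code; the kernel and dispatcher names are made up.

```cpp
// Illustrative only -- not the actual FlashInfer kernels.
#include <cstdint>

// (a) Compile-time group size: the dispatcher must enumerate every supported
//     value, so nvcc compiles a separate kernel body for each one (per arch).
template <uint32_t GROUP_SIZE>
__global__ void attn_kernel_tmpl(const float* q, const float* k, float* o) {
#pragma unroll
  for (uint32_t i = 0; i < GROUP_SIZE; ++i) {
    // ... per-group work, loop fully unrolled at compile time ...
  }
}

void dispatch_tmpl(uint32_t group_size, const float* q, const float* k, float* o) {
  switch (group_size) {
    case 1: attn_kernel_tmpl<1><<<1, 128>>>(q, k, o); break;
    case 2: attn_kernel_tmpl<2><<<1, 128>>>(q, k, o); break;
    case 4: attn_kernel_tmpl<4><<<1, 128>>>(q, k, o); break;
    case 8: attn_kernel_tmpl<8><<<1, 128>>>(q, k, o); break;
  }
}

// (b) Runtime group size: one instantiation covers all values, so there is far
//     less device code for nvcc to build, at the cost of runtime loop bounds.
__global__ void attn_kernel_runtime(uint32_t group_size, const float* q,
                                    const float* k, float* o) {
  for (uint32_t i = 0; i < group_size; ++i) {
    // ... per-group work ...
  }
}
```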

This PR improves compile time by ~20% on my dev machine. Results may vary across environments, but I expect a net positive overall.

Env: i9-13900K, P-cores at 5.6 GHz + E-cores at 4.0 GHz (both overclocked), 32 hardware threads total. Test: run scripts/run-ci-build-wheel.sh and time compilation to step (20) completion.

env FLASHINFER_CI_PYTHON_VERSION=3.11 FLASHINFER_CI_TORCH_VERSION=2.3.1 FLASHINFER_CI_CUDA_VERSION=12.4 FLASHINFER_BUILD_VERSION=0.0.4 TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"
| nvcc_threads | MAX_JOBS | Time to step 20 | Notes |
| --- | --- | --- | --- |
| 8 | 16 | 41.01s | current default |
| 2 | 16 | 41.21s | |
| 1 | 16 | 50.97s | |
| 4 | 16 | 40.83s | |
| 4 | 8 | 1m15s | |
| 1 | 32 | 32s | fastest (this PR) |
| 2 | 32 | 38s | |

Based on these tests, main now favors more parallel compile jobs (processes) over nvcc threads.
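Roughly speaking, the two knobs multiply: the build runs up to MAX_JOBS parallel compiler processes, and each nvcc invocation can additionally parallelize its per-arch compilation steps with --threads (-t), so their product should stay near the machine's hardware thread count. Below is a minimal, hypothetical helper (not part of this PR or the build scripts) that prints the split these results favor.

```cpp
// Hypothetical helper, not part of the FlashInfer build: compute a
// MAX_JOBS / nvcc --threads split whose product tracks the hardware threads.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
  const unsigned hw_threads = std::max(1u, std::thread::hardware_concurrency());
  // Per the benchmarks above, main now favors many jobs over nvcc threads.
  const unsigned nvcc_threads = 1;
  const unsigned max_jobs = std::max(1u, hw_threads / nvcc_threads);
  // e.g. on a 32-thread machine: MAX_JOBS=32, nvcc --threads 1
  std::printf("MAX_JOBS=%u nvcc --threads %u\n", max_jobs, nvcc_threads);
  return 0;
}
```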