The nvcc compilation profile has changed drastically now that gqa_group_size is an input argument rather than a template parameter.
This PR improves compile time by ~20% on my dev machine. Results may vary across environments, but I expect a net positive overall.
Env: 13900K, P-cores at 5.6GHz and E-cores at 4.0GHz (both overclocked), 32 hardware threads in total.
TEST: run scripts/run-ci-build-wheel.sh and time the compile until step(20) completes.
Based on these tests, main now favors parallel processes/jobs over threads for nvcc.
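
To illustrate the trade-off, here is a minimal sketch of how a fixed hardware-thread budget could be divided between parallel build jobs and threads per nvcc invocation (nvcc's `--threads` option). The helper name `split_build_parallelism` and its default are hypothetical and not part of this PR; the PR's actual settings are not reproduced here.

```python
def split_build_parallelism(hw_threads: int, nvcc_threads: int = 1) -> tuple[int, int]:
    """Split a hardware-thread budget into (jobs, nvcc_threads).

    Hypothetical helper for illustration only. It keeps
    jobs * nvcc_threads <= hw_threads so the machine is not oversubscribed.
    """
    if hw_threads < 1:
        raise ValueError("hw_threads must be at least 1")
    # Clamp the per-invocation thread count to the available budget.
    nvcc_threads = max(1, min(nvcc_threads, hw_threads))
    # Whatever budget remains goes to parallel jobs (processes).
    jobs = max(1, hw_threads // nvcc_threads)
    return jobs, nvcc_threads


if __name__ == "__main__":
    # Favoring jobs: every hardware thread runs its own compile process.
    print(split_build_parallelism(32, nvcc_threads=1))
    # Favoring threads: fewer processes, each with more nvcc threads.
    print(split_build_parallelism(32, nvcc_threads=4))
```

With the 32 hardware threads from the env above, the first call gives 32 jobs with 1 nvcc thread each, and the second gives 8 jobs with 4 nvcc threads each. The result above suggests that the first, job-heavy kind of split is the one that now compiles faster.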