Nice progress with this issue :ok_hand:
In general, it is more beneficial for compile times to split declarations from definitions, which creates more parallelism. We do that here and there, but it's rather left up to the individual developer; as with code formatting, this is a consequence of never having had the discussion about coding style.
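For illustration, the pattern looks roughly like this (file and function names are made up, not actual DAPHNE code):

```cpp
// sum.h -- declaration only; cheap for many translation units to include
#pragma once
#include <cstddef>

// Defined in sum.cpp, so TUs that include this header never recompile the body.
double sumBuffer(const double *data, std::size_t n);
```

```cpp
// sum.cpp -- definition; only this translation unit is rebuilt when the body changes
#include "sum.h"

double sumBuffer(const double *data, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}
```

With the definition in its own .cpp file, the build can compile it in parallel with its users, and edits to the body only trigger a rebuild of that one object file.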
I've seen quite a bit of time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved by these changes, from what I gathered from the images above (how did you generate those, btw?).
The CUDA kernels don't have as much impact since there are fewer of them, but they can also benefit from this, imho. In general I'm in favor of more granular compilation and reducing the weight of libAllKernels.
I don't want to hijack the discussion here, but I'll take the opportunity to draw attention to #474, where I never received feedback. Running `build.sh` from scratch not only takes a lot of time and resources for building; the git checkout of the LLVM mono repo alone takes a few minutes. Using a snapshot (which I mostly do, since the feature is implemented, but I didn't want to change the defaults without discussion) speeds this up considerably. :speedboat:
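For illustration only (the snapshot mechanism from #474 may well work differently), even a shallow, single-tag checkout avoids downloading the full LLVM history:

```bash
# Shallow clone of one pinned LLVM release tag instead of the whole mono-repo history.
# The tag and target directory are placeholders; adjust them to what the build expects.
git clone --depth 1 --branch llvmorg-18.1.8 \
    https://github.com/llvm/llvm-project.git thirdparty/llvm-project
```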
> I've seen quite a bit of time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved by these changes, from what I gathered from the images above (how did you generate those, btw?).
Mhm, from what I've seen, linking shouldn't take too long. Are you sure `cmake` is using `lld`? How did you measure that?
The images are generated by parsing the `build/.ninja_log` file with `ninjatracing` and viewing it here. See #516 for more.
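For reference, the workflow is roughly the following, assuming the `ninjatracing` script from github.com/nico/ninjatracing is on the PATH (the exact invocation may differ):

```bash
# Convert ninja's build log into Chrome trace-event JSON ...
ninjatracing build/.ninja_log > build_trace.json
# ... then open build_trace.json in chrome://tracing or https://ui.perfetto.dev
```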
`lld` links Clang 19 (1.56 GiB) and Chromium 124 (1.35 GiB) in ~5 seconds, so I can't imagine it being a huge part in DAPHNE. Also, I'm not sure what other options we have besides `mold`, which is from the same person as `lld`.
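For context, requesting `lld` from CMake usually boils down to a linker flag along these lines (a sketch, not necessarily how DAPHNE's CMakeLists does it):

```cmake
# Ask the compiler driver to invoke lld instead of the default linker
# for all targets defined after this point.
add_link_options("-fuse-ld=lld")
```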
I was relying on the configuration in our main CMakeLists to use LLD; I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and also, at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there, without looking into it further :see_no_evil:
> The images are generated by parsing the `build/.ninja_log` file with `ninjatracing` and viewing it here. See #516 for more.
:+1:
> I was relying on the configuration in our main CMakeLists to use LLD; I didn't do a study on it. I just didn't see a lot of speedup after this was introduced.
There should be an early message in the output saying `Using lld`. Otherwise I'm not really sure; I don't seem to have that problem :')
> There should be an early message in the output saying `Using lld`. Otherwise I'm not really sure; I don't seem to have that problem :')
I guess this does not work, as CMake prints both `-- Using lld` and, before that, `-- Linker detection: GNU ld`, and link times stay slow.
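A quick way to check what actually gets invoked (a sketch; the target name `daphne` and the `build` directory are assumptions):

```bash
# Print the link command ninja would run for the final binary and check
# whether -fuse-ld=lld actually appears in it.
ninja -C build -t commands daphne | grep -- '-fuse-ld'

# Confirm that lld is installed and on the PATH at all.
ld.lld --version
```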
I just tested these changes and can report an improvement from 2m47 -> 1m17 :1st_place_medal: :tada: :rocket:
Results from running in the container: the build reports success, but `libAllKernels.so` is not there, so running anything fails.
> Results from running in the container: the build reports success, but `libAllKernels.so` is not there, so running anything fails.
That should only have happened if you didn't pull the updates in this PR, as this has been fixed in 6532eaa9651c30e61a30e913761b16ac891b67a3. Could that be the case? For GPU to work you'd need the up-to-date PR.
Oh the PR is still in draft state. Further changes planned for this?
> Oh the PR is still in draft state. Further changes planned for this?
Actually it's no longer in draft, maybe I shouldn't put that in the title :)
Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.
Thanks a lot for testing this with CUDA @corepointer !
> Actually it's no longer in draft, maybe I shouldn't put that in the title :)
My bad for jumping to conclusions :see_no_evil:
LGTM - I ran the test suite once again locally and fine-tuned the commit message a bit while merging ;-)
Thx for your efforts @philipportner
This PR aims to improve the compile times of DAPHNE itself (`./build.sh`). Improving the compilation times of a fully fresh build, including dependencies, is not the goal of this PR.

To establish a baseline: on my machine, a fresh DAPHNE build normally takes about 2min 35s.

Before: ~2min 35s

Here we can see a profiled run of the current state (9a1b93e4fea27e91d14c18821be2c940adf63bf4).
I've made some changes, mainly splitting up the `kernels.cpp` file that is generated by `genKernelInst.py` into one compilation unit per kernel. This improves parallelization, as each translation unit can be built on its own. Yes, this does introduce some overhead from parsing the same headers in multiple translation units, but for now the results are far better than before. I also tried splitting into fewer translation units; one translation unit per kernel gave the fastest compile times.

Modifying a single header (or a couple of them) and recompiling now also only takes a few seconds, as only the changed translation units need to be rebuilt instead of all the kernels.
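Conceptually, each generated file now holds a single kernel instantiation; a rough sketch (names, paths, and the signature are invented for illustration and are not DAPHNE's actual kernel API):

```cpp
// add_kernel.h -- hypothetical kernel template; the definition lives in the header
#pragma once
#include <cstddef>

template <typename T>
void addKernel(T *res, const T *lhs, const T *rhs, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        res[i] = lhs[i] + rhs[i];
}
```

```cpp
// kernels/addKernel_double.cpp -- hypothetical generated per-kernel translation unit
#include "add_kernel.h"

// One explicit instantiation per generated file, so the files can be compiled
// in parallel and a header change only rebuilds the affected translation units.
template void addKernel<double>(double *, const double *, const double *, std::size_t);
```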
Additionally, I've made `libAllKernels` an `OBJECT` library. This enables parallel compilation of the kernels library and the compiler parts.

Current: ~1min 30s

With that, compilation times come down to about 1min 30s. The result can be seen here.
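A minimal sketch of the `OBJECT`-library idea in CMake (target and source names are placeholders, not the actual DAPHNE build files):

```cmake
# Compile the kernel sources straight to object files; no archive/link step has
# to finish before dependent targets can start compiling their own sources.
add_library(AllKernelsObj OBJECT kernels/addKernel_double.cpp kernels/mulKernel_double.cpp)

# Produce the shared library from the already-compiled objects.
add_library(AllKernels SHARED $<TARGET_OBJECTS:AllKernelsObj>)
```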
There still are some heavy hitters, the biggest offenders being the files that depend on `MLIRDaphne` and include `daphne.h`, and the `EigenCal.h` kernel, which brings in the `Eigen` dependency.

Removing these 4 kernels from compilation would reduce the compilation time to about 1min. IMO we should move these few kernels out of the `AllKernels` target to decouple the kernel runtime from the compiler infrastructure. `daphne.h` introduces a significant amount of overhead in every translation unit that includes it, and it is included all over the codebase. I think decoupling some of the MLIR-related parts from `daphne.h` would improve overall compilation times, but that should be done in the future, as this PR concerns itself only with the kernels library.

Open issues: