daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Improve compile times #829

Closed by philipportner 1 month ago

philipportner commented 1 month ago

This PR aims to improve the compile times of DAPHNE itself (`./build.sh`). Improving the compilation time of a fully fresh build, including dependencies, is not the goal of this PR.

To establish a baseline: on my machine, a fresh DAPHNE build normally takes about 2min 35s.

Before ~2min 35s

Here we can see a profiled run of the current state (9a1b93e4fea27e91d14c18821be2c940adf63bf4):

[Image: build trace of the state before the changes]

I've made some changes, mainly splitting up the `kernels.cpp` file generated by `genKernelInst.py` into one compilation unit per kernel. This improves parallelization, as each translation unit can be built on its own. Yes, this introduces some overhead from parsing the same headers in multiple translation units, but for now the results are far better than before. I also tried splitting into fewer translation units, but giving each kernel its own translation unit provided the fastest compile times.

Modifying and recompiling a single header (or a couple of them) now also takes only a few seconds, as only the changed translation units need to be recompiled instead of all the kernels.

Additionally, I've made libAllKernels an OBJECT library. This enables the kernels library and the compiler parts to be compiled in parallel.
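As a rough sketch of the approach (target and file names here are illustrative, not the actual DAPHNE CMake code), the per-kernel split plus the OBJECT library looks something like this:

```cmake
# Illustrative sketch only -- names do not match DAPHNE's real CMakeLists.
# One generated .cpp per kernel instead of a single monolithic kernels.cpp,
# so each translation unit can compile in parallel.
file(GLOB KERNEL_SOURCES ${CMAKE_CURRENT_BINARY_DIR}/kernels/*.cpp)

# An OBJECT library produces object files without an archive/link step,
# letting the kernel TUs build concurrently with the compiler parts.
add_library(AllKernelsObj OBJECT ${KERNEL_SOURCES})

# The final shared library just collects the already-compiled objects.
add_library(AllKernels SHARED $<TARGET_OBJECTS:AllKernelsObj>)
```

The tradeoff is exactly the one described above: every kernel TU re-parses the common headers, but the gain from parallel compilation outweighs that cost.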

Current ~1min 30s

With that, compilation time comes down to about 1min 30s. The result can be seen here:

[Image: build trace after the changes]

There are still some heavy hitters, the biggest offenders being the files that depend on MLIRDaphne and include daphne.h, as well as the EigenCal.h kernel, which pulls in the Eigen dependency.

Removing these 4 kernels from compilation would reduce the compile time down to about 1min. IMO we should move these few kernels out of the AllKernels target to decouple the kernel runtime from the compiler infrastructure. Including daphne.h introduces significant overhead in multiple translation units, and it is included all over the codebase. I think decoupling some of the MLIR-related parts from daphne.h would further improve overall compile times, but that should be done in the future, as this PR concerns itself only with the kernels library.

Open issues:

corepointer commented 1 month ago

Nice progress with this issue :ok_hand:

In general, splitting declarations from definitions is beneficial for compile times because it creates more parallelism. We're doing that here and there, but it's largely up to the individual developer. As with code formatting, this is a consequence of never having had the coding-style discussion.
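As a toy illustration of that split (hypothetical names, not actual DAPHNE kernels): dependents only need to parse the cheap declaration, while the definition compiles once in its own translation unit.

```cpp
// Hypothetical example, not actual DAPHNE code: splitting a kernel's
// declaration from its definition so that including it stays cheap.
#include <cassert>
#include <cstddef>
#include <vector>

// --- what would go in the header (e.g. kernels.h): declaration only ---
std::vector<double> ewAdd(const std::vector<double> &a, const std::vector<double> &b);

// --- what would go in its own TU (e.g. ewAdd.cpp): the definition ---
std::vector<double> ewAdd(const std::vector<double> &a, const std::vector<double> &b) {
    assert(a.size() == b.size() && "operands must have equal length");
    std::vector<double> res(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        res[i] = a[i] + b[i];
    return res;
}
```

Every file that includes only the declaration avoids re-parsing the definition and whatever heavy headers it drags in, and changing the definition recompiles just one TU.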

I've seen quite some time spent on linking; using the LLVM linker didn't improve things too dramatically. From the images above, it seems this is also greatly improved by these changes (how did you generate those, btw?).

corepointer commented 1 month ago

The CUDA kernels don't have as much impact, since there are fewer of them, but they could also benefit from this imho. In general I'm in favor of more granular compilation and of reducing the weight of libAllKernels.

corepointer commented 1 month ago

I don't want to hijack the discussion here, but I'll take the opportunity to raise attention to #474, where I never received feedback. Running build.sh from scratch not only takes a lot of time and resources for building; the git checkout of the LLVM mono repo alone takes a few minutes. Using a snapshot (which I mostly do, as the feature is already implemented, but I didn't want to change defaults without discussion) speeds this up considerably. :speedboat:

philipportner commented 1 month ago

> Nice progress with this issue 👌
>
> In general, splitting declarations from definitions is beneficial for compile times because it creates more parallelism. We're doing that here and there, but it's largely up to the individual developer. As with code formatting, this is a consequence of never having had the coding-style discussion.
>
> I've seen quite some time spent on linking; using the LLVM linker didn't improve things too dramatically. From the images above, it seems this is also greatly improved by these changes (how did you generate those, btw?).

Hmm, from what I've seen, linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?

The images are generated by parsing the `build/.ninja_log` file with ninjatracing and viewing the result here. See #516 for more.
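For reference, a minimal sketch of the kind of data such a tool extracts from `.ninja_log` (assuming the ninja log v5 format: a `# ninja log v5` header, then tab-separated start/end times in milliseconds, an mtime, the target path, and a command hash per line):

```python
# Sketch of parsing build/.ninja_log to find the slowest build steps,
# similar in spirit to what ninjatracing converts into a Chrome trace.
def slowest_targets(log_text, n=3):
    """Return the n longest-building targets as (target, seconds) pairs."""
    entries = []
    for line in log_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the "# ninja log v5" header and blank lines
        start_ms, end_ms, _mtime, target, _cmd_hash = line.split("\t")
        entries.append((target, (int(end_ms) - int(start_ms)) / 1000.0))
    entries.sort(key=lambda e: e[1], reverse=True)
    return entries[:n]
```

Feeding it `open("build/.ninja_log").read()` surfaces the heavy hitters directly; ninjatracing instead emits Chrome trace JSON that can be opened in `chrome://tracing` or Perfetto for the per-core timeline view shown in the images.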

philipportner commented 1 month ago

lld links Clang 19 (1.56 GiB) and Chromium 124 (1.35 GiB) in ~5 seconds, so I can't imagine it being a huge part of DAPHNE's build time. I'm also not sure what other options we have besides mold, which is by the same author as lld.

corepointer commented 1 month ago

> Nice progress with this issue 👌 In general, splitting declarations from definitions is beneficial for compile times because it creates more parallelism. We're doing that here and there, but it's largely up to the individual developer. As with code formatting, this is a consequence of never having had the coding-style discussion. I've seen quite some time spent on linking; using the LLVM linker didn't improve things too dramatically. From the images above, it seems this is also greatly improved by these changes (how did you generate those, btw?).

I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it; I just didn't see a lot of speedup after it was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there, without looking into it further :see_no_evil:

> Hmm, from what I've seen, linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?
>
> The images are generated by parsing the `build/.ninja_log` file with ninjatracing and viewing the result here. See #516 for more.

:+1:

philipportner commented 1 month ago

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it; I just didn't see a lot of speedup after it was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there, without looking into it further 🙈

There should be an early message output saying `Using lld`. Otherwise I'm not really sure; I don't seem to have that problem :')

corepointer commented 1 month ago

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it; I just didn't see a lot of speedup after it was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there, without looking into it further 🙈
>
> There should be an early message output saying `Using lld`. Otherwise I'm not really sure; I don't seem to have that problem :')

I guess this does not work: it prints both `-- Using lld` and, before that, `-- Linker detection: GNU ld`, and link times stay slow.
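One way to take the guesswork out of the status message (a sketch only, not DAPHNE's actual configuration) is to force lld explicitly and ask the toolchain at configure time which linker it actually invokes:

```cmake
# Sketch only: force lld for all targets instead of relying on detection.
add_link_options("-fuse-ld=lld")

# Ask the compiler driver to run its linker with --version; lld reports
# itself as "LLD <version>", while GNU ld reports "GNU ld".
execute_process(
    COMMAND ${CMAKE_CXX_COMPILER} -fuse-ld=lld -Wl,--version
    OUTPUT_VARIABLE LINKER_ID
    ERROR_QUIET)
message(STATUS "Linker actually invoked: ${LINKER_ID}")
```

If that message still shows GNU ld, the `-fuse-ld=lld` flag is not reaching the link lines, which would explain why the `Using lld` status message and the observed link times disagree.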

corepointer commented 1 month ago

I just tested these changes and can report an improvement from 2m47 -> 1m17 :1st_place_medal: :tada: :rocket:

corepointer commented 1 month ago

Results from running in the container: the build reports success, but libAllKernels.so is not there. Therefore, running anything fails.

philipportner commented 1 month ago

> Results from running in the container: the build reports success, but libAllKernels.so is not there. Therefore, running anything fails.

That should only happen if you didn't pull the updates in this PR, as this was fixed in 6532eaa9651c30e61a30e913761b16ac891b67a3. Could that be the case? For GPU to work you'd need the up-to-date PR.

corepointer commented 1 month ago

Oh the PR is still in draft state. Further changes planned for this?

philipportner commented 1 month ago

> Oh the PR is still in draft state. Further changes planned for this?

Actually it's no longer in draft, maybe I shouldn't put that in the title :)

philipportner commented 1 month ago

Tests are running fine now, both with and without CUDA. I'd merge it once the mystery of 20 and 104 is resolved.

Thanks a lot for testing this with CUDA @corepointer !

corepointer commented 1 month ago

> Oh the PR is still in draft state. Further changes planned for this?
>
> Actually it's no longer in draft, maybe I shouldn't put that in the title :)

My bad for jumping to conclusions :see_no_evil:

corepointer commented 1 month ago

LGTM - ran the test suite once again locally and fine-tuned the commit message a bit while merging ;-)

Thx for your efforts @philipportner