@sayerhs That's great, I'm glad you were able to get the per-target build times. It looks like the ngp_algorithms files are the worst offenders. I think any compilation unit with CUDA code has to be compiled twice, and templating probably compounds that. I wonder if we have any ngp kernels in headers that could be moved to .C files (see the sketch below), or if we have too much code in headers in general and could move more of it to .C files. Or perhaps some .C files are including headers they don't need...
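As a rough illustration of that kind of move (a hypothetical, simplified example using plain Kokkos; real ngp kernels are more involved):

```cpp
// Hypothetical before/after sketch; names are illustrative, not nalu-wind's.

// --- before.h: the body lives in the header, so every includer compiles
// --- it, and for CUDA builds nvcc device-compiles the lambda once per
// --- translation unit.
#pragma once
#include <Kokkos_Core.hpp>

inline void scale(Kokkos::View<double*> v, double a)
{
  Kokkos::parallel_for(
    "scale", v.extent(0), KOKKOS_LAMBDA(const int i) { v(i) *= a; });
}

// --- after.h: declaration only; includers see just the signature.
#pragma once
#include <Kokkos_Core.hpp>
void scale(Kokkos::View<double*> v, double a);

// --- after.C: the body is compiled (and device-compiled) exactly once.
#include "after.h"
void scale(Kokkos::View<double*> v, double a)
{
  Kokkos::parallel_for(
    "scale", v.extent(0), KOKKOS_LAMBDA(const int i) { v(i) *= a; });
}
```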
@sayerhs One more thing: the stk ngp code could be a culprit. It is all header-only, and perhaps much of it doesn't need to be. I'll look into that.
@sayerhs Just from browsing the code, I wonder if we could reduce build times by splitting NgpLoopUtils.h into several separate headers. For instance, I see that many .C files call run_entity_algorithm but not run_entity_par_reduce, or vice versa, and the same goes for the run_face_elem variants. Also, some .C files call a run_entity variant but not run_face_elem, and vice versa. So if we split up NgpLoopUtils.h so that each .C file only includes what it needs, that might help (see the sketch below).
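One possible shape of such a split (hypothetical signatures; the actual NgpLoopUtils.h API differs):

```cpp
// Hypothetical split; signatures are illustrative, not the real API.
// Being templates, the definitions stay in the headers, but each header
// now carries only one loop family (and only its transitive includes).

// --- ngp_utils/NgpRunEntityAlgorithm.h
#pragma once
template <typename Mesh, typename Functor>
void run_entity_algorithm(Mesh& mesh, const Functor& func)
{ /* entity-loop definition stays here */ }

// --- ngp_utils/NgpRunEntityParReduce.h
#pragma once
template <typename Mesh, typename Functor, typename ReduceVal>
void run_entity_par_reduce(Mesh& mesh, const Functor& func, ReduceVal& val)
{ /* parallel-reduce definition stays here */ }

// --- ngp_utils/NgpRunFaceElemAlgorithm.h
#pragma once
template <typename Mesh, typename Functor>
void run_face_elem_algorithm(Mesh& mesh, const Functor& func)
{ /* face-element-loop definition stays here */ }

// A .C file that only runs entity loops would then include just
// NgpRunEntityAlgorithm.h instead of all of NgpLoopUtils.h.
```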
@alanw0 I did some experiments on build times for `src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C`. Just compiling that file alone, repeatedly, gives approximately 122s for the build in `.ninja_log`.

Next, I removed all the templated functions in `include/ngp_utils/NgpLoopUtils.h` except `run_face_elem_algorithm` (and things called by it). This still gave a build time for this file of around 120s.

Then, after resetting back to `master`, I commented out this line https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C#L159 and attempted a build again. That gave me a build time of ~40.8s.

Applying the following patch (on a clean `master`) and building took ~70s:
```diff
diff --git a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
index f8ae4e3..98ae4c4 100644
--- a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
+++ b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
@@ -156,7 +156,8 @@ NodalGradPOpenBoundary<AlgTraits>::execute()
   );
 }
 
-INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+//INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+template class NodalGradPOpenBoundary<AlgTraitsQuad4Hex8>;
 
 } // namespace nalu
 } // namespace Sierra
```
The following patch took ~105.17s:

```diff
diff --git a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
index f8ae4e3..aca67fa 100644
--- a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
+++ b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
@@ -156,7 +156,7 @@ NodalGradPOpenBoundary<AlgTraits>::execute()
   );
 }
 
-INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+INSTANTIATE_KERNEL_FACE_ELEMENT_3D(NodalGradPOpenBoundary)
 
 } // namespace nalu
 } // namespace Sierra
```
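The spread in these timings tracks how many explicit instantiations each variant requests. A sketch of what the two macros might expand to (the trait lists here are hypothetical; the real macro definitions in nalu-wind differ):

```cpp
// Hypothetical expansions; only AlgTraitsQuad4Hex8 appears in the patches
// above, the other trait names are made up for illustration.
#define INSTANTIATE_KERNEL_FACE_ELEMENT(ALG)                                   \
  template class ALG<AlgTraitsQuad4Hex8>;                                      \
  template class ALG<AlgTraitsQuad4Wed6>;                                      \
  template class ALG<AlgTraitsTri3Tet4>;                                       \
  template class ALG<AlgTraitsEdge2DQuad42D>; /* ...2D pairs, etc. */

// The 3D-only variant drops the 2D pairs, hence ~105s instead of ~122s,
// while the single hand-written instantiation above compiles in ~70s.
#define INSTANTIATE_KERNEL_FACE_ELEMENT_3D(ALG)                                \
  template class ALG<AlgTraitsQuad4Hex8>;                                      \
  template class ALG<AlgTraitsQuad4Wed6>;                                      \
  template class ALG<AlgTraitsTri3Tet4>; /* ...remaining 3D pairs */
```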
@sayerhs that's interesting. In general, are we instantiating more than we need to? For debugging a particular case, can you still link if you only instantiate for the element type that you know you will need?
@alanw0 Since we use `stk::mesh::Part::topology()` at runtime to determine which exact instantiation we will use, we need all the possible combinations instantiated. See the following functions for an example.
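A minimal sketch of the kind of runtime dispatch meant here (a hypothetical factory; the real nalu-wind functions and constructor arguments differ):

```cpp
// Hypothetical factory: because the part's topology is only known at
// runtime, every branch -- and therefore every explicit instantiation --
// must already exist in the binary at link time.
Algorithm*
create_nodal_grad_open_alg(stk::topology topo, Realm& realm, stk::mesh::Part* part)
{
  if (topo == stk::topology::QUAD_4)
    return new NodalGradPOpenBoundary<AlgTraitsQuad4Hex8>(realm, part);
  if (topo == stk::topology::TRI_3)
    return new NodalGradPOpenBoundary<AlgTraitsTri3Tet4>(realm, part);
  throw std::runtime_error("NodalGradPOpenBoundary: unsupported face topology");
}
```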
What is curious to me is that when I commented out the line https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C#L159 it still took ~40s to compile that file. That seems a bit high, so I am wondering if we are pulling in far more headers than we need.
Another thing to do would be to follow the Tpetra idea of `Alg_{def,decl,fwd}.hpp` headers and then use separate files for ETI (explicit template instantiation). This would allow ninja to do more parallel builds and would prevent Summit from killing `cicc` because it is consuming too much memory per file.
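For reference, a minimal sketch of that layout (file names and the trait list are hypothetical, patterned on Tpetra's _decl/_def/_fwd convention):

```cpp
// --- MyAlg_fwd.hpp: forward declaration only; cheap to include anywhere.
template <typename AlgTraits> class MyAlg;

// --- MyAlg_decl.hpp: the full class declaration, no definitions.
template <typename AlgTraits>
class MyAlg
{
public:
  void execute();
};

// --- MyAlg_def.hpp: the definitions; included only by the ETI files below.
#include "MyAlg_decl.hpp"
template <typename AlgTraits>
void MyAlg<AlgTraits>::execute()
{ /* kernel launches live here */ }

// --- MyAlg_Quad4Hex8.C, MyAlg_Tri3Tet4.C, ...: one explicit instantiation
// --- per translation unit, so ninja can compile them in parallel and each
// --- cicc invocation stays within a modest per-process memory footprint.
#include "MyAlg_def.hpp"
template class MyAlg<AlgTraitsQuad4Hex8>;
```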
@sayerhs A quick question regarding the build time: can you check whether there are weak symbols in the library, and if so, how many? These indicate that some function or class is being recompiled in each file even though it already exists with the exact same template parameters somewhere else. We had that issue in MueLu; fixing it reduced both build time and library size.
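For context, a minimal illustration of where such weak symbols come from (hypothetical example; on ELF platforms `nm -C` marks weak symbols with `W`/`w`/`V`):

```cpp
// --- widely_included.h (hypothetical) ---
#pragma once

// A template defined in a header: every translation unit that uses
// clamp01<double> compiles its own copy and emits it as a weak symbol,
// which the linker must later deduplicate.
template <typename T>
T clamp01(T v)
{
  return v < T(0) ? T(0) : (v > T(1) ? T(1) : v);
}

// One possible remedy: suppress implicit instantiation in includers...
extern template double clamp01<double>(double);

// ...and compile the body exactly once in a single .C file:
//   template double clamp01<double>(double);
```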
@lucbv that's a good suggestion. I'll take a look at it.
Closing for now due to lack of updates. We can re-open if this is still a problem.
I am creating this issue to track nalu-wind build times on ORNL Summit GPU builds. Currently, Summit login nodes will kill a nalu-wind build job if it uses more than 8 parallel jobs (i.e., `-j 8` or lower will succeed, but `-j 16` or higher will die). For comparison, this doesn't happen when building Trilinos, where I am able to use the default of `-j 16`.

A fresh build of `nalu-wind` on Summit takes approximately 45 minutes for 546 files when restricted to 8 parallel jobs. Using the `.ninja_log`, here is a summary of the timings:

Nalu-Wind build stats

And for comparison, here are the stats for the Trilinos build with the same set of modules and configuration:

Trilinos build stats

nalu-wind-timings.txt
trilinos-timings.txt