DARMA-tasking / vt

DARMA/vt => Virtual Transport
Other
35 stars 9 forks source link

Figure out why build/test times increased #1331

Closed lifflander closed 3 years ago

lifflander commented 3 years ago

What Needs to be Done?

Research what happened here to solve the problem.

jstrzebonski commented 3 years ago

gcc -ftime-report might be useful.

jstrzebonski commented 3 years ago

Also this: https://github.com/mikael-s-persson/templight although it requires compiling clang from the source.

cz4rs commented 3 years ago

gcc -ftime-report might be useful.

a couple of promising possibilities here:

JacobDomagala commented 3 years ago

We have to keep in mind that before Production mode PR got merged, we were building vt:develop with debug printing disabled, so I would assume that's the main reason.

Looking at the termination.cc file compilation ( Ubuntu 20.04 Clang10 with -ftime-trace flag enabled ):

Release mode with vt_production_build_enabled=0

Compilation time - 37 sec image

Release mode with vt_production_build_enabled=1

Compilation time - ~16 sec image

lifflander commented 3 years ago

We have to keep in mind that before Production mode PR got merged, we were building vt:develop with debug printing disabled, so I would assume that's the main reason.

Production mode just enables assertions and debug prints (AFAIK). Assertions are fairly sparse so it must be the debug prints taking all the time? Can we confirm that hypothesis? It's still very surprising to me that the debug prints would be adding seconds to the build time!

@cz4rs @JacobDomagala

JacobDomagala commented 3 years ago

More isolated example:

Test environment

int main(int /argc/, char* /argv*/) {

for(int i = 0; i < 20; ++i) { vt_debug_print( barrier, node, "Dummy string: val={}, val2={}\n", 10, 2 ); } }



## Production mode = ON
![image](https://user-images.githubusercontent.com/9077677/111884860-8dc49c80-89c4-11eb-89ef-0bc3cac289f9.png)
**Total compile time** = 872ms 
**Time spent on:** Mostly on Frontend, meaning parsing include files. 

----------------------------------------------------------------------------
## Production mode = OFF
![image](https://user-images.githubusercontent.com/9077677/111884958-16dbd380-89c5-11eb-942a-bb6b40c57799.png)
**Total compile time** = 3.1 sec
**Time spent on:** Time spent on Frontent is roughly the same as before. Most time was spent on Backend part, which I assume is optimization. Looking into the pink vertical bars that represent the actual functions, we can see that most of them are FMT functions and DebugPrint calls. 
cz4rs commented 3 years ago

Some additional statistics (compilation times for termination.cc using clang-10):

Additionally, I compiled termination.cc at 9ec8955 (last commit before production mode PR). At that point, debug prints were disabled in release builds, but assertions (and soft errors) could be enabled using vt_ci_build option.

-#if !vt_debug_enabled -using DefaultConfig = Configuration<

lifflander commented 3 years ago

Thanks for all the detailed performance analysis @cz4rs and @JacobDomagala!

Maybe we can discuss this more in the Tuesday meeting. I'm thinking that it might be related to the templates for vt_debug_print. A long time ago, debugging prints were implemented with macros, which had some issues so we switched to templates. We might be able to simplify that code.. not sure.

I'm a little confused why assertions would add 20%. That's a pretty light weight macro!

cz4rs commented 3 years ago

This comes with an idea: https://github.com/fmtlib/fmt/issues/1537.

TODO: check if using core.h instead of format.h speeds up things.

--- a/src/vt/configs/debug/debug_print.h
+++ b/src/vt/configs/debug/debug_print.h
@@ -51,7 +51,7 @@
 #include "vt/configs/debug/debug_var_unused.h"

-#include <fmt/format.h>
+#include <fmt/core.h>

edit: Switching to core.h doesn't work - this is officially discouraged: https://github.com/fmtlib/fmt/blob/master/doc/api.rst#core-api and breaks the build anyways.

JacobDomagala commented 3 years ago

https://github.com/aras-p/ClangBuildAnalyzer looks like something that can be really useful. It also uses -ftime-trace and generates really nice raport. I will try it locally and see how it works.

JacobDomagala commented 3 years ago

ClangBuildAnalyzer output:

Click to see output **** Time summary: Compilation (772 times):
Parsing (frontend): 2483.1 s
Codegen & opts (backend): 4093.0 s

**** Files that took longest to parse (compiler frontend): **18892 ms:** vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_construct_no_idx.extended.cc.o **16705 ms:** vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_broadcast.extended.cc.o **16697 ms:** vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_send.extended.cc.o **14164 ms:** vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_index_types.extended.cc.o **12374 ms:** vt/tests/CMakeFiles/rdma_extended.dir/unit/rdma/test_rdma_collection_handle.extended.cc.o **12258 ms:** vt/tests/CMakeFiles/collection_basic.dir/unit/collection/test_construct.cc.o **11504 ms:** vt/tests/CMakeFiles/termination_basic.dir/unit/termination/test_term_dep_send_chain.cc.o **11368 ms:** vt/examples/CMakeFiles/tutorial.dir/tutorial.cc.o **10842 ms:** vt/tests/CMakeFiles/collection_basic.dir/unit/collection/test_broadcast.cc.o **10791 ms:** vt/src/CMakeFiles/vt.dir/vt/runtime/runtime.cc.o **** Files that took longest to codegen (compiler backend): 127238 ms: vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_construct_no_idx.extended.cc.o 108333 ms: vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_broadcast.extended.cc.o 104071 ms: vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_send.extended.cc.o 90193 ms: vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_index_types.extended.cc.o 61630 ms: vt/examples/CMakeFiles/tutorial.dir/tutorial.cc.o 60264 ms: vt/tests/CMakeFiles/rdma_extended.dir/unit/rdma/test_rdma_collection_handle.extended.cc.o 58095 ms: vt/tests/CMakeFiles/collection_basic.dir/unit/collection/test_construct.cc.o 57438 ms: vt/tests/CMakeFiles/termination_basic.dir/unit/termination/test_term_dep_send_chain.cc.o 53974 ms: vt/tests/CMakeFiles/collection_basic.dir/unit/collection/test_reduce_collection.cc.o 53819 ms: vt/examples/collection/CMakeFiles/transpose.dir/transpose.cc.o **** Templates that took longest to instantiate: 73224 ms: checkpoint::instantiateObjSerializer (383 times, avg 172 ms) 53602 ms: checkpoint::dispatch::vrt::InstantiateTupleHelper::EntityLocationCoord (484 times, avg 66 ms) 27856 ms: std::make_unique> (242 times, avg 115 ms) 22892 ms: checkpoint::dispatch::vrt::InstantiateTupleHelper::EntityLocationCoord (484 times, avg 38 ms) 18543 ms: checkpoint::dispatch::vrt::InstantiateIfPossible (297 times, avg 58 ms) 17261 ms: vt::debug::assert::assertOutInfo (22054 times, avg 11 ms) 231377 ms: std::_Hashtable<$> (23121 times, avg 10 ms) 142904 ms: std::unique_ptr<$> (14548 times, avg 9 ms) 137308 ms: std::__detail::_Hashtable_alloc<$> (23121 times, avg 5 ms) 125188 ms: std::__uniq_ptr_impl<$> (14548 times, avg 8 ms) 115417 ms: std::pair<$> (28775 times, avg 4 ms) 109155 ms: std::tuple<$> (27947 times, avg 3 ms) 107450 ms: std::__detail::_Hash_node<$> (22079 times, avg 4 ms) 106905 ms: std::__detail::_Hash_node_value_base<$> (22073 times, avg 4 ms) 105848 ms: __gnu_cxx::__aligned_buffer<$> (22071 times, avg 4 ms) 93448 ms: std::__and_<$> (48630 times, avg 1 ms) 92344 ms: checkpoint::dispatch::Traverse::with<$> (4917 times, avg 18 ms) 90844 ms: checkpoint::instantiateObjSerializer<$> (605 times, avg 150 ms) 90799 ms: checkpoint::dispatch::vrt::InstantiateTupleHelper<$>::_recur_register (605 times, avg 150 ms) 90404 ms: checkpoint::dispatch::vrt::instantiateObjSerializerImpl<$> (4235 times, avg 21 ms) 84829 ms: checkpoint::dispatch::SerializerDispatchNonByte<$>::apply<$> (3811 times, avg 22 ms) 81643 ms: checkpoint::dispatch::SerializerDispatchNonByte<$>::applyStatic<$> (3231 times, avg 25 ms) 80440 ms: checkpoint::dispatch::BasicDispatcher<$>::serializeIntrusive (3104 times, avg 25 ms) 79726 ms: vt::config::ApplyOp<$>::apply<$> (13185 times, avg 6 ms) 75792 ms: vt::config::CheckEnabled<$>::apply<$> (13185 times, avg 5 ms) 75266 ms: std::make_unique<$> (1557 times, avg 48 ms) 75084 ms: checkpoint::dispatch::vrt::InstantiateIfPossible<$>::registerIt (4235 times, avg 17 ms) 74087 ms: checkpoint::dispatch::vrt::serializer_registry::makeObjIdx<$> (4235 times, avg 17 ms) 73239 ms: vt::config::DispatchOp<$>::apply<$> (13185 times, avg 5 ms) 72832 ms: fmt::v7::format<$> (15680 times, avg 4 ms) 72702 ms: checkpoint::dispatch::vrt::serializer_registry::Registrar<$>::Regist... (4235 times, avg 17 ms) 68092 ms: vt::config::debugPrintImpl<$> (13185 times, avg 5 ms) 65906 ms: fmt::v7::detail::vformat_to<$> (383 times, avg 172 ms) 61303 ms: std::make_tuple<$> (6894 times, avg 8 ms) 58742 ms: testing::internal::TypeParameterizedTestSuite<$>::Register (21 times, avg 2797 ms) **** Functions that took longest to compile: 14487 ms: void fmt::v7::detail::fallback_format(double, int, bool, fmt... (/vt/tests/unit/main.cc) 7531 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (/vt/tests/unit/main.cc) 7348 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (/vt/tests/unit/main.cc) 7183 ms: void fmt::v7::detail::fallback_format(long double, int,... (/vt/tests/unit/main.cc) 5411 ms: std::enable_if<(std::tuple_size >::va... (/vt/tests/unit/main.cc) 4736 ms: fmt::v7::detail::vformat[abi:cxx11](fmt::v7::basic_string_view... (/vt/tests/unit/main.cc) 4578 ms: fmt::v7::detail::dragonbox::decimal_fp fmt::v7::detail::drag... (/vt/tests/unit/main.cc) 4492 ms: void vt::config::debugPrintImpl<(vt::config::CatEnum)512, (vt::confi... (/vt/tests/unit/main.cc) 4458 ms: char const* fmt::v7::detail::parse_replacement_field(double, int, fmt::v7::deta... (/vt/tests/unit/main.cc) 4320 ms: void vt::config::debugPrintImpl<(vt::config::CatEnum)512, (vt::confi... (/vt/tests/unit/main.cc) 4035 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 3576 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 3550 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 3493 ms: void fmt::v7::detail::vformat_to(fmt::v7::detail::buffer... (/vt/tests/unit/main.cc) 3144 ms: std::back_insert_iterator(double, int, fmt::v7::de... (/vt/tests/unit/main.cc) 2340 ms: int fmt::v7::detail::snprintf_float(long double, int, f... (/vt/tests/unit/main.cc) 2036 ms: fmt::v7::detail::buffer_appender fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 2012 ms: fmt::v7::detail::dragonbox::decimal_fp fmt::v7::detail::drago... (/vt/tests/unit/main.cc) 1981 ms: fmt::v7::detail::buffer_appender fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 1978 ms: fmt::v7::detail::buffer_appender fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 1943 ms: std::back_insert_iterator fmt::v7::detail::write(double, int, bool, fmt::v7:... (514 times, avg 105 ms) 53896 ms: std::enable_if<$>::type vt::debug::assert::assertOutInfo<$>(bool, st... (765 times, avg 70 ms) 41260 ms: fmt::v7::detail::int_writer<$>::on_num() (794 times, avg 51 ms) 26246 ms: void fmt::v7::detail::fallback_format<$>(long double, int, bool, fmt... (513 times, avg 51 ms) 26037 ms: checkpoint::dispatch::vrt::serializer_registry::Registrar<$>::Regist... (3043 times, avg 8 ms) 24260 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (187 times, avg 129 ms) 23745 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (186 times, avg 127 ms) 22564 ms: std::back_insert_iterator<$> fmt::v7::detail::write_float<$>(std::ba... (485 times, avg 46 ms) 20780 ms: vt::util::error::PrinterNameValue<$>::make[abi:cxx11](std::tuple<$> ... (2343 times, avg 8 ms) 19360 ms: std::_Hashtable<$>::_M_insert_unique_node(unsigned long, unsigned lo... (1686 times, avg 11 ms) 17910 ms: fmt::v7::detail::vformat[abi:cxx11](fmt::v7::basic_string_view<$>, f... (227 times, avg 78 ms) 17215 ms: char const* fmt::v7::detail::parse_replacement_field<$>(char const*,... (229 times, avg 75 ms) 16641 ms: fmt::v7::detail::dragonbox::decimal_fp<$> fmt::v7::detail::dragonbox... (243 times, avg 68 ms) 16034 ms: int fmt::v7::detail::format_float<$>(double, int, fmt::v7::detail::f... (247 times, avg 64 ms) 15998 ms: void vt::config::debugPrintImpl<$>(short, char const (&) [27], void*... (458 times, avg 34 ms) 14921 ms: void vt::config::debugPrintImpl<$>(short, char const (&) [30], void*... (458 times, avg 32 ms) 14611 ms: vt::auto_registry::RegistrarGen<$>::RegistrarGen() (2216 times, avg 6 ms) 14557 ms: fmt::v7::detail::buffer_appender<$> fmt::v7::detail::write_float<$>(... (495 times, avg 29 ms) 13346 ms: void fmt::v7::detail::vformat_to<$>(fmt::v7::detail::buffer<$>&, fmt... (225 times, avg 59 ms) 12536 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (135 times, avg 92 ms) 12132 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (125 times, avg 97 ms) 11860 ms: std::_Function_base::_Base_manager<$>::_M_manager(std::_Any_data&, s... (4590 times, avg 2 ms) 11767 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (129 times, avg 91 ms) 11573 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)8, (vt::config::C... (125 times, avg 92 ms) 10910 ms: fmt::v7::detail::format_handler<$>::on_format_specs(int, char const*... (227 times, avg 48 ms) 10899 ms: vt::messaging::MsgSharedPtr<$>::get() const (1292 times, avg 8 ms) 10029 ms: std::back_insert_iterator<$> fmt::v7::detail::write_float<$>(std::ba... (245 times, avg 40 ms) 8968 ms: int fmt::v7::detail::snprintf_float<$>(double, int, fmt::v7::detail:... (288 times, avg 31 ms) 8709 ms: void std::vector<$>::_M_realloc_insert<$>(__gnu_cxx::__normal_iterat... (437 times, avg 19 ms) 8588 ms: void vt::collective::reduce::Reduce::startReduce<$>(vt::util::adt::A... (181 times, avg 47 ms) *** Expensive headers: 1018992 ms: /vt/src/vt/transport.h (included 228 times, avg 4469 ms), included via: test_model_naive_persistence.nompi.cc.o (5782 ms) test_model_multiple_phases.nompi.cc.o (5728 ms) test_checkpoint.extended.cc.o (5618 ms) tutorial.cc.o (5414 ms) callback_context.cc.o (5348 ms) callback.cc.o (5289 ms) ... 700700 ms: /vt/tests/unit/test_harness.h (included 199 times, avg 3521 ms), included via: test_termination_action_common.cc.o test_parallel_harness.h (5147 ms) main.cc.o (5064 ms) main.cc.o (4985 ms) main.cc.o (4935 ms) test_storage.cc.o test_parallel_harness.h (4893 ms) test_invoke.cc.o test_parallel_harness.h (4886 ms) ... 328477 ms: /vt/src/vt/vrt/collection/collection_headers.h (included 229 times, avg 1434 ms), included via: main.cc.o test_harness.h data_message.h transport.h (1857 ms) test_mpi_collective.cc.o transport.h (1812 ms) test_model_multiple_phases.nompi.cc.o transport.h (1737 ms) test_demangler_utils.nompi.cc.o transport.h (1724 ms) main.cc.o test_harness.h data_message.h transport.h (1683 ms) test_broadcast.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (1682 ms) ... 304418 ms: /vt/src/vt/vrt/collection/manager.h (included 260 times, avg 1170 ms), included via: manager.cc.o (2639 ms) scheduler.cc.o (1741 ms) test_mpi_collective.cc.o transport.h collection_headers.h (1691 ms) main.cc.o test_harness.h data_message.h transport.h collection_headers.h (1680 ms) node_stats.cc.o (1645 ms) test_demangler_utils.nompi.cc.o transport.h collection_headers.h (1584 ms) ... 278523 ms: /vt/tests/unit/test_parallel_harness.h (included 181 times, avg 1538 ms), included via: test_termination_action_common.cc.o (5551 ms) test_invoke.cc.o (5325 ms) test_storage.cc.o (5306 ms) test_objgroup.cc.o test_objgroup_common.h (5193 ms) test_invoke.cc.o (5168 ms) test_broadcast.cc.o (4884 ms) ... 242186 ms: /vt/src/vt/config.h (included 382 times, avg 633 ms), included via: args.cc.o (1391 ms) context_vrtmanager.cc.o (1385 ms) rotatelb.cc.o (1328 ms) context_vrtmessage.cc.o context_vrtmessage.h (1214 ms) collective_alg.cc.o collective_alg.h (1193 ms) test_model_norm.nompi.cc.o transport.h (1181 ms) ... 228364 ms: /vt/src/vt/collective/collective_alg.h (included 273 times, avg 836 ms), included via: collective_alg.cc.o (3405 ms) trace.cc.o (2894 ms) collective_scope.cc.o (1869 ms) test_group.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (1172 ms) test_demangler_utils.nompi.cc.o transport.h (1171 ms) test_hops.extended.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (1168 ms) ... 226110 ms: /vt/src/vt/pool/pool.h (included 330 times, avg 685 ms), included via: rotatelb.cc.o rotatelb.h message.h message.h (1128 ms) basic.cc.o startup.h runtime_headers.h runtime_inst.h runtime_inst.impl.h runtime.h runtime_component_fwd.h sequencer_fwd.h seq_common.h activefn.h message.h message.h (1044 ms) context_vrt_attorney.cc.o context_vrt_attorney.h context_vrt.h context_vrt_fwd.h message.h message.h (1039 ms) test_active_send_large.cc.o transport.h (937 ms) test_model_naive_persistence.nompi.cc.o transport.h (924 ms) test_callback_bcast_collection.extended.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (924 ms) ... 198945 ms: /vt/src/vt/messaging/active.h (included 298 times, avg 667 ms), included via: rdma_collection.cc.o (2420 ms) async_op.cc.o (2363 ms) trace_user_event.cc.o (1597 ms) active.cc.o (1585 ms) epoch_tags.cc.o (1391 ms) collective_alg.cc.o collective_alg.h default_msg.h default_op.h pipe_callback_only.h cb_raw_base.h cb_raw.h callback_send_tl.h callback_send_tl.impl.h (1034 ms) ... 192750 ms: /vt/src/vt/pipe/callback/cb_union/cb_raw_base.h (included 276 times, avg 698 ms), included via: cb_raw_base.cc.o (1584 ms) collective_alg.cc.o collective_alg.h default_msg.h default_op.h pipe_callback_only.h (1104 ms) rotatelb.cc.o rotatelb.h baselb.h baselb_msgs.h reduce.h default_msg.h default_op.h pipe_callback_only.h (1043 ms) main.cc.o test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (920 ms) test_rdma_collection_handle.extended.cc.o transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (912 ms) hello_world_virtual_context_remote.cc.o transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (910 ms) ...

Edit: Updated the output with the raport for entire project (including tests and examples), not only /vt/src

cz4rs commented 3 years ago

Possibly related: https://github.com/DARMA-tasking/vt/issues/247.

lifflander commented 3 years ago

Here's my latest profile analysis using the changes to checkpoint to explicitly instantiate the templates for DiagnosticBase polymorphic serialization code.

Click to see output root@75f7345a504c:/build/ClangBuildAnalyzer-build# ./ClangBuildAnalyzer --stop /build/vt capture-21-3-19-01 Stopping build tracing and saving to 'capture-21-3-19-01'... done in 51.1s. Run 'ClangBuildAnalyzer --analyze capture-21-3-19-01' to analyze it. root@75f7345a504c:/build/ClangBuildAnalyzer-build# ./ClangBuildAnalyzer --analyze capture-21-3-19-01 root@75f7345a504c:/build/ClangBuildAnalyzer-build# ./ClangBuildAnalyzer --analyze capture-21-3-19-01 Analyzing build trace from 'capture-21-3-19-01'... **** Time summary: Compilation (397 times): Parsing (frontend): 9580.8 s Codegen & opts (backend): 15244.8 s **** Files that took longest to parse (compiler frontend): 64214 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_construct_no_idx.extended.cc.o 60100 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_broadcast.extended.cc.o 58688 ms: /build/vt/src/CMakeFiles/vt.dir/vt/runtime/runtime.cc.o 54419 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_index_types.extended.cc.o 53987 ms: /build/vt/tests/CMakeFiles/pool_basic.dir/unit/pool/test_pool.cc.o 53788 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_send.extended.cc.o 48720 ms: /build/vt/tests/CMakeFiles/mapping_basic.dir/unit/main.cc.o 47522 ms: /build/vt/tests/CMakeFiles/mapping_basic.dir/unit/mapping/test_mapping_registry.cc.o 47515 ms: /build/vt/tests/CMakeFiles/location_basic.dir/unit/location/test_location.cc.o 46389 ms: /build/vt/tests/CMakeFiles/location_basic.dir/unit/main.cc.o **** Files that took longest to codegen (compiler backend): 541136 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_construct_no_idx.extended.cc.o 471245 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_send.extended.cc.o 430484 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_broadcast.extended.cc.o 303072 ms: /build/vt/tests/CMakeFiles/collection_extended.dir/unit/collection/test_index_types.extended.cc.o 241031 ms: /build/vt/src/CMakeFiles/vt.dir/vt/runtime/component/diagnostic_value.cc.o 231704 ms: /build/vt/tests/CMakeFiles/rdma_extended.dir/unit/rdma/test_rdma_collection_handle.extended.cc.o 196937 ms: /build/vt/tests/CMakeFiles/termination_basic.dir/unit/termination/test_term_dep_send_chain.cc.o 192208 ms: /build/vt/tests/CMakeFiles/rdma_extended.dir/unit/rdma/test_rdma_static_sub_handle.extended.cc.o 188317 ms: /build/vt/examples/CMakeFiles/tutorial.dir/tutorial.cc.o 179659 ms: /build/vt/tests/CMakeFiles/collection_basic.dir/unit/collection/test_construct.cc.o **** Templates that took longest to instantiate: 204970 ms: fmt::v7::detail::vformat_to (394 times, avg 520 ms) 122283 ms: fmt::v7::detail::format_handler::EntityLocationCoord (488 times, avg 233 ms) 89217 ms: std::make_unique> (244 times, avg 365 ms) 83453 ms: checkpoint::instantiateObjSerializer::EntityLocationCoord (488 times, avg 150 ms) 62007 ms: checkpoint::dispatch::vrt::InstantiateTupleHelper (306 times, avg 167 ms) 51029 ms: vt::debug::assert::assertOutInfo, std::unique_ptr... (360 times, avg 117 ms) 41051 ms: vt::location::EntityLocationCoord (244 times, avg 168 ms) 41040 ms: std::_Hashtable, std::pair> (244 times, avg 164 ms) 36875 ms: std::unordered_map (38 times, avg 949 ms) 36081 ms: checkpoint::dispatch::vrt::InstantiateTupleHelper (394 times, avg 89 ms) 35423 ms: vt::runnable::RunnableNew::addContext (305 times, avg 116 ms) 35110 ms: fmt::v7::to_string (394 times, avg 89 ms) 34294 ms: std::__detail::_Hashtable_alloc (309809 times, avg 4 ms) 1075079 ms: std::unique_ptr<$> (15914 times, avg 67 ms) 1056601 ms: std::__or_<$> (579792 times, avg 1 ms) 1002542 ms: std::__uniq_ptr_impl<$> (15914 times, avg 62 ms) 987505 ms: std::unordered_map<$> (22623 times, avg 43 ms) 926549 ms: std::_Hashtable<$> (23717 times, avg 39 ms) 877722 ms: std::is_constructible<$> (162379 times, avg 5 ms) 858600 ms: std::__is_constructible_impl<$> (161063 times, avg 5 ms) 793484 ms: std::__is_direct_constructible<$> (140234 times, avg 5 ms) 779052 ms: std::__is_direct_constructible_new<$> (138601 times, avg 5 ms) 718817 ms: std::__is_referenceable<$> (377072 times, avg 1 ms) 704378 ms: std::add_rvalue_reference<$> (333042 times, avg 2 ms) 682645 ms: std::__is_direct_constructible_new_safe<$> (123243 times, avg 5 ms) 592883 ms: std::__not_<$> (375928 times, avg 1 ms) 568994 ms: std::is_object<$> (358618 times, avg 1 ms) 564433 ms: std::__detail::_Hashtable_alloc<$> (23717 times, avg 23 ms) 562385 ms: std::pair<$> (29195 times, avg 19 ms) 500316 ms: std::__detail::_Hash_node<$> (23442 times, avg 21 ms) 497235 ms: std::__detail::_Hash_node_value_base<$> (23335 times, avg 21 ms) 492546 ms: __gnu_cxx::__aligned_buffer<$> (23517 times, avg 20 ms) 435113 ms: std::is_destructible<$> (74570 times, avg 5 ms) 393506 ms: std::_TC<$>::_ConstructibleTuple<$> (25815 times, avg 15 ms) 382204 ms: std::tuple<$> (28546 times, avg 13 ms) 336951 ms: std::__is_direct_constructible_impl<$> (109291 times, avg 3 ms) 268133 ms: std::make_unique<$> (4649 times, avg 57 ms) 267963 ms: std::is_convertible<$> (133924 times, avg 2 ms) 246773 ms: vt::config::ApplyOp<$>::apply<$> (12557 times, avg 19 ms) 236332 ms: std::_Tuple_impl<$> (28536 times, avg 8 ms) 236215 ms: std::make_tuple<$> (6113 times, avg 38 ms) 234425 ms: vt::config::CheckEnabled<$>::apply<$> (12557 times, avg 18 ms) **** Functions that took longest to compile: 40500 ms: void fmt::v7::detail::fallback_format(double, int, bool, fmt... (/vt/tests/unit/main.cc) 21640 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (/vt/tests/unit/main.cc) 20273 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (/vt/tests/unit/main.cc) 18926 ms: void fmt::v7::detail::fallback_format(long double, int,... (/vt/tests/unit/main.cc) 14395 ms: void vt::debug::assert::assertOutInfo... (/vt/tests/unit/main.cc) 13606 ms: void vt::config::debugPrintImpl<(vt::config::CatEnum)512, (vt::confi... (/vt/tests/unit/main.cc) 13382 ms: char const* fmt::v7::detail::parse_replacement_field(double, int, fmt::v7::deta... (/vt/tests/unit/main.cc) 12639 ms: void vt::config::debugPrintImpl<(vt::config::CatEnum)512, (vt::confi... (/vt/tests/unit/main.cc) 11765 ms: fmt::v7::detail::dragonbox::decimal_fp fmt::v7::detail::drag... (/vt/tests/unit/main.cc) 11736 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 10677 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 10518 ms: void fmt::v7::detail::vformat_to(fmt::v7::detail::buffer... (/vt/tests/unit/main.cc) 10506 ms: fmt::v7::detail::int_writer, ... (/vt/tests/unit/main.cc) 9619 ms: std::back_insert_iterator(long double, int, f... (/vt/tests/unit/main.cc) 7264 ms: int fmt::v7::detail::snprintf_float(double, int, fmt::v7::de... (/vt/tests/unit/main.cc) 6932 ms: vt::runtime::Runtime::printStartupBanner() (/vt/src/vt/runtime/runtime_banner.cc) 6233 ms: fmt::v7::detail::dragonbox::decimal_fp fmt::v7::detail::drago... (/vt/tests/unit/main.cc) 6024 ms: fmt::v7::detail::buffer_appender fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 5823 ms: std::back_insert_iterator fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 5632 ms: fmt::v7::detail::buffer_appender fmt::v7::detail::write_float<... (/vt/tests/unit/main.cc) 5488 ms: void std::vector(double, int, bool, fmt::v7:... (546 times, avg 280 ms) 137681 ms: void vt::debug::assert::assertOutInfo<$>(bool, std::__cxx11::basic_s... (1530 times, avg 89 ms) 125416 ms: fmt::v7::detail::int_writer<$>::on_num() (1637 times, avg 76 ms) 83502 ms: std::_Function_base::_Base_manager<$>::_M_manager(std::_Any_data&, s... (5234 times, avg 15 ms) 77349 ms: void fmt::v7::detail::fallback_format<$>(long double, int, bool, fmt... (546 times, avg 141 ms) 73473 ms: std::back_insert_iterator<$> fmt::v7::detail::write_float<$>(std::ba... (1092 times, avg 67 ms) 67691 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (203 times, avg 333 ms) 67487 ms: std::_Hashtable<$>::_M_insert_unique_node(unsigned long, unsigned lo... (2094 times, avg 32 ms) 65549 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (205 times, avg 319 ms) 59580 ms: vt::auto_registry::RegistrarGen<$>::RegistrarGen() (2454 times, avg 24 ms) 53813 ms: fmt::v7::detail::vformat[abi:cxx11](fmt::v7::basic_string_view<$>, f... (316 times, avg 170 ms) 52877 ms: vt::util::error::PrinterNameValue<$>::make[abi:cxx11](std::tuple<$> ... (1831 times, avg 28 ms) 50323 ms: char const* fmt::v7::detail::parse_replacement_field<$>(char const*,... (428 times, avg 117 ms) 49598 ms: int fmt::v7::detail::format_float<$>(double, int, fmt::v7::detail::f... (545 times, avg 91 ms) 47061 ms: fmt::v7::detail::buffer_appender<$> fmt::v7::detail::write_float<$>(... (1092 times, avg 43 ms) 46216 ms: checkpoint::dispatch::vrt::serializer_registry::Registrar<$>::Regist... (1324 times, avg 34 ms) 45342 ms: fmt::v7::detail::dragonbox::decimal_fp<$> fmt::v7::detail::dragonbox... (545 times, avg 83 ms) 43425 ms: void vt::config::debugPrintImpl<$>(short, char const (&) [30], void*... (464 times, avg 93 ms) 43410 ms: vt::messaging::MsgSharedPtr<$>::get() const (1568 times, avg 27 ms) 43246 ms: void vt::config::debugPrintImpl<$>(short, char const (&) [27], void*... (464 times, avg 93 ms) 39558 ms: void fmt::v7::detail::vformat_to<$>(fmt::v7::detail::buffer<$>&, fmt... (337 times, avg 117 ms) 33173 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (146 times, avg 227 ms) 32015 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (134 times, avg 238 ms) 31665 ms: fmt::v7::detail::format_handler<$>::on_format_specs(int, char const*... (325 times, avg 97 ms) 31411 ms: __gnu_cxx::new_allocator<$>::~new_allocator() (9106 times, avg 3 ms) 31256 ms: std::back_insert_iterator<$> fmt::v7::detail::write_float<$>(std::ba... (546 times, avg 57 ms) 31105 ms: std::vector<$>::~vector() (2335 times, avg 13 ms) 30964 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)8, (vt::config::C... (132 times, avg 234 ms) 30502 ms: void std::vector<$>::_M_realloc_insert<$>(__gnu_cxx::__normal_iterat... (795 times, avg 38 ms) 30139 ms: void vt::config::DebugPrintOp<(vt::config::CatEnum)512, (vt::config:... (136 times, avg 221 ms) *** Expensive headers: 4558653 ms: /vt/src/vt/transport.h (included 229 times, avg 19906 ms), included via: test_pool.cc.o test_harness.h data_message.h (37910 ms) main.cc.o test_harness.h data_message.h (35439 ms) test_location.cc.o test_location_common.h data_message.h (34602 ms) test_mapping_registry.cc.o test_parallel_harness.h test_harness.h data_message.h (31109 ms) hello_world_functor.cc.o (30652 ms) main.cc.o test_harness.h data_message.h (30152 ms) ... 3168267 ms: /vt/tests/unit/test_harness.h (included 200 times, avg 15841 ms), included via: test_pool.cc.o (38316 ms) main.cc.o (35457 ms) test_mapping_registry.cc.o test_parallel_harness.h (31143 ms) main.cc.o (30184 ms) test_sequencer_nested.extended.cc.o test_parallel_harness.h (28350 ms) main.cc.o (28152 ms) ... 1402606 ms: /vt/src/vt/vrt/collection/collection_headers.h (included 230 times, avg 6098 ms), included via: main.cc.o test_harness.h data_message.h transport.h (12240 ms) main.cc.o test_harness.h data_message.h transport.h (11679 ms) main.cc.o test_harness.h data_message.h transport.h (10113 ms) test_epoch.nompi.cc.o test_harness.h data_message.h transport.h (9900 ms) test_sequencer_nested.extended.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (9639 ms) test_active_send_put.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (9522 ms) ... 1311138 ms: /vt/src/vt/vrt/collection/manager.h (included 1404 times, avg 933 ms), included via: manager.cc.o (14927 ms) lb_stats.cc.o (12720 ms) main.cc.o test_harness.h data_message.h transport.h collection_headers.h (10920 ms) main.cc.o test_harness.h data_message.h transport.h collection_headers.h (10670 ms) main.cc.o test_harness.h data_message.h transport.h collection_headers.h (9101 ms) test_epoch.nompi.cc.o test_harness.h data_message.h transport.h collection_headers.h (9090 ms) ... 1255367 ms: /vt/tests/unit/test_parallel_harness.h (included 182 times, avg 6897 ms), included via: test_mapping_registry.cc.o (31323 ms) test_sequencer_nested.extended.cc.o (28523 ms) test_sequencer_for.extended.cc.o (27905 ms) test_sequencer_parallel.extended.cc.o (27823 ms) test_sequencer_extensive.extended.cc.o (26641 ms) test_sequencer_vrt.extended.cc.o (26180 ms) ... 1132904 ms: /vt/src/vt/collective/collective_alg.h (included 286 times, avg 3961 ms), included via: trace.cc.o (13775 ms) collective_alg.cc.o (10556 ms) test_pool.cc.o test_harness.h data_message.h transport.h (10389 ms) main.cc.o test_harness.h data_message.h transport.h (8995 ms) main.cc.o test_harness.h data_message.h transport.h (8035 ms) test_lb_stats_reader.cc.o transport.h (7450 ms) ... 1067801 ms: /vt/src/vt/config.h (included 393 times, avg 2717 ms), included via: group_shallow_list.cc.o (9369 ms) test_location.cc.o test_location_common.h data_message.h transport.h (9317 ms) group_default.cc.o (7930 ms) group_range.cc.o (7509 ms) termination.cc.o (6668 ms) test_model_raw_data.nompi.cc.o transport.h (6461 ms) ... 1055588 ms: /vt/src/vt/pool/pool.h (included 339 times, avg 3113 ms), included via: test_pool.cc.o test_harness.h data_message.h transport.h (9532 ms) test_mapping_registry.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h (6712 ms) test_model_persistence_median_last_n.nompi.cc.o transport.h (6682 ms) termination.cc.o termination.h term_msgs.h message.h message.h (6380 ms) rdma_channel.cc.o rdma_channel.h activefn.h message.h message.h (6032 ms) main.cc.o test_harness.h data_message.h transport.h (5727 ms) ... 1052869 ms: /vt/src/vt/messaging/active.h (included 457 times, avg 2303 ms), included via: rdma_collection.cc.o (13816 ms) async_op.cc.o (11814 ms) trace_user_event.cc.o (9343 ms) epoch_tags.cc.o (8858 ms) phase_manager.cc.o headers.h manager.h manager.fwd.h manager.static.h (7968 ms) test_pool.cc.o test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h cb_raw_base.h cb_raw.h callback_send_tl.h callback_send_tl.impl.h (7801 ms) ... 998949 ms: /vt/src/vt/pipe/callback/cb_union/cb_raw_base.h (included 302 times, avg 3307 ms), included via: cb_raw_base.cc.o (10630 ms) test_pool.cc.o test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (8784 ms) rdma_channel.cc.o rdma.h rdma_msg.h (7545 ms) main.cc.o test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (7113 ms) main.cc.o test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (7096 ms) test_mapping_registry.cc.o test_parallel_harness.h test_harness.h data_message.h transport.h collective_alg.h default_msg.h default_op.h pipe_callback_only.h (6010 ms) ... done in 42.9s.
JacobDomagala commented 3 years ago

I also suggest using custom .ini file for https://github.com/aras-p/ClangBuildAnalyzer such as:

ClangBuildAnalyzer.ini

# ClangBuildAnalyzer reads ClangBuildAnalyzer.ini file from the working directory
# when invoked, and various aspects of reporting can be configured this way.
# This file example is setup to be exactly like what the defaults are.

# How many of most expensive things are reported?
[counts]

# files that took most time to parse
fileParse = 20
# files that took most time to generate code for
fileCodegen = 20
# functions that took most time to generate code for
function = 50
# header files that were most expensive to include
header = 20
# for each expensive header, this many include paths to it are shown
headerChain = 10
# templates that took longest to instantiate
template = 50

# Minimum times (in ms) for things to be recorded into trace
[minTimes]

# parse/codegen for a file
file = 20

[misc]

# Maximum length of symbol names printed; longer names will get truncated
maxNameLength = 1000

# Only print "root" headers in expensive header report, i.e.
# only headers that are directly included by at least one source file
onlyRootHeaders = true
cz4rs commented 3 years ago

Interesting read from the author of fmt: https://www.zverovich.net/2017/12/09/improving-compile-times.html.

cz4rs commented 3 years ago

reopening for #1347

cz4rs commented 3 years ago

JD`s work on build-stats is merged, so we can see the graphs in all their glory: image

The last drop is after #1374 was merged. Even if we take it with a grain of salt (no isolation for the build environment) this looks pretty decent :slightly_smiling_face:

lifflander commented 3 years ago

We've explored this quite a bit and will continue to try to reduce compile times in more specific ways.