izzys commented 5 years ago

Description

When running the example TimeTBB.cpp, I get the following results:

numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.284984
Without memory allocation, grain size = 10, time = 0.279206
Without memory allocation, grain size = 100, time = 0.256432
Without memory allocation, grain size = 1000, time = 0.253955
With memory allocation, grain size = 1, time = 0.422034
With memory allocation, grain size = 10, time = 0.444783
With memory allocation, grain size = 100, time = 0.437323
With memory allocation, grain size = 1000, time = 0.418359

With 4 threads:
Without memory allocation, grain size = 1, time = 4.46345
Without memory allocation, grain size = 10, time = 4.58412
Without memory allocation, grain size = 100, time = 4.66668
Without memory allocation, grain size = 1000, time = 4.60369
With memory allocation, grain size = 1, time = 5.07619
With memory allocation, grain size = 10, time = 5.38483
With memory allocation, grain size = 100, time = 5.23105
With memory allocation, grain size = 1000, time = 5.28864

With 8 threads:
Without memory allocation, grain size = 1, time = 5.24027
Without memory allocation, grain size = 10, time = 5.25576
Without memory allocation, grain size = 100, time = 5.2626
Without memory allocation, grain size = 1000, time = 5.25358
With memory allocation, grain size = 1, time = 5.95175
With memory allocation, grain size = 10, time = 5.93275
With memory allocation, grain size = 100, time = 5.92773
With memory allocation, grain size = 1000, time = 5.93785

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.0638485
4 threads, without allocation, grain size = 10, speedup = 0.0609071
4 threads, without allocation, grain size = 100, speedup = 0.0549497
4 threads, without allocation, grain size = 1000, speedup = 0.0551635
4 threads, with allocation, grain size = 1, speedup = 0.0831399
4 threads, with allocation, grain size = 10, speedup = 0.0825993
4 threads, with allocation, grain size = 100, speedup = 0.0836012
4 threads, with allocation, grain size = 1000, speedup = 0.0791052
8 threads, without allocation, grain size = 1, speedup = 0.0543836
8 threads, without allocation, grain size = 10, speedup = 0.0531238
8 threads, without allocation, grain size = 100, speedup = 0.0487273
8 threads, without allocation, grain size = 1000, speedup = 0.0483396
8 threads, with allocation, grain size = 1, speedup = 0.0709091
8 threads, with allocation, grain size = 10, speedup = 0.0749709
8 threads, with allocation, grain size = 100, speedup = 0.0737758
8 threads, with allocation, grain size = 1000, speedup = 0.0704562

Steps to reproduce

Just run the example.

Expected behavior

I would expect some speedup, not a slowdown.

Environment

Ubuntu Linux 16.04, Intel i7

Here is my CMake output:

-- GTSAM_SOURCE_ROOT_DIR: [/home/izzys/samples/gtsam_samples]
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   serialization
--   system
--   filesystem
--   thread
--   program_options
--   date_time
--   timer
--   chrono
--   regex
--   atomic
-- GTSAM_BOOST_LIBRARIES: optimized;/usr/lib/x86_64-linux-gnu/libboost_serialization.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_system.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_filesystem.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_thread.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_date_time.so;optimized;/usr/lib/x86_64-linux-gnu/libboost_regex.so;debug;/usr/lib/x86_64-linux-gnu/libboost_serialization.so;debug;/usr/lib/x86_64-linux-gnu/libboost_system.so;debug;/usr/lib/x86_64-linux-gnu/libboost_filesystem.so;debug;/usr/lib/x86_64-linux-gnu/libboost_thread.so;debug;/usr/lib/x86_64-linux-gnu/libboost_date_time.so;debug;/usr/lib/x86_64-linux-gnu/libboost_regex.so
Ignoring Boost restriction on optional lvalue assignment from rvalues
-- Found Eigen version: 3.3.7
-- Building 3rdparty
-- checking for thread-local storage - found
-- Could NOT find GeographicLib (missing: GeographicLib_LIBRARY_DIRS GeographicLib_LIBRARIES GeographicLib_INCLUDE_DIRS)
-- Building base
-- Building geometry
-- Building inference
-- Building symbolic
-- Building discrete
-- Building linear
-- Building nonlinear
-- Building sam
-- Building sfm
-- Building slam
-- Building smart
-- Building navigation
-- GTSAM Version: 4.0.0
-- Install prefix: /usr/local
-- Building GTSAM - shared: ON
-- Wrote /home/tc34738/samples/gtsam_samples/gtsam-build/GTSAMConfig.cmake
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- ===============================================================
-- ================ Configuration Options ======================
-- CMAKE_CXX_COMPILER_ID type : GNU
-- CMAKE_CXX_COMPILER_VERSION : 5.4.0
-- CMake version : 3.5.1
-- CMake generator : Unix Makefiles
-- CMake build tool : /usr/bin/make
-- Build flags
-- Build Tests : Enabled
-- Build examples with 'make all' : Enabled
-- Build timing scripts with 'make all': Disabled
-- Build shared GTSAM libraries : Enabled
-- Put build type in library name : Enabled
-- Build libgtsam_unstable : Disabled
-- Build for native architecture : Enabled
-- Build type : Release
-- C compilation flags : -O3 -DNDEBUG
-- C++ compilation flags : -O3 -DNDEBUG
-- GTSAM_COMPILE_FEATURES_PUBLIC :
-- GTSAM_COMPILE_OPTIONS_PRIVATE : -Wall;$<$:-g;-fno-inline>;$<$:-O3>;$<$:-g;-O3>;$<$:-O3>;$<$:-g;-O3>;-Wno-unused-local-typedefs
-- GTSAM_COMPILE_OPTIONS_PUBLIC : $<$:-std=c++11>;-march=native
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE : $<$:_DEBUG;EIGEN_INITIALIZE_MATRICES_BY_NAN>;$<$:NDEBUG>;$<$:NDEBUG;ENABLE_TIMING>;$<$:NDEBUG>;$<$:NDEBUG>
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC : BOOST_OPTIONAL_ALLOW_BINDING_TO_RVALUES;BOOST_OPTIONAL_CONFIG_ALLOW_BINDING_TO_RVALUES
-- GTSAM_COMPILE_OPTIONS_PRIVATE_DEBUG : -g;-fno-inline
-- GTSAM_COMPILE_OPTIONS_PUBLIC_DEBUG :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_DEBUG : _DEBUG;EIGEN_INITIALIZE_MATRICES_BY_NAN
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_DEBUG :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_RELEASE : -O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_RELEASE :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_RELEASE : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELEASE :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_TIMING : -g;-O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_TIMING :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_TIMING : NDEBUG;ENABLE_TIMING
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_TIMING :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_PROFILING : -O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_PROFILING :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_PROFILING : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_PROFILING :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_RELWITHDEBINFO : -g;-O3
-- GTSAM_COMPILE_OPTIONS_PUBLIC_RELWITHDEBINFO :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_RELWITHDEBINFO : NDEBUG
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELWITHDEBINFO :
-- GTSAM_COMPILE_OPTIONS_PRIVATE_MINSIZEREL :
-- GTSAM_COMPILE_OPTIONS_PUBLIC_MINSIZEREL :
-- GTSAM_COMPILE_DEFINITIONS_PRIVATE_MINSIZEREL :
-- GTSAM_COMPILE_DEFINITIONS_PUBLIC_MINSIZEREL :
-- Use System Eigen : OFF (Using version: 3.3.7)
-- Use Intel TBB : Yes
-- Eigen will use MKL : MKL found but GTSAM_WITH_EIGEN_MKL is disabled
-- Eigen will use MKL and OpenMP : OpenMP found but GTSAM_WITH_EIGEN_MKL is disabled
-- Default allocator : TBB
-- Build with ccache : No
-- Packaging flags
-- CPack Source Generator : TGZ
-- CPack Generator : TGZ
-- GTSAM flags
-- Quaternions as default Rot3 : Disabled
-- Runtime consistency checking : Disabled
-- Rot3 retract is full ExpMap : Disabled
-- Pose3 retract is full ExpMap : Disabled
-- Deprecated in GTSAM 4 allowed : Enabled
-- Point3 is typedef to Vector3 : Disabled
-- Metis-based Nested Dissection : Enabled
-- Use tangent-space preintegration: Enabled
-- Build Wrap : Disabled
-- MATLAB toolbox flags
-- Install matlab toolbox : Disabled
-- Cython toolbox flags
-- Install Cython toolbox : Disabled
-- ===============================================================
-- Configuring done
-- Generating done
-- Build files have been written to: /home/izzys/samples/gtsam_samples/gtsam-build

dellaert commented 5 years ago

@MandyXie could you try to reproduce?

MandyXie commented 5 years ago

I ran the example, and got the same issue as you mentioned. I will look into it, and try to figure out what is going on.

ProfFan commented 4 years ago

Side note: We can try to integrate a flamegraph library into GTSAM possibly replacing the gttic/toc machinery.

ProfFan commented 4 years ago

121

ProfFan commented 4 years ago

My results on macOS 10.14:

numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.150485
Without memory allocation, grain size = 10, time = 0.15183
Without memory allocation, grain size = 100, time = 0.149489
Without memory allocation, grain size = 1000, time = 0.152419
With memory allocation, grain size = 1, time = 0.351757
With memory allocation, grain size = 10, time = 0.320499
With memory allocation, grain size = 100, time = 0.314284
With memory allocation, grain size = 1000, time = 0.323573

With 4 threads:
Without memory allocation, grain size = 1, time = 0.162687
Without memory allocation, grain size = 10, time = 0.162498
Without memory allocation, grain size = 100, time = 0.146438
Without memory allocation, grain size = 1000, time = 0.150557
With memory allocation, grain size = 1, time = 0.192916
With memory allocation, grain size = 10, time = 0.200336
With memory allocation, grain size = 100, time = 0.196882
With memory allocation, grain size = 1000, time = 0.195918

With 8 threads:
Without memory allocation, grain size = 1, time = 0.160153
Without memory allocation, grain size = 10, time = 0.160778
Without memory allocation, grain size = 100, time = 0.161141
Without memory allocation, grain size = 1000, time = 0.161196
With memory allocation, grain size = 1, time = 0.198829
With memory allocation, grain size = 10, time = 0.199491
With memory allocation, grain size = 100, time = 0.199772
With memory allocation, grain size = 1000, time = 0.201396

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.924997
4 threads, without allocation, grain size = 10, speedup = 0.93435
4 threads, without allocation, grain size = 100, speedup = 1.02083
4 threads, without allocation, grain size = 1000, speedup = 1.01237
4 threads, with allocation, grain size = 1, speedup = 1.82337
4 threads, with allocation, grain size = 10, speedup = 1.59981
4 threads, with allocation, grain size = 100, speedup = 1.59631
4 threads, with allocation, grain size = 1000, speedup = 1.65157
8 threads, without allocation, grain size = 1, speedup = 0.939633
8 threads, without allocation, grain size = 10, speedup = 0.944346
8 threads, without allocation, grain size = 100, speedup = 0.927691
8 threads, without allocation, grain size = 1000, speedup = 0.945551
8 threads, with allocation, grain size = 1, speedup = 1.76914
8 threads, with allocation, grain size = 10, speedup = 1.60658
8 threads, with allocation, grain size = 100, speedup = 1.57321
8 threads, with allocation, grain size = 1000, speedup = 1.60665

dellaert commented 4 years ago

Wondering whether this is something we can fix by looking at where we lose time. Also the amount of parallelism depends on a good ordering, hence we should investigate whether using Metis for example gives us better bang for the buck. Finally, we could share this in the docs and a possible blog post, reminding people about parallelism in the Bayes tree, and possibly providing a flag to try and use the parallel branch or not...

ProfFan commented 4 years ago

Note that the FindTBB.cmake in GTSAM is also out of date (cannot find TBB 2019.U0). Replacing the file from the VTK repo works flawlessly.

ProfFan commented 4 years ago

Note that my previous result was wrong. On my Mac it is actually working, with up to about 4x improvement with TBB. I assume this is a bug specific to the environment (Ubuntu 16.04).

I have no time for this currently.

> $ ninja TimeTBB.run
[2/2] cd /Users/proffan/Projects/Development/VISION/gtsam_...n/Projects/Development/VISION/gtsam_build/examples/TimeTBB
/Users/proffan/Projects/Development/VISION/GTSAM/gtsam/3rdparty/Eigen/Eigen/src/Core/functors/UnaryFunctors.h:576:88: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/proffan/Projects/Development/VISION/GTSAM/gtsam/3rdparty/Eigen/Eigen/src/Core/functors/UnaryFunctors.h:576:88 in
numberOfProblems = 1000000
problemSize = 4
With 1 threads:
/usr/local/include/tbb/internal/../task.h:779:30: runtime error: member call on address 0x000116be3e00 which does not point to an object of type 'tbb::internal::scheduler'
0x000116be3e00: note: object is of type 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
 00 00 00 00  e8 1e 92 12 01 00 00 00  00 00 00 00 00 00 00 00  60 76 bf 16 01 00 00 00  60 76 bf 16
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/tbb/internal/../task.h:779:30 in
/usr/local/include/tbb/internal/../task.h:1046:23: runtime error: member call on address 0x000116be3e00 which does not point to an object of type 'tbb::internal::scheduler'
0x000116be3e00: note: object is of type 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
 00 00 00 00  e8 1e 92 12 01 00 00 00  00 00 00 00 00 00 00 00  60 76 bf 16 01 00 00 00  60 76 bf 16
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/include/tbb/internal/../task.h:1046:23 in
Without memory allocation, grain size = 1, time = 5.46733
Without memory allocation, grain size = 10, time = 5.59286
Without memory allocation, grain size = 100, time = 5.64539
Without memory allocation, grain size = 1000, time = 5.51933
With memory allocation, grain size = 1, time = 8.55949
With memory allocation, grain size = 10, time = 9.07178
With memory allocation, grain size = 100, time = 8.79069
With memory allocation, grain size = 1000, time = 8.66558

With 4 threads:
Without memory allocation, grain size = 1, time = 1.69261
Without memory allocation, grain size = 10, time = 1.68709
Without memory allocation, grain size = 100, time = 1.73469
Without memory allocation, grain size = 1000, time = 1.7691
With memory allocation, grain size = 1, time = 2.58719
With memory allocation, grain size = 10, time = 2.65104
With memory allocation, grain size = 100, time = 2.62247
With memory allocation, grain size = 1000, time = 2.74432

With 8 threads:
Without memory allocation, grain size = 1, time = 1.37712
Without memory allocation, grain size = 10, time = 1.46636
Without memory allocation, grain size = 100, time = 1.46375
Without memory allocation, grain size = 1000, time = 1.45783
With memory allocation, grain size = 1, time = 1.80873
With memory allocation, grain size = 10, time = 1.81393
With memory allocation, grain size = 100, time = 1.8269
With memory allocation, grain size = 1000, time = 1.84683

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 3.23012
4 threads, without allocation, grain size = 10, speedup = 3.31508
4 threads, without allocation, grain size = 100, speedup = 3.25442
4 threads, without allocation, grain size = 1000, speedup = 3.11986
4 threads, with allocation, grain size = 1, speedup = 3.30841
4 threads, with allocation, grain size = 10, speedup = 3.42198
4 threads, with allocation, grain size = 100, speedup = 3.35207
4 threads, with allocation, grain size = 1000, speedup = 3.15765
8 threads, without allocation, grain size = 1, speedup = 3.97013
8 threads, without allocation, grain size = 10, speedup = 3.81411
8 threads, without allocation, grain size = 100, speedup = 3.8568
8 threads, without allocation, grain size = 1000, speedup = 3.78598
8 threads, with allocation, grain size = 1, speedup = 4.73232
8 threads, with allocation, grain size = 10, speedup = 5.00116
8 threads, with allocation, grain size = 100, speedup = 4.81182
8 threads, with allocation, grain size = 1000, speedup = 4.69213

ProfFan commented 4 years ago

For the UBSAN panic here, it is a problem with TBB, https://github.com/RcppCore/RcppParallel/issues/36

In light of the code quality, I strongly believe it is an issue with the TBB supplied with Ubuntu 16.04.

@izzys Could you help us reproduce this bug on our side? We need your environment, TBB version, compile command line, etc. Many thanks!

acxz commented 4 years ago

I can reproduce the issue: mkdir build && cd build && cmake .. && make TimeTBB

gist

Hardware: Intel i7-7500U (2) @ 3.5GHz (having only two cores probably affects the times at higher thread counts)
OS: Arch Linux
TBB: 2020.2
GCC: 9.3

ProfFan commented 4 years ago

I'll add this to my todo list, but I'm not sure I will really have time for this.

dellaert commented 4 years ago

@ProfFan you do not have time for this :-) @acxz if you're motivated, this particular benchmark might not be the best to benchmark - rather, the other SolverComparer benchmark might.

zzodo commented 3 months ago

Any updates on this issue? I can still reproduce it on Ubuntu 22.04 LTS with the system-default TBB (2021.5), on both the 4.2.0 and develop branches. The test below was run on the develop branch.

$ ./examples/TimeTBB 
numberOfProblems = 1000000
problemSize = 4
With 1 threads:
Without memory allocation, grain size = 1, time = 0.332967
Without memory allocation, grain size = 10, time = 0.328845
Without memory allocation, grain size = 100, time = 0.328481
Without memory allocation, grain size = 1000, time = 0.328192
With memory allocation, grain size = 1, time = 0.369558
With memory allocation, grain size = 10, time = 0.369653
With memory allocation, grain size = 100, time = 0.368168
With memory allocation, grain size = 1000, time = 0.368071

With 4 threads:
Without memory allocation, grain size = 1, time = 2.13116
Without memory allocation, grain size = 10, time = 2.10212
Without memory allocation, grain size = 100, time = 2.11296
Without memory allocation, grain size = 1000, time = 2.11572
With memory allocation, grain size = 1, time = 2.39639
With memory allocation, grain size = 10, time = 2.40664
With memory allocation, grain size = 100, time = 2.43013
With memory allocation, grain size = 1000, time = 2.43989

With 8 threads:
Without memory allocation, grain size = 1, time = 3.15854
Without memory allocation, grain size = 10, time = 3.17693
Without memory allocation, grain size = 100, time = 3.17387
Without memory allocation, grain size = 1000, time = 3.17985
With memory allocation, grain size = 1, time = 3.45604
With memory allocation, grain size = 10, time = 3.50903
With memory allocation, grain size = 100, time = 3.51825
With memory allocation, grain size = 1000, time = 3.52622

Summary of results:
4 threads, without allocation, grain size = 1, speedup = 0.156237
4 threads, without allocation, grain size = 10, speedup = 0.156435
4 threads, without allocation, grain size = 100, speedup = 0.15546
4 threads, without allocation, grain size = 1000, speedup = 0.155121
4 threads, with allocation, grain size = 1, speedup = 0.154214
4 threads, with allocation, grain size = 10, speedup = 0.153597
4 threads, with allocation, grain size = 100, speedup = 0.151501
4 threads, with allocation, grain size = 1000, speedup = 0.150855
8 threads, without allocation, grain size = 1, speedup = 0.105418
8 threads, without allocation, grain size = 10, speedup = 0.10351
8 threads, without allocation, grain size = 100, speedup = 0.103496
8 threads, without allocation, grain size = 1000, speedup = 0.10321
8 threads, with allocation, grain size = 1, speedup = 0.106931
8 threads, with allocation, grain size = 10, speedup = 0.105343
8 threads, with allocation, grain size = 100, speedup = 0.104645
8 threads, with allocation, grain size = 1000, speedup = 0.104381

GTSAM build information:

$ sudo cmake ..
-- GTSAM is a shared library due to GTSAM_FORCE_SHARED_LIB
-- GTSAM_POSE3_EXPMAP=ON, enabling GTSAM_ROT3_EXPMAP as well
-- Found Eigen version: 3.3.7
-- checking for thread-local storage - found
-- Could NOT find MKL (missing: MKL_INCLUDE_DIR MKL_LIBRARIES) 
-- Found Google perftools: 
-- Building 3rdparty
-- Could NOT find GeographicLib (missing: GeographicLib_LIBRARY_DIRS GeographicLib_LIBRARIES GeographicLib_INCLUDE_DIRS) 
-- Building base
-- Building basis
-- Building geometry
-- Building inference
-- Building symbolic
-- Building discrete
-- Building hybrid
-- Building linear
-- Building nonlinear
-- Building sam
-- Building sfm
-- Building slam
-- Building navigation
-- GTSAM Version: 4.3a0
-- Install prefix: /usr/local
-- Building GTSAM - as a SHARED library
-- Wrote /opt/gtsam/build/GTSAMConfig.cmake
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) 
-- ===============================================================
-- ================  Configuration Options  ======================
--  CMAKE_CXX_COMPILER_ID type                       : GNU
--  CMAKE_CXX_COMPILER_VERSION                       : 11.4.0
--  CMake version                                    : 3.22.1
--  CMake generator                                  : Unix Makefiles
--  CMake build tool                                 : /usr/bin/gmake
-- Build flags                                               
--  Build Tests                                      : Disabled
--  Build examples with 'make all'                   : Disabled
--  Build timing scripts with 'make all'             : Disabled
--  Build shared GTSAM libraries                     : Enabled
--  Put build type in library name                   : Enabled
--  Build libgtsam_unstable                          : Disabled
--  Build GTSAM unstable Python                      : Disabled
--  Build MATLAB Toolbox for unstable                : Disabled
--  Build for native architecture                    : Disabled
--  Build type                                       : Release
--  C compilation flags                              :  -O3 -DNDEBUG
--  C++ compilation flags                            :  -O3 -DNDEBUG
--  Enable Boost serialization                       : ON
--  GTSAM_COMPILE_FEATURES_PUBLIC                    : cxx_std_17
--  GTSAM_COMPILE_OPTIONS_PUBLIC                     : 
--  GTSAM_COMPILE_DEFINITIONS_PUBLIC                 : 
--  GTSAM_COMPILE_OPTIONS_PUBLIC_RELEASE             : 
--  GTSAM_COMPILE_DEFINITIONS_PUBLIC_RELEASE         : 
--  Use System Eigen                                 : ON (Using version: 3.3.7)
--  Use System Metis                                 : OFF
--  Using Boost version                              : 1.74.0
--  Use Intel TBB                                    : Yes (Version: 2021.5.0)
--  Eigen will use MKL                               : MKL not found
--  Eigen will use MKL and OpenMP                    : OpenMP found but GTSAM_WITH_EIGEN_MKL is disabled
--  Default allocator                                : TBB
--  Cheirality exceptions enabled                    : YES
--  Build with ccache                                : No
-- Packaging flags
--  CPack Source Generator                           : TGZ
--  CPack Generator                                  : TGZ
-- GTSAM flags                                               
--  Quaternions as default Rot3                      : Disabled
--  Runtime consistency checking                     : Disabled
--  Build with Memory Sanitizer                      : Disabled
--  Rot3 retract is full ExpMap                      : Enabled
--  Pose3 retract is full ExpMap                     : Enabled
--  Enable branch merging in DecisionTree            : Enabled
--  Allow features deprecated in GTSAM 4.3           : Enabled
--  Metis-based Nested Dissection                    : Enabled
--  Use tangent-space preintegration                 : Enabled
-- MATLAB toolbox flags
--  Install MATLAB toolbox                           : Disabled
-- Python toolbox flags                                      
--  Build Python module with pybind                  : Disabled
-- ===============================================================
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/gtsam/build

zzodo commented 3 months ago

Another example, using the SolverComparer benchmark mentioned above:

$ ./examples/SolverComparer --incremental -d w10000 -o w_inc --threads 8
Loading dataset w10000
Using 8 threads
Looking for first measurement from step 0
Looks like 0 is the first time step, so adding a prior on it
Playing forward time steps...
chi2 = -nan
Step 0
-Total: 0 CPU (0 times, 0 wall, 0 children, min: 0 max: 0)
|   -Collect measurements: 0 CPU (1 times, 2e-06 wall, 0 children, min: 0 max: 0)
|   -Update ISAM2: 0 CPU (1 times, 2e-06 wall, 0 children, min: 0 max: 0)
|   -chi2: 0 CPU (1 times, 3.4e-05 wall, 0 children, min: 0 max: 0)
chi2 = 0.00172843
Step 1000
-Total: 0 CPU (0 times, 0 wall, 0.71 children, min: 0 max: 0)
|   -Collect measurements: 0.08 CPU (1001 times, 0.030624 wall, 0.08 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.63 CPU (1001 times, 0.11614 wall, 0.63 children, min: 0 max: 0.01)
|   -chi2: 0 CPU (2 times, 0.000611 wall, 0 children, min: 0 max: 0)
chi2 = 0.00175299
Step 2000
-Total: 0 CPU (0 times, 0 wall, 1.85 children, min: 0 max: 0)
|   -Collect measurements: 0.18 CPU (2001 times, 0.093793 wall, 0.18 children, min: 0 max: 0.01)
|   -Update ISAM2: 1.67 CPU (2001 times, 0.334617 wall, 1.67 children, min: 0 max: 0.02)
|   -chi2: 0 CPU (3 times, 0.001946 wall, 0 children, min: 0 max: 0)
chi2 = 0.00177148
Step 3000
-Total: 0 CPU (0 times, 0 wall, 4.29 children, min: 0 max: 0)
|   -Collect measurements: 0.52 CPU (3001 times, 0.358602 wall, 0.52 children, min: 0 max: 0.01)
|   -Update ISAM2: 3.77 CPU (3001 times, 0.948901 wall, 3.77 children, min: 0 max: 0.02)
|   -chi2: 0 CPU (4 times, 0.005088 wall, 0 children, min: 0 max: 0)
chi2 = 0.00177683
Step 4000
-Total: 0 CPU (0 times, 0 wall, 7.88 children, min: 0 max: 0)
|   -Collect measurements: 1.09 CPU (4001 times, 0.882309 wall, 1.09 children, min: 0 max: 0.02)
|   -Update ISAM2: 6.78 CPU (4001 times, 1.9246 wall, 6.78 children, min: 0 max: 0.05)
|   -chi2: 0.01 CPU (5 times, 0.008837 wall, 0.01 children, min: 0.01 max: 0.01)
chi2 = 0.00177331
Step 5000
-Total: 0 CPU (0 times, 0 wall, 11.41 children, min: 0 max: 0)
|   -Collect measurements: 1.47 CPU (5001 times, 1.19369 wall, 1.47 children, min: 0 max: 0.02)
|   -Update ISAM2: 9.93 CPU (5001 times, 2.90427 wall, 9.93 children, min: 0.01 max: 0.05)
|   -chi2: 0.01 CPU (6 times, 0.014129 wall, 0.01 children, min: 0 max: 0.01)
chi2 = 0.00178298
Step 6000
-Total: 0 CPU (0 times, 0 wall, 16.1 children, min: 0 max: 0)
|   -Collect measurements: 2.55 CPU (6001 times, 2.11069 wall, 2.55 children, min: 0 max: 0.02)
|   -Update ISAM2: 13.54 CPU (6001 times, 4.45979 wall, 13.54 children, min: 0.01 max: 0.09)
|   -chi2: 0.01 CPU (7 times, 0.022692 wall, 0.01 children, min: 0 max: 0.01)
chi2 = 0.00177962
Step 7000
-Total: 0 CPU (0 times, 0 wall, 19.68 children, min: 0 max: 0)
|   -Collect measurements: 3.37 CPU (7001 times, 2.9156 wall, 3.37 children, min: 0 max: 0.02)
|   -Update ISAM2: 16.29 CPU (7001 times, 5.65427 wall, 16.29 children, min: 0 max: 0.11)
|   -chi2: 0.02 CPU (8 times, 0.029358 wall, 0.02 children, min: 0.01 max: 0.01)
chi2 = 0.00177708
Step 8000
-Total: 0 CPU (0 times, 0 wall, 23.28 children, min: 0 max: 0)
|   -Collect measurements: 4.16 CPU (8001 times, 3.72301 wall, 4.16 children, min: 0 max: 0.02)
|   -Update ISAM2: 19.09 CPU (8001 times, 6.79453 wall, 19.09 children, min: 0.01 max: 0.11)
|   -chi2: 0.03 CPU (9 times, 0.041096 wall, 0.03 children, min: 0.01 max: 0.01)
chi2 = 0.00177835
Step 9000
-Total: 0 CPU (0 times, 0 wall, 29.51 children, min: 0 max: 0)
|   -Collect measurements: 6.08 CPU (9001 times, 5.58775 wall, 6.08 children, min: 0 max: 0.02)
|   -Update ISAM2: 23.38 CPU (9001 times, 9.15137 wall, 23.38 children, min: 0 max: 0.17)
|   -chi2: 0.05 CPU (10 times, 0.055059 wall, 0.05 children, min: 0.02 max: 0.02)
Writing output file w_inc
unregistered class - derived class not registered or exported

$ ./examples/SolverComparer --incremental -d w10000 -o w_inc --threads 4
Loading dataset w10000
Using 4 threads
Looking for first measurement from step 0
Looks like 0 is the first time step, so adding a prior on it
Playing forward time steps...
chi2 = -nan
Step 0
-Total: 0 CPU (0 times, 0 wall, 0 children, min: 0 max: 0)
|   -Collect measurements: 0 CPU (1 times, 1e-06 wall, 0 children, min: 0 max: 0)
|   -Update ISAM2: 0 CPU (1 times, 1e-06 wall, 0 children, min: 0 max: 0)
|   -chi2: 0 CPU (1 times, 3.3e-05 wall, 0 children, min: 0 max: 0)
chi2 = 0.00172843
Step 1000
-Total: 0 CPU (0 times, 0 wall, 0.36 children, min: 0 max: 0)
|   -Collect measurements: 0.07 CPU (1001 times, 0.030023 wall, 0.07 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.29 CPU (1001 times, 0.108681 wall, 0.29 children, min: 0 max: 0.01)
|   -chi2: 0 CPU (2 times, 0.00063 wall, 0 children, min: 0 max: 0)
chi2 = 0.00175299
Step 2000
-Total: 0 CPU (0 times, 0 wall, 0.98 children, min: 0 max: 0)
|   -Collect measurements: 0.15 CPU (2001 times, 0.091805 wall, 0.15 children, min: 0 max: 0.01)
|   -Update ISAM2: 0.82 CPU (2001 times, 0.31337 wall, 0.82 children, min: 0 max: 0.01)
|   -chi2: 0.01 CPU (3 times, 0.001879 wall, 0.01 children, min: 0.01 max: 0.01)
chi2 = 0.00177148
Step 3000
-Total: 0 CPU (0 times, 0 wall, 2.5 children, min: 0 max: 0)
|   -Collect measurements: 0.37 CPU (3001 times, 0.355865 wall, 0.37 children, min: 0 max: 0.01)
|   -Update ISAM2: 2.11 CPU (3001 times, 0.910435 wall, 2.11 children, min: 0 max: 0.02)
|   -chi2: 0.02 CPU (4 times, 0.00504 wall, 0.02 children, min: 0.01 max: 0.01)
chi2 = 0.00177683
Step 4000
-Total: 0 CPU (0 times, 0 wall, 4.89 children, min: 0 max: 0)
|   -Collect measurements: 0.92 CPU (4001 times, 0.877368 wall, 0.92 children, min: 0 max: 0.01)
|   -Update ISAM2: 3.94 CPU (4001 times, 1.85701 wall, 3.94 children, min: 0 max: 0.04)
|   -chi2: 0.03 CPU (5 times, 0.008958 wall, 0.03 children, min: 0.01 max: 0.01)
chi2 = 0.00177331
Step 5000
-Total: 0 CPU (0 times, 0 wall, 7.08 children, min: 0 max: 0)
|   -Collect measurements: 1.16 CPU (5001 times, 1.18555 wall, 1.16 children, min: 0 max: 0.01)
|   -Update ISAM2: 5.88 CPU (5001 times, 2.80783 wall, 5.88 children, min: 0 max: 0.04)
|   -chi2: 0.04 CPU (6 times, 0.014088 wall, 0.04 children, min: 0.01 max: 0.01)
chi2 = 0.00178298
Step 6000
-Total: 0 CPU (0 times, 0 wall, 10.64 children, min: 0 max: 0)
|   -Collect measurements: 2.02 CPU (6001 times, 2.10637 wall, 2.02 children, min: 0 max: 0.01)
|   -Update ISAM2: 8.57 CPU (6001 times, 4.3782 wall, 8.57 children, min: 0.01 max: 0.09)
|   -chi2: 0.05 CPU (7 times, 0.023237 wall, 0.05 children, min: 0.01 max: 0.01)
chi2 = 0.00177962
Step 7000
-Total: 0 CPU (0 times, 0 wall, 13.63 children, min: 0 max: 0)
|   -Collect measurements: 2.9 CPU (7001 times, 2.91521 wall, 2.9 children, min: 0 max: 0.01)
|   -Update ISAM2: 10.67 CPU (7001 times, 5.61952 wall, 10.67 children, min: 0 max: 0.09)
|   -chi2: 0.06 CPU (8 times, 0.030003 wall, 0.06 children, min: 0.01 max: 0.01)
chi2 = 0.00177708
Step 8000
-Total: 0 CPU (0 times, 0 wall, 16.33 children, min: 0 max: 0)
|   -Collect measurements: 3.77 CPU (8001 times, 3.71799 wall, 3.77 children, min: 0 max: 0.02)
|   -Update ISAM2: 12.49 CPU (8001 times, 6.74208 wall, 12.49 children, min: 0.01 max: 0.09)
|   -chi2: 0.07 CPU (9 times, 0.041187 wall, 0.07 children, min: 0.01 max: 0.01)
chi2 = 0.00177835
Step 9000
-Total: 0 CPU (0 times, 0 wall, 21.79 children, min: 0 max: 0)
|   -Collect measurements: 5.86 CPU (9001 times, 5.61159 wall, 5.86 children, min: 0 max: 0.02)
|   -Update ISAM2: 15.85 CPU (9001 times, 9.07511 wall, 15.85 children, min: 0.01 max: 0.11)
|   -chi2: 0.08 CPU (10 times, 0.055581 wall, 0.08 children, min: 0.01 max: 0.01)
Writing output file w_inc
unregistered class - derived class not registered or exported

My laptop has a hybrid CPU (Intel Core i7-13700H). I also tried TBB version 2021.12, which is newer than v2021.9.0, the release announced as compatible with hybrid CPUs.

borglab / gtsam

TimeTBB example: no speedup - actually slower by a factor X20 #92
