Closed GiulioRomualdi closed 3 years ago
Note the different number of iterations between the two MUMPS versions. I wonder what could be the cause.
Note that in Scenario 1 the number of evaluations is different from Scenario 2, so I guess that the ipopt version is playing a role here. When you compile ipopt from source you get 3.13.4, while apt ships quite an old ipopt, 3.11.9 (https://repology.org/project/ipopt/versions), so I wonder if this is relevant. Just as a recap, I think these are the versions used now (@GiulioRomualdi correct me if I am wrong):
| Scenario | IPOPT version | MUMPS version |
|---|---|---|
| Scenario 1 | 3.11.9 | 5.2.1 |
| Scenario 2 | 3.13.4 | 4.10.0 |
Comment https://github.com/dic-iit/bipedal-locomotion-framework/issues/219#issuecomment-788897904 updated with the correct MUMPS version
I think it would be interesting to test this with the conda-forge provided binaries, since there we have a recent ipopt together with a recent mumps. My WSL2 installation is currently a bit tricky to use due to disk usage limits, but if you want to try, the installation steps should be just:
Install miniforge3 following the instructions in https://github.com/robotology/robotology-superbuild/blob/master/doc/conda-forge.md#install-a-conda-distribution . Note that this should install everything only under `~/Miniforge3`, so you do not risk spoiling your laptop: at any moment you can just run `rm -rf ~/Miniforge3` to remove all of it.

Once you have installed miniforge3, you can create an environment for these tests and activate it, installing all the required dependencies:
```
conda create -n blf-perf-test
conda activate blf-perf-test
conda install -c robotology-staging idyntree yarp eigen qhull casadi cppad ipopt manif
```
Then create a new build of blf and configure and build it with CMake as usual, but in a terminal in which you have activated the `blf-perf-test` environment.

Run the `TimeVaryingDCMPlannerTest` test as usual.
This should give us a Scenario 3, at least for IPOPT + MUMPS:

| Scenario | IPOPT version | MUMPS version |
|---|---|---|
| Scenario 3 | 3.13.4 | 5.2.1 |
As you may also have read on Twitter, manif is now available on conda-forge, so I edited the previous comment to reflect that.
Scenario 3 - Conda

I ran the scenario in this docker image: https://hub.docker.com/r/condaforge/miniforge3
```
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit https://github.com/coin-or/Ipopt
******************************************************************************

    solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
     nlp_f  |   4.38ms ( 57.66us)   4.39ms ( 57.76us)        76
     nlp_g  |   5.68ms ( 74.72us)   5.67ms ( 74.59us)        76
  nlp_grad  | 273.00us (273.00us) 272.63us (272.63us)         1
nlp_grad_f  |   2.62ms (104.96us)   2.62ms (104.96us)        25
nlp_hess_l  |   4.25ms (202.38us)   4.26ms (202.67us)        21
 nlp_jac_g  |   4.55ms (168.52us)   4.56ms (168.75us)        27
     total  |   2.80 s (  2.80 s)   1.63 s (  1.63 s)         1
```
3 seconds??
I know :sob:
@GiulioRomualdi can you please post the output of `mamba list`?
# packages in environment at /opt/conda/envs/blf-perf-test:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
ace 7.0.0 h9c3ff4c_1 conda-forge
ampl-mp 3.1.0 h616b090_1004 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2020.12.5 ha878542_0 conda-forge
casadi 3.5.5 py39h19f53c4_3 conda-forge
catch2 2.13.4 h4bd325d_0 conda-forge
certifi 2020.12.5 py39hf3d152e_1 conda-forge
cppad 20210000.5 h9c3ff4c_0 conda-forge
dbus 1.13.6 hfdff14a_1 conda-forge
eigen 3.3.9 h4bd325d_1 conda-forge
expat 2.2.10 h9c3ff4c_0 conda-forge
fontconfig 2.13.1 hba837de_1004 conda-forge
freeglut 3.2.1 h9c3ff4c_2 conda-forge
freetype 2.10.4 h0708190_1 conda-forge
gettext 0.19.8.1 h0b5b191_1005 conda-forge
glib 2.66.7 h9c3ff4c_1 conda-forge
glib-tools 2.66.7 h9c3ff4c_1 conda-forge
gsl 2.6 he838d99_2 conda-forge
gst-plugins-base 1.18.3 h04508c2_0 conda-forge
gstreamer 1.18.3 h3560a44_0 conda-forge
icu 68.1 h58526e2_0 conda-forge
icub-main 1.19.0 h3fd9d12_2 robotology-staging
idyntree 3.0.0 h3fd9d12_2 robotology-staging
ipopt 3.13.4 h7ede334_0 conda-forge
irrlicht 1.8.4 h4da5807_0 conda-forge
jpeg 9d h36c2ea0_0 conda-forge
krb5 1.17.2 h926e7f8_0 conda-forge
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_openblas conda-forge
libcblas 3.9.0 8_openblas conda-forge
libclang 11.1.0 default_ha53f305_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libevent 2.1.10 hcdb4288_3 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 9.3.0 hff62375_18 conda-forge
libgfortran5 9.3.0 hff62375_18 conda-forge
libglib 2.66.7 h3e27bee_1 conda-forge
libgomp 9.3.0 h2828fa1_18 conda-forge
libiconv 1.16 h516909a_0 conda-forge
libjpeg-turbo 2.0.5 h516909a_0 conda-forge
liblapack 3.9.0 8_openblas conda-forge
libllvm11 11.1.0 hf817b99_0 conda-forge
libopenblas 0.3.12 pthreads_h4812303_1 conda-forge
libosqp 0.6.2 h9c3ff4c_1 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libpq 13.1 hfd2b0eb_2 conda-forge
libstdcxx-ng 9.3.0 h6de172a_18 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libxcb 1.13 h7f98852_1003 conda-forge
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.9.10 h72842e0_3 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
manif 0.0.3 h9c3ff4c_0 conda-forge
metis 5.1.0 h58526e2_1006 conda-forge
mumps-include 5.2.1 ha770c72_10 conda-forge
mumps-seq 5.2.1 h47a8eb5_10 conda-forge
mysql-common 8.0.23 ha770c72_1 conda-forge
mysql-libs 8.0.23 h935591d_1 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
nspr 4.29 h9c3ff4c_1 conda-forge
nss 3.62 hb5efdd6_0 conda-forge
numpy 1.20.1 py39hdbf815f_0 conda-forge
openssl 1.1.1j h7f98852_0 conda-forge
osqp-eigen 0.6.2 h3fd9d12_2 robotology-staging
pcre 8.44 he1b5a44_0 conda-forge
pip 21.0.1 pyhd8ed1ab_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
python 3.9.2 hffdb5ce_0_cpython conda-forge
python_abi 3.9 1_cp39 conda-forge
qhull 2020.2 h4bd325d_0 conda-forge
qt 5.12.9 hda022c4_4 conda-forge
readline 8.0 he28a2e2_2 conda-forge
robot-testing-framework 2.0.1 h3fd9d12_2 robotology-staging
scotch 6.0.9 h0eec0ba_1 conda-forge
sdl 1.2.15 he1b5a44_1 conda-forge
setuptools 49.6.0 py39hf3d152e_3 conda-forge
sqlite 3.34.0 h74cdb3f_0 conda-forge
tinyxml 2.6.2 h4bd325d_2 conda-forge
tk 8.6.10 h21135ba_1 conda-forge
tzdata 2021a he74cb21_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xorg-fixesproto 5.0 h14c3975_1002 conda-forge
xorg-inputproto 2.3.2 h7f98852_1002 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libx11 1.6.12 h516909a_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxfixes 5.0.3 h516909a_1004 conda-forge
xorg-libxi 1.7.10 h516909a_0 conda-forge
xorg-xextproto 7.3.0 h7f98852_1002 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yarp 3.4.3 h3fd9d12_2 robotology-staging
ycm-cmake-modules 0.12.1 h3fd9d12_2 robotology-staging
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.9 ha95c52a_0 conda-forge
There is something strange here. The main reason behind the 3 seconds is `nlp_f`: for both Scenario 1 and 2 the average time for `nlp_f` is 13-14 us, while for Scenario 3 it is 57 us. This may indicate that something is different in the optimization level either of blf or of one of its dependencies used in the implementation of `nlp_f`. To check the first point, could you provide the `CMakeCache.txt` of Scenario 1 or 2 and of Scenario 3?

I reproduced Scenario 1 and Scenario 3 in GitHub Actions to have a bit of (imperfect) reproducibility in https://github.com/dic-iit/bipedal-locomotion-framework/runs/2043873756, and the results are:
```
33: Test command: /home/runner/work/bipedal-locomotion-framework/bipedal-locomotion-framework/build/bin/TimeVaryingDCMPlannerUnitTests
33: Test timeout computed to be: 1500
33: [StdImplementation::getParameterPrivate] Parameter named linear_solver not found.
33: [StdImplementation::getParameterPrivate] Parameter named use_external_dcm_reference not found.
33: [StdImplementation::getParameterPrivate] Parameter named gravity not found.
33:
33: ******************************************************************************
33: This program contains Ipopt, a library for large-scale nonlinear optimization.
33: Ipopt is released as open source code under the Eclipse Public License (EPL).
33: For more information visit http://projects.coin-or.org/Ipopt
33: ******************************************************************************
33:
33:     solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
33:      nlp_f  |   1.61ms ( 21.18us)   1.59ms ( 20.93us)        76
33:      nlp_g  |   2.48ms ( 32.66us)   2.41ms ( 31.78us)        76
33:   nlp_grad  | 158.00us (158.00us) 158.70us (158.70us)         1
33: nlp_grad_f  |   1.22ms ( 48.80us)   1.11ms ( 44.56us)        25
33: nlp_hess_l  |   2.02ms ( 96.33us)   1.91ms ( 91.10us)        21
33:  nlp_jac_g  |   2.14ms ( 79.11us)   1.99ms ( 73.69us)        27
33:      total  | 586.33ms (586.33ms) 564.48ms (564.48ms)         1
33: ===============================================================================
33: All tests passed (157 assertions in 1 test case)
33:
1/2 Test #33: TimeVaryingDCMPlannerUnitTests ............ Passed 0.69 sec
```
```
17: Test command: /home/runner/work/bipedal-locomotion-framework/bipedal-locomotion-framework/build/bin/TimeVaryingDCMPlannerUnitTests
17: Test timeout computed to be: 1500
17: [StdImplementation::getParameterPrivate] Parameter named linear_solver not found.
17: [StdImplementation::getParameterPrivate] Parameter named use_external_dcm_reference not found.
17: [StdImplementation::getParameterPrivate] Parameter named gravity not found.
17:
17: ******************************************************************************
17: This program contains Ipopt, a library for large-scale nonlinear optimization.
17: Ipopt is released as open source code under the Eclipse Public License (EPL).
17: For more information visit https://github.com/coin-or/Ipopt
17: ******************************************************************************
17:
17:     solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
17:      nlp_f  |   1.49ms ( 20.96us)   1.50ms ( 21.09us)        71
17:      nlp_g  |   2.39ms ( 33.66us)   2.32ms ( 32.66us)        71
17:   nlp_grad  | 134.00us (134.00us) 134.70us (134.70us)         1
17: nlp_grad_f  |   1.19ms ( 49.75us)   1.07ms ( 44.75us)        24
17: nlp_hess_l  |   1.68ms ( 83.90us)   1.68ms ( 84.12us)        20
17:  nlp_jac_g  |   2.13ms ( 82.00us)   2.02ms ( 77.52us)        26
17:      total  |   2.12 s (  2.12 s)   2.06 s (  2.06 s)         1
17: ===============================================================================
17: All tests passed (157 assertions in 1 test case)
17:
1/1 Test #17: TimeVaryingDCMPlannerUnitTests ... Passed 2.19 sec
```
In any case, the total time seems to be much bigger than the sum of the evaluations, so probably most of the time is spent by the linear solver (mumps) and by ipopt itself.
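To make that claim concrete, the CasADi stats printed above can be parsed and the summed callback time compared against the reported total. A minimal sketch: the parser assumes the exact column layout shown above, and the numbers are copied from the Scenario 3 run.

```python
import re

# Solver statistics as printed by CasADi for the Scenario 3 run above.
stats = """\
nlp_f      |   4.38ms ( 57.66us)   4.39ms ( 57.76us)        76
nlp_g      |   5.68ms ( 74.72us)   5.67ms ( 74.59us)        76
nlp_grad   | 273.00us (273.00us) 272.63us (272.63us)         1
nlp_grad_f |   2.62ms (104.96us)   2.62ms (104.96us)        25
nlp_hess_l |   4.25ms (202.38us)   4.26ms (202.67us)        21
nlp_jac_g  |   4.55ms (168.52us)   4.56ms (168.75us)        27
total      |   2.80 s (  2.80 s)   1.63 s (  1.63 s)         1
"""

UNITS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}

times = {}
for line in stats.splitlines():
    name, rest = line.split("|")
    # The first "<number><unit>" after the pipe is the t_proc column.
    m = re.search(r"([\d.]+)\s*(us|ms|s)", rest)
    times[name.strip()] = float(m.group(1)) * UNITS[m.group(2)]

callbacks = sum(t for name, t in times.items() if name != "total")
print(f"callbacks: {callbacks:.4f} s, total: {times['total']:.2f} s")
# The callbacks account for only a small fraction of the reported total,
# so most of the time is spent inside Ipopt/MUMPS themselves.
```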
My late-night guess is that mumps in conda and Debian is compiled with the -O1 optimization level, see https://github.com/conda-forge/mumps-feedstock/blob/master/recipe/Makefile.conda.SEQ#L60 and https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html . I did not follow everything in Coinbrew, but I suspect that there mumps is compiled with -O3 or similar.
> My late-night guess is that mumps in conda and Debian is compiled with the -O1 optimization level, see https://github.com/conda-forge/mumps-feedstock/blob/master/recipe/Makefile.conda.SEQ#L60 and https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html . I did not follow everything in Coinbrew, but I suspect that there mumps is compiled with -O3 or similar.
This is false. I looked a bit more into the logs of mumps, ipopt and casadi in the various scenarios, and these are the compilation options used (I still do not know the options used by coinbrew in @GiulioRomualdi's case). For Ubuntu I was not able to get the actual builds for Focal, but I looked at the last one for Debian Sid, as the packages did not seem to have changed a lot. Note that this is just a comparison of compilation options, while library versions may also play a role here.
| Scenario | mumps | ipopt | casadi | blf |
|---|---|---|---|---|
| 1 (apt) | `gfortran -g -O2 -fstack-protector-strong -fallow-argument-mismatch` (https://buildd.debian.org/status/fetch.php?pkg=mumps&arch=amd64&ver=5.3.5-1&stamp=1603953824&raw=0) | `-g -O2 -fstack-protector-strong` (https://buildd.debian.org/status/fetch.php?pkg=coinor-ipopt&arch=amd64&ver=3.11.9-2.2%2Bb4&stamp=1604026881&raw=0) | `-O3 -DNDEBUG` (built with CMake in Release mode) | `-O3 -DNDEBUG` (built with CMake in Release mode) |
| 2 (coinbrew) | ?? | ?? | `-O3 -DNDEBUG` (built with CMake in Release mode) | `-O3 -DNDEBUG` (built with CMake in Release mode) |
| 3 (conda) | `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe` (https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=239069&view=logs&j=b41866ee-27a7-5872-d10c-0bcb2e16c629&t=a9c7b177-1873-544a-be44-6094513b43d2) | `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe` (https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=281623&view=logs&jobId=656edd35-690f-5c53-9ba3-09c10d0bea97&j=656edd35-690f-5c53-9ba3-09c10d0bea97&t=e5c8ab1d-8ff9-5cae-b332-e15ae582ed2d) | `-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fPIC -O3 -DNDEBUG -MD -MT` (https://github.com/conda-forge/casadi-feedstock/pull/50/checks?check_run_id=2046616733) | `-O3 -DNDEBUG` (built with CMake in Release mode) if compilers are not installed and the system compilers are used (as in @GiulioRomualdi's case), or `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/traversaro/miniforge3/envs/blf/include -O3 -DNDEBUG -fPIC` in the case shown in the GitHub Actions CI. |
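Given how many flags differ between the rows above, a quick way to isolate candidate culprits is to diff the flag sets of two scenarios. A small illustration (the flag strings are copied from the table, with the `gfortran` compiler name dropped; the helper itself is just a sketch):

```python
# mumps build flags from the table above: Debian (Scenario 1) vs conda-forge (Scenario 3).
debian = "-g -O2 -fstack-protector-strong -fallow-argument-mismatch"
conda = ("-march=nocona -mtune=haswell -ftree-vectorize -fPIC "
         "-fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe")

a, b = set(debian.split()), set(conda.split())
print("only in Debian:", sorted(a - b))  # flags unique to the apt build
print("only in conda :", sorted(b - a))  # flags unique to the conda-forge build
print("shared        :", sorted(a & b))  # flags common to both, so unlikely culprits
```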
So, there are a lot of different options here, and it may be difficult to isolate the reason why the coinbrew builds are so much faster without more systematic tests. The few things to notice are:

* conda-forge uses `-O2`, which is not the default optimization level of CMake in Release mode (`-O3`)
* conda-forge passes `-march=nocona -mtune=haswell`, which may play a role
* when CMake projects are built in conda-forge with the conda-forge provided compilers, both `-O2` and `-O3` options are passed to the compiler, and I don't know which one the compiler actually uses

Main take-home message: always build with `make VERBOSE=1` or `ninja -v`, so we can save the actual compilation flags used.

> When CMake projects are built in conda-forge with conda-forge provided compilers, both `-O2` and `-O3` options are passed to the compiler, and I don't know the one that the compiler actually uses.
From the GCC docs (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options):

> If you use multiple -O options, with or without level numbers, the last such option is the one that is effective.
So I guess that actually `-O3` is used for Casadi and blf in Scenario 3.
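This "last -O wins" rule can be checked directly on a compile line saved from a `make VERBOSE=1` log. A tiny sketch with a made-up compile line that combines the conda-forge and CMake Release flags seen above:

```python
# Hypothetical compile line, in the style of the conda-forge + CMake Release
# builds discussed above: both -O2 and -O3 appear on the same command line.
line = ("gcc -march=nocona -mtune=haswell -O2 -ffunction-sections "
        "-O3 -DNDEBUG -c nlp_f.c")

# Collect the -O options in order; per the GCC docs, the last one is effective.
opt_flags = [flag for flag in line.split() if flag.startswith("-O")]
effective = opt_flags[-1]
print(opt_flags, "->", effective)
```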
Probably, if coinbrew is just passing `-O3` to compile all libraries, it may be worth quickly trying to do the same for the conda packages, and seeing if that is the dominant factor or if some of the other specific options play a big role.
By the way, if version numbers and optimization flags play such a big role in the final benchmark speed, it would be fun to choose them in an application-specific way via an optimization process in which the cost function is the time performance of the benchmark, and the optimization variables are the library versions and compilation options. Probably a basic grid search or Bayesian optimization could be used for that. That would be an interesting use of build-from-source package managers such as CMake-based superbuilds (like the robotology-superbuild), spack, or boa's experimental support for building packages from source (fyi @wolfv).
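As a toy sketch of that idea: a grid search over (optimization level, target architecture) pairs, where `run_benchmark` is a purely hypothetical stand-in (with made-up timings) for rebuilding the stack with those flags and timing the TimeVaryingDCMPlannerTest run:

```python
import itertools

def run_benchmark(opt_level: str, march: str) -> float:
    """Hypothetical stand-in: in reality this would rebuild mumps/ipopt/
    casadi/blf with the given flags and time the benchmark run."""
    fake_times = {  # made-up numbers, for illustration only
        ("-O2", "generic"): 2.80,
        ("-O2", "native"): 2.10,
        ("-O3", "generic"): 0.70,
        ("-O3", "native"): 0.55,
    }
    return fake_times[(opt_level, march)]

# Grid search: evaluate every configuration and keep the fastest one.
grid = itertools.product(["-O2", "-O3"], ["generic", "native"])
best = min(grid, key=lambda cfg: run_benchmark(*cfg))
print("fastest config:", best)
```

The same loop generalizes to more dimensions (library versions, linear solver choice), at the cost of one full rebuild-and-run per grid point, which is where Bayesian optimization would help.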
Related issue opened on https://github.com/robotology/robotology-superbuild/issues/659 based on yesterday's meeting. I also set up a repo for doing some more "systematic" testing via GitHub Actions (even if on a system quite different from the one on which we deploy the code for actual use on the robot): https://github.com/traversaro/ipopt-walking-benchmarks/pull/1 . As we agreed to close this issue, do you think it makes sense for me to use a new issue in blf to discuss the progress on https://github.com/traversaro/ipopt-walking-benchmarks/pull/1 ? @S-Dafarra @GiulioRomualdi @prashanthr05
I have nothing against it
Sure @traversaro, you can also close this issue and open a new one to track the progress on https://github.com/traversaro/ipopt-walking-benchmarks/
As we now have the two follow-ups:
I think we can close this issue.
In this issue, I want to perform an analysis of the time required to compute the DCM trajectory using the `TimeVaryingDCMPlanner` in the case of different configurations of my laptop.

Click me if you are interested in the specs of the laptop
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             2518.163
CPU max MHz:         3800,0000
CPU min MHz:         800,0000
BogoMIPS:            5599.85
Virtualization:      VT-x
L1d cache:           128 KiB
L1i cache:           128 KiB
L2 cache:            1 MiB
L3 cache:            6 MiB
NUMA node0 CPU(s):   0-7
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Not affected
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```
To benchmark the performance I will run the `TimeVaryingDCMPlannerTest` in different scenarios. All the tests have been performed with the following OS.

Scenario 1 (Normal use scenario)

In this scenario, I installed `ipopt` and `mumps` using apt (`sudo apt install coinor-libipopt-dev`), with `CasADi` compiled from source. The `TimeVaryingDCMPlannerTest` runs using `mumps` as the linear solver. These are the performances.

Scenario 2 (Advanced use scenario - `mumps`)

In this scenario, I installed `ipopt` and `mumps` from source. You can find the installation procedure here. The `TimeVaryingDCMPlannerTest` runs using `mumps` as the linear solver. These are the performances.

Scenario 2 (Advanced use scenario - `ma27`)

In this scenario, I installed `ipopt` and `ma27` from source. You can find the installation procedure here. The `TimeVaryingDCMPlannerTest` runs using `ma27` as the linear solver. These are the performances.

cc @traversaro @S-Dafarra @diegoferigo @paolo-viceconte @prashanthr05 @raffaello-camoriano @DanielePucci @Giulero