Closed GiulioRomualdi closed 3 years ago
Note the different number of iterations between the two MUMPS versions. I wonder what could be the cause.
Note that in Scenario 1 the number of evaluations is different from Scenario 2, so I guess that the ipopt version is playing a role here. When you compile ipopt from source you get 3.13.4, while apt ships quite an old ipopt, 3.11.9 (https://repology.org/project/ipopt/versions), so I wonder if this is relevant. Just as a recap, I think these are the versions used now (@GiulioRomualdi correct me if I am wrong):
| Scenario | IPOPT version | MUMPS version |
|---|---|---|
| Scenario 1 | 3.11.9 | 5.2.1 |
| Scenario 2 | 3.13.4 | 4.10.0 |
Comment https://github.com/dic-iit/bipedal-locomotion-framework/issues/219#issuecomment-788897904 updated with the correct MUMPS version
I think it would be interesting to test this with the conda-forge provided binaries, since there we have a recent ipopt together with a recent mumps. My WSL2 installation is currently a bit tricky to use due to disk usage limits, but if you want to try, the installation steps should be just:
Install miniforge3 following the instructions in https://github.com/robotology/robotology-superbuild/blob/master/doc/conda-forge.md#install-a-conda-distribution . Note that this should install everything only under `~/Miniforge3`, so you do not risk spoiling your laptop: at any moment you can just run `rm -rf ~/Miniforge3` to remove all of it.

Once you have installed miniforge3, you can create an environment for these tests and activate it, installing all the required dependencies:
```
conda create -n blf-perf-test
conda activate blf-perf-test
conda install -c robotology-staging idyntree yarp eigen qhull casadi cppad ipopt manif
```
Then create a new build of blf and configure and build it with CMake as usual, but in a terminal in which you have activated the `blf-perf-test` environment.

Run the `TimeVaryingDCMPlannerTest` test as usual.
This should give us a Scenario 3, at least for IPOPT + MUMPS:

| Scenario | IPOPT version | MUMPS version |
|---|---|---|
| Scenario 3 | 3.13.4 | 5.2.1 |
As you may also have read on Twitter, manif is now available on conda-forge, so I edited the previous comment to reflect that.
Scenario 3 - Conda

I ran the scenario in this docker image: https://hub.docker.com/r/condaforge/miniforge3
```
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit https://github.com/coin-or/Ipopt
******************************************************************************

    solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
     nlp_f  |   4.38ms ( 57.66us)   4.39ms ( 57.76us)        76
     nlp_g  |   5.68ms ( 74.72us)   5.67ms ( 74.59us)        76
  nlp_grad  | 273.00us (273.00us) 272.63us (272.63us)         1
nlp_grad_f  |   2.62ms (104.96us)   2.62ms (104.96us)        25
nlp_hess_l  |   4.25ms (202.38us)   4.26ms (202.67us)        21
 nlp_jac_g  |   4.55ms (168.52us)   4.56ms (168.75us)        27
     total  |   2.80 s (  2.80 s)   1.63 s (  1.63 s)         1
```
3 seconds??
I know :sob:
@GiulioRomualdi can you please post the output of `mamba list`?
# packages in environment at /opt/conda/envs/blf-perf-test:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
ace 7.0.0 h9c3ff4c_1 conda-forge
ampl-mp 3.1.0 h616b090_1004 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2020.12.5 ha878542_0 conda-forge
casadi 3.5.5 py39h19f53c4_3 conda-forge
catch2 2.13.4 h4bd325d_0 conda-forge
certifi 2020.12.5 py39hf3d152e_1 conda-forge
cppad 20210000.5 h9c3ff4c_0 conda-forge
dbus 1.13.6 hfdff14a_1 conda-forge
eigen 3.3.9 h4bd325d_1 conda-forge
expat 2.2.10 h9c3ff4c_0 conda-forge
fontconfig 2.13.1 hba837de_1004 conda-forge
freeglut 3.2.1 h9c3ff4c_2 conda-forge
freetype 2.10.4 h0708190_1 conda-forge
gettext 0.19.8.1 h0b5b191_1005 conda-forge
glib 2.66.7 h9c3ff4c_1 conda-forge
glib-tools 2.66.7 h9c3ff4c_1 conda-forge
gsl 2.6 he838d99_2 conda-forge
gst-plugins-base 1.18.3 h04508c2_0 conda-forge
gstreamer 1.18.3 h3560a44_0 conda-forge
icu 68.1 h58526e2_0 conda-forge
icub-main 1.19.0 h3fd9d12_2 robotology-staging
idyntree 3.0.0 h3fd9d12_2 robotology-staging
ipopt 3.13.4 h7ede334_0 conda-forge
irrlicht 1.8.4 h4da5807_0 conda-forge
jpeg 9d h36c2ea0_0 conda-forge
krb5 1.17.2 h926e7f8_0 conda-forge
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_openblas conda-forge
libcblas 3.9.0 8_openblas conda-forge
libclang 11.1.0 default_ha53f305_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libevent 2.1.10 hcdb4288_3 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 9.3.0 hff62375_18 conda-forge
libgfortran5 9.3.0 hff62375_18 conda-forge
libglib 2.66.7 h3e27bee_1 conda-forge
libgomp 9.3.0 h2828fa1_18 conda-forge
libiconv 1.16 h516909a_0 conda-forge
libjpeg-turbo 2.0.5 h516909a_0 conda-forge
liblapack 3.9.0 8_openblas conda-forge
libllvm11 11.1.0 hf817b99_0 conda-forge
libopenblas 0.3.12 pthreads_h4812303_1 conda-forge
libosqp 0.6.2 h9c3ff4c_1 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libpq 13.1 hfd2b0eb_2 conda-forge
libstdcxx-ng 9.3.0 h6de172a_18 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libxcb 1.13 h7f98852_1003 conda-forge
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.9.10 h72842e0_3 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
manif 0.0.3 h9c3ff4c_0 conda-forge
metis 5.1.0 h58526e2_1006 conda-forge
mumps-include 5.2.1 ha770c72_10 conda-forge
mumps-seq 5.2.1 h47a8eb5_10 conda-forge
mysql-common 8.0.23 ha770c72_1 conda-forge
mysql-libs 8.0.23 h935591d_1 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
nspr 4.29 h9c3ff4c_1 conda-forge
nss 3.62 hb5efdd6_0 conda-forge
numpy 1.20.1 py39hdbf815f_0 conda-forge
openssl 1.1.1j h7f98852_0 conda-forge
osqp-eigen 0.6.2 h3fd9d12_2 robotology-staging
pcre 8.44 he1b5a44_0 conda-forge
pip 21.0.1 pyhd8ed1ab_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
python 3.9.2 hffdb5ce_0_cpython conda-forge
python_abi 3.9 1_cp39 conda-forge
qhull 2020.2 h4bd325d_0 conda-forge
qt 5.12.9 hda022c4_4 conda-forge
readline 8.0 he28a2e2_2 conda-forge
robot-testing-framework 2.0.1 h3fd9d12_2 robotology-staging
scotch 6.0.9 h0eec0ba_1 conda-forge
sdl 1.2.15 he1b5a44_1 conda-forge
setuptools 49.6.0 py39hf3d152e_3 conda-forge
sqlite 3.34.0 h74cdb3f_0 conda-forge
tinyxml 2.6.2 h4bd325d_2 conda-forge
tk 8.6.10 h21135ba_1 conda-forge
tzdata 2021a he74cb21_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xorg-fixesproto 5.0 h14c3975_1002 conda-forge
xorg-inputproto 2.3.2 h7f98852_1002 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libx11 1.6.12 h516909a_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxfixes 5.0.3 h516909a_1004 conda-forge
xorg-libxi 1.7.10 h516909a_0 conda-forge
xorg-xextproto 7.3.0 h7f98852_1002 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yarp 3.4.3 h3fd9d12_2 robotology-staging
ycm-cmake-modules 0.12.1 h3fd9d12_2 robotology-staging
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.9 ha95c52a_0 conda-forge
There is something strange here. The main reason behind the 3 seconds is `nlp_f`: for both Scenario 1 and 2 the average time for `nlp_f` is 13-14 us, while for Scenario 3 it is 57 us. This may indicate that something is different in the optimization level either of blf or of one of its dependencies used in the implementation of `nlp_f`. To check the first point, could you provide the `CMakeCache.txt` of Scenario 1 or 2 and of Scenario 3?

I reproduced Scenario 1 and Scenario 3 in GitHub Actions to have a bit of (imperfect) reproducibility in https://github.com/dic-iit/bipedal-locomotion-framework/runs/2043873756, and the results are:
```
33: Test command: /home/runner/work/bipedal-locomotion-framework/bipedal-locomotion-framework/build/bin/TimeVaryingDCMPlannerUnitTests
33: Test timeout computed to be: 1500
33: [StdImplementation::getParameterPrivate] Parameter named linear_solver not found.
33: [StdImplementation::getParameterPrivate] Parameter named use_external_dcm_reference not found.
33: [StdImplementation::getParameterPrivate] Parameter named gravity not found.
33:
33: ******************************************************************************
33: This program contains Ipopt, a library for large-scale nonlinear optimization.
33: Ipopt is released as open source code under the Eclipse Public License (EPL).
33: For more information visit http://projects.coin-or.org/Ipopt
33: ******************************************************************************
33:
33:     solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
33:      nlp_f  |   1.61ms ( 21.18us)   1.59ms ( 20.93us)        76
33:      nlp_g  |   2.48ms ( 32.66us)   2.41ms ( 31.78us)        76
33:   nlp_grad  | 158.00us (158.00us) 158.70us (158.70us)         1
33: nlp_grad_f  |   1.22ms ( 48.80us)   1.11ms ( 44.56us)        25
33: nlp_hess_l  |   2.02ms ( 96.33us)   1.91ms ( 91.10us)        21
33:  nlp_jac_g  |   2.14ms ( 79.11us)   1.99ms ( 73.69us)        27
33:      total  | 586.33ms (586.33ms) 564.48ms (564.48ms)         1
33: ===============================================================================
33: All tests passed (157 assertions in 1 test case)
33:
1/2 Test #33: TimeVaryingDCMPlannerUnitTests ............ Passed 0.69 sec
```
```
17: Test command: /home/runner/work/bipedal-locomotion-framework/bipedal-locomotion-framework/build/bin/TimeVaryingDCMPlannerUnitTests
17: Test timeout computed to be: 1500
17: [StdImplementation::getParameterPrivate] Parameter named linear_solver not found.
17: [StdImplementation::getParameterPrivate] Parameter named use_external_dcm_reference not found.
17: [StdImplementation::getParameterPrivate] Parameter named gravity not found.
17:
17: ******************************************************************************
17: This program contains Ipopt, a library for large-scale nonlinear optimization.
17: Ipopt is released as open source code under the Eclipse Public License (EPL).
17: For more information visit https://github.com/coin-or/Ipopt
17: ******************************************************************************
17:
17:     solver  :   t_proc      (avg)   t_wall      (avg)    n_eval
17:      nlp_f  |   1.49ms ( 20.96us)   1.50ms ( 21.09us)        71
17:      nlp_g  |   2.39ms ( 33.66us)   2.32ms ( 32.66us)        71
17:   nlp_grad  | 134.00us (134.00us) 134.70us (134.70us)         1
17: nlp_grad_f  |   1.19ms ( 49.75us)   1.07ms ( 44.75us)        24
17: nlp_hess_l  |   1.68ms ( 83.90us)   1.68ms ( 84.12us)        20
17:  nlp_jac_g  |   2.13ms ( 82.00us)   2.02ms ( 77.52us)        26
17:      total  |   2.12 s (  2.12 s)   2.06 s (  2.06 s)         1
17: ===============================================================================
17: All tests passed (157 assertions in 1 test case)
17:
1/1 Test #17: TimeVaryingDCMPlannerUnitTests ... Passed 2.19 sec
```
In any case, the total time seems to be much bigger than the sum of the evaluations, so probably most of the time is spent by the linear solver (mumps) and by ipopt itself.
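To make that claim concrete, the CasADi stats printed above can be parsed and the summed callback time compared against the reported total. A minimal sketch: the parser assumes the exact column layout shown above, and the numbers are copied from the Scenario 3 run.

```python
import re

# Solver statistics as printed by CasADi for the Scenario 3 run above.
stats = """\
nlp_f      |   4.38ms ( 57.66us)   4.39ms ( 57.76us)        76
nlp_g      |   5.68ms ( 74.72us)   5.67ms ( 74.59us)        76
nlp_grad   | 273.00us (273.00us) 272.63us (272.63us)         1
nlp_grad_f |   2.62ms (104.96us)   2.62ms (104.96us)        25
nlp_hess_l |   4.25ms (202.38us)   4.26ms (202.67us)        21
nlp_jac_g  |   4.55ms (168.52us)   4.56ms (168.75us)        27
total      |   2.80 s (  2.80 s)   1.63 s (  1.63 s)         1
"""

UNITS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}

times = {}
for line in stats.splitlines():
    name, rest = line.split("|")
    # The first "<number><unit>" after the pipe is the t_proc column.
    m = re.search(r"([\d.]+)\s*(us|ms|s)", rest)
    times[name.strip()] = float(m.group(1)) * UNITS[m.group(2)]

callbacks = sum(t for name, t in times.items() if name != "total")
print(f"callbacks: {callbacks:.4f} s, total: {times['total']:.2f} s")
# The callbacks account for only a small fraction of the reported total,
# so most of the time is spent inside Ipopt/MUMPS themselves.
```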
My late-night guess is that mumps in conda and Debian is compiled with the -O1 optimization level, see https://github.com/conda-forge/mumps-feedstock/blob/master/recipe/Makefile.conda.SEQ#L60 and https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html . I did not follow everything in Coinbrew, but I suspect that there mumps is compiled with -O3 or similar.
> My late-night guess is that mumps in conda and Debian is compiled with the -O1 optimization level, see https://github.com/conda-forge/mumps-feedstock/blob/master/recipe/Makefile.conda.SEQ#L60 and https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html . I did not follow everything in Coinbrew, but I suspect that there mumps is compiled with -O3 or similar.
This is false. I looked a bit more into the logs of mumps, ipopt and casadi in the various scenarios, and these are the compilation options used (I still do not know the options used by coinbrew in @GiulioRomualdi's case). For Ubuntu I was not able to get the actual builds for Focal, but I looked at the last one for Debian Sid, as the packages did not seem to have changed a lot. Note that this is just a comparison of compilation options, while library versions may also play a role here.
| Scenario | mumps | ipopt | casadi | blf |
|---|---|---|---|---|
| 1 (apt) | `gfortran -g -O2 -fstack-protector-strong -fallow-argument-mismatch` (https://buildd.debian.org/status/fetch.php?pkg=mumps&arch=amd64&ver=5.3.5-1&stamp=1603953824&raw=0) | `-g -O2 -fstack-protector-strong` (https://buildd.debian.org/status/fetch.php?pkg=coinor-ipopt&arch=amd64&ver=3.11.9-2.2%2Bb4&stamp=1604026881&raw=0) | `-O3 -DNDEBUG` (built with CMake in Release mode) | `-O3 -DNDEBUG` (built with CMake in Release mode) |
| 2 (coinbrew) | ?? | ?? | `-O3 -DNDEBUG` (built with CMake in Release mode) | `-O3 -DNDEBUG` (built with CMake in Release mode) |
| 3 (conda) | `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe` (https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=239069&view=logs&j=b41866ee-27a7-5872-d10c-0bcb2e16c629&t=a9c7b177-1873-544a-be44-6094513b43d2) | `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe` (https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=281623&view=logs&jobId=656edd35-690f-5c53-9ba3-09c10d0bea97&j=656edd35-690f-5c53-9ba3-09c10d0bea97&t=e5c8ab1d-8ff9-5cae-b332-e15ae582ed2d) | `-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fPIC -O3 -DNDEBUG -MD -MT` (https://github.com/conda-forge/casadi-feedstock/pull/50/checks?check_run_id=2046616733) | `-O3 -DNDEBUG` (built with CMake in Release mode) if compilers are not installed and the system compilers are used (as in @GiulioRomualdi's case), or `-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/traversaro/miniforge3/envs/blf/include -O3 -DNDEBUG -fPIC` in the case shown in the GitHub Actions CI. |
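Given how many flags differ between the rows above, a quick way to isolate candidate culprits is to diff the flag sets of two scenarios. A small illustration (the flag strings are copied from the table, with the `gfortran` compiler name dropped; the helper itself is just a sketch):

```python
# mumps build flags from the table above: Debian (Scenario 1) vs conda-forge (Scenario 3).
debian = "-g -O2 -fstack-protector-strong -fallow-argument-mismatch"
conda = ("-march=nocona -mtune=haswell -ftree-vectorize -fPIC "
         "-fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe")

a, b = set(debian.split()), set(conda.split())
print("only in Debian:", sorted(a - b))  # flags unique to the apt build
print("only in conda :", sorted(b - a))  # flags unique to the conda-forge build
print("shared        :", sorted(a & b))  # flags common to both, so unlikely culprits
```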
So, there are a lot of different options here, and it may be difficult to isolate the reason why the coinbrew builds are so much faster without more systematic tests. The few things to notice are:

* conda-forge uses `-O2`, which is not the default optimization level of CMake in Release mode (`-O3`)
* conda-forge passes `-march=nocona -mtune=haswell`, which may play a role
* when CMake projects are built in conda-forge with the conda-forge provided compilers, both `-O2` and `-O3` options are passed to the compiler, and I don't know which one the compiler actually uses

Main take-home message: always build with `make VERBOSE=1` or `ninja -v`, so we can save the actual compilation flags used.

> When CMake projects are built in conda-forge with conda-forge provided compilers, both `-O2` and `-O3` options are passed to the compiler, and I don't know the one that the compiler actually uses.
From the GCC docs (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options):

> If you use multiple -O options, with or without level numbers, the last such option is the one that is effective.
So I guess that actually `-O3` is used for Casadi and blf in Scenario 3.
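This "last -O wins" rule can be checked directly on a compile line saved from a `make VERBOSE=1` log. A tiny sketch with a made-up compile line that combines the conda-forge and CMake Release flags seen above:

```python
# Hypothetical compile line, in the style of the conda-forge + CMake Release
# builds discussed above: both -O2 and -O3 appear on the same command line.
line = ("gcc -march=nocona -mtune=haswell -O2 -ffunction-sections "
        "-O3 -DNDEBUG -c nlp_f.c")

# Collect the -O options in order; per the GCC docs, the last one is effective.
opt_flags = [flag for flag in line.split() if flag.startswith("-O")]
effective = opt_flags[-1]
print(opt_flags, "->", effective)
```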
Probably, if coinbrew is just passing `-O3` to compile all libraries, it may be worth quickly trying to do the same for the conda packages, and seeing if that is the dominant factor or if some of the other specific options play a big role.
By the way, if version numbers and optimization flags play such a big role in the final benchmark speed, it would be fun to choose them in an application-specific way via an optimization process in which the cost function is the time performance of the benchmark, and the optimization variables are the library versions and compilation options. Probably a basic grid search or Bayesian optimization could be used for that. That would be an interesting use of build-from-source package managers such as CMake-based superbuilds (like the robotology-superbuild), spack, or boa's experimental support for building packages from source (fyi @wolfv).
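As a toy sketch of that idea: a grid search over (optimization level, target architecture) pairs, where `run_benchmark` is a purely hypothetical stand-in (with made-up timings) for rebuilding the stack with those flags and timing the TimeVaryingDCMPlannerTest run:

```python
import itertools

def run_benchmark(opt_level: str, march: str) -> float:
    """Hypothetical stand-in: in reality this would rebuild mumps/ipopt/
    casadi/blf with the given flags and time the benchmark run."""
    fake_times = {  # made-up numbers, for illustration only
        ("-O2", "generic"): 2.80,
        ("-O2", "native"): 2.10,
        ("-O3", "generic"): 0.70,
        ("-O3", "native"): 0.55,
    }
    return fake_times[(opt_level, march)]

# Grid search: evaluate every configuration and keep the fastest one.
grid = itertools.product(["-O2", "-O3"], ["generic", "native"])
best = min(grid, key=lambda cfg: run_benchmark(*cfg))
print("fastest config:", best)
```

The same loop generalizes to more dimensions (library versions, linear solver choice), at the cost of one full rebuild-and-run per grid point, which is where Bayesian optimization would help.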
Related issue opened on https://github.com/robotology/robotology-superbuild/issues/659 based on yesterday's meeting. I also set up a repo for doing some more "systematic" testing via GitHub Actions (even if on a system quite different from the one on which we deploy the code for actual use on the robot): https://github.com/traversaro/ipopt-walking-benchmarks/pull/1 . As we agreed to close this issue, do you think it makes sense for me to use a new issue in blf to discuss the progress on https://github.com/traversaro/ipopt-walking-benchmarks/pull/1 ? @S-Dafarra @GiulioRomualdi @prashanthr05
I have nothing against it
Sure @traversaro, you can also close this issue and open a new one to track the progress on https://github.com/traversaro/ipopt-walking-benchmarks/
As we now have the two follow-ups:
I think we can close this issue.
In this issue, I want to perform an analysis of the time required to compute the DCM trajectory using the `TimeVaryingDCMPlanner` in the case of different configurations of my laptop.

Click me if you are interested in the specs of the laptop
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             2518.163
CPU max MHz:         3800,0000
CPU min MHz:         800,0000
BogoMIPS:            5599.85
Virtualization:      VT-x
L1d cache:           128 KiB
L1i cache:           128 KiB
L2 cache:            1 MiB
L3 cache:            6 MiB
NUMA node0 CPU(s):   0-7
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Not affected
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```
To benchmark the performance I will run the `TimeVaryingDCMPlannerTest` in different scenarios. All the tests have been performed with the following OS.

Scenario 1 (Normal use scenario)

In this scenario, I installed `ipopt` and `mumps` using apt (`sudo apt install coinor-libipopt-dev`), with `CasADi` compiled from source. The `TimeVaryingDCMPlannerTest` runs using `mumps` as the linear solver. These are the performances.

Scenario 2 (Advanced use scenario - `mumps`)

In this scenario, I installed `ipopt` and `mumps` from source. You can find the installation procedure here. The `TimeVaryingDCMPlannerTest` runs using `mumps` as the linear solver. These are the performances.

Scenario 2 (Advanced use scenario - `ma27`)

In this scenario, I installed `ipopt` and `ma27` from source. You can find the installation procedure here. The `TimeVaryingDCMPlannerTest` runs using `ma27` as the linear solver. These are the performances.

cc @traversaro @S-Dafarra @diegoferigo @paolo-viceconte @prashanthr05 @raffaello-camoriano @DanielePucci @Giulero