Could you help me figure out why my experiment results show rootba is slower than Ceres and Schur complement?

varyshare commented 3 years ago

I am reading your paper <Square Root Bundle Adjustment for Large-Scale Reconstruction, CVPR2021>. Your idea of using QR decomposition instead of the traditional Schur complement is awesome. I have run your source code `rootba` with the script below; the resulting plot is shown at the end of this issue. In the plot, QR-32 (single-precision QR in rootba) is slower than both Ceres and the Schur complement solver. I am puzzled by this. Could you help me figure out why?

#!/usr/bin/env bash

MY_EXAM_DATA_FOLDER="./rootba_testing_data_thread16"
declare -a my_exames=("qr32" "qr64" "sc64" "sc32" "ceres")
for i in "${my_exames[@]}"
do
    mkdir -p $MY_EXAM_DATA_FOLDER/$i
done

DATA_ROOT_PATH=/home/shaoping/readcode/rootba/data
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr32/ --num-threads 0 --no-debug --no-use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/qr64/ --num-threads 0 --no-debug --use-double --use-householder-marginalization --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc64/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT  --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/sc32/ --num-threads 0 --no-debug --solver-type SCHUR_COMPLEMENT --no-use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"
./bin/bal -C $MY_EXAM_DATA_FOLDER/ceres/ --num-threads 0 --no-debug --solver-type CERES --use-double  --input "$DATA_ROOT_PATH/rootba/bal/ladybug/problem-49-7776-pre.txt"

./scripts/plot-logs.py $MY_EXAM_DATA_FOLDER

NikolausDemmel commented 3 years ago

Thanks for your interest. I'll have a look shortly.

NikolausDemmel commented 3 years ago

In general, the relative performance of the different methods can, in our experience, depend a lot on the hardware and the number of CPU cores. One aspect is that, in our experiments, rootba seems to take better advantage of parallelization. Which method is faster also depends a lot on the actual problem. We are not claiming that rootba has better runtime in all situations.

That being said, I've tried your script with the current master and compiled with default settings on two machines, and this is what I get.

2013 Macbook (i7 with 8 virtual cores):

[image: macbook]

Ubuntu 18.04 Desktop (Xeon W-2133 with 12 virtual cores):

[image: linux]

I'm not sure why you see something qualitatively very different. What hardware are you running on?

Two thoughts:

Are you actually running multithreaded with multiple cores?
Are you using OpenBLAS? Maybe try exporting OPENBLAS_NUM_THREADS=1 in your shell before running rootba to make sure the multithreading in OpenBLAS is not interfering with the use of TBB in rootba. (See also the note about OpenBLAS in the readme, which has pointers explaining this in more detail.)
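A minimal sketch of the OpenBLAS check, assuming a POSIX-like shell and that it is run from the rootba checkout like the script above. The `INPUT` variable and its path are placeholders introduced here for illustration; the `bal` flags are copied from the qr32 line of that script.

```shell
# Limit OpenBLAS to a single thread so it cannot compete with TBB inside rootba.
# This has no effect if the binary is not linked against OpenBLAS.
export OPENBLAS_NUM_THREADS=1

# Placeholder path for illustration; point it at your own copy of the dataset.
INPUT="./data/rootba/bal/ladybug/problem-49-7776-pre.txt"

# Re-run one configuration from the script above in the same shell session.
if [ -x ./bin/bal ] && [ -f "$INPUT" ]; then
    ./bin/bal --num-threads 0 --no-debug --no-use-double \
        --use-householder-marginalization --input "$INPUT"
else
    echo "skipping run: ./bin/bal or $INPUT not found"
fi
```

The variable has to be exported in the same shell session (or at the top of the script) that launches `./bin/bal`; setting it in a different terminal has no effect on the run.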

varyshare commented 3 years ago

Thank you for helping me!!! I will try your suggestions. If we run in thread=1, will the experiment result be similar to multithreading?

NikolausDemmel commented 3 years ago

If we run in thread=1, will the experiment result be similar to multithreading?

No, I expect a different outcome with a different number of threads. Note that OPENBLAS_NUM_THREADS=1 is unrelated to the number of threads you configure for ceres and rootba. That is controlled with the --num-threads command line argument (or the corresponding config entry). In your script you are already setting it to 0, meaning it should use the number of hardware threads. Maybe the detection of the number of hardware threads is faulty. You can try passing an explicit value. For example, try --num-threads 8 if you have a CPU with 8 (virtual) cores.
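One way to pass an explicit value is to ask the operating system how many hardware threads it reports and hand that number to `--num-threads`. This is only a sketch: `nproc` and `getconf _NPROCESSORS_ONLN` are standard system tools, not part of rootba, and the `INPUT` path is a placeholder introduced here.

```shell
# Ask the OS for the number of online hardware threads.
# nproc is available on most Linux systems; getconf is the fallback (e.g. macOS).
if command -v nproc >/dev/null 2>&1; then
    NUM_THREADS="$(nproc)"
else
    NUM_THREADS="$(getconf _NPROCESSORS_ONLN)"
fi
echo "hardware threads reported by the OS: $NUM_THREADS"

# Placeholder path for illustration; point it at your own copy of the dataset.
INPUT="./data/rootba/bal/ladybug/problem-49-7776-pre.txt"

# Pass the value explicitly instead of relying on --num-threads 0.
if [ -x ./bin/bal ] && [ -f "$INPUT" ]; then
    ./bin/bal --num-threads "$NUM_THREADS" --no-debug --no-use-double \
        --use-householder-marginalization --input "$INPUT"
else
    echo "skipping run: ./bin/bal or $INPUT not found"
fi
```

If the printed number differs from what you expect for your CPU, that already hints at an environment issue (for example a container or CPU affinity limit) rather than a problem in rootba itself.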

varyshare commented 3 years ago

Hello,
I checked my environment. My processor is an Intel® Core™ i7-10700 CPU @ 2.90GHz × 16, i.e. it reports 16 logical cores (8 physical cores with hyper-threading). I have not installed OpenBLAS. After setting --num-threads 8, Ceres is still faster than rootba (both QR and SC). I will try another machine tomorrow. My guess is that TBB may not be able to use multiple threads on my machine. Thank you again.

NikolausDemmel commented 3 years ago

That's a bit strange. Yeah, maybe it is an issue with TBB. Your ceres runtime is similar to my Linux box, but the others are much slower, which is very surprising if it does indeed use multi-threading. Ceres does not use TBB in our configuration AFAIK, so it could make sense.

Maybe you can have a look yourself, but otherwise, you could post here your OS and maybe the full output of a fresh ./scripts/build-external.sh and (after deleting the build folder) ./scripts/build-rootba.sh plus the full command line output of your script. Maybe there is something in the logs that looks odd.
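A sketch of how that output could be collected into files for posting, assuming it is run from the rootba checkout. The `LOG_DIR` name is made up for this example. The script deliberately does not delete anything, so remove the build folder by hand before running it.

```shell
# Hypothetical folder for the collected logs.
LOG_DIR="./rootba_logs"
mkdir -p "$LOG_DIR"

# Record basic OS / kernel information.
uname -a > "$LOG_DIR/system.txt"

# Run the two build scripts named above and keep their full output
# (stdout and stderr) in one log file each.
for step in build-external build-rootba; do
    if [ -x "./scripts/$step.sh" ]; then
        "./scripts/$step.sh" > "$LOG_DIR/$step.log" 2>&1
    else
        echo "./scripts/$step.sh not found; run this from the rootba checkout" \
            > "$LOG_DIR/$step.log"
    fi
done

echo "logs written to $LOG_DIR"
```

The benchmark script's own output can be captured the same way by redirecting it with `> "$LOG_DIR/run.log" 2>&1`.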

If you are using Ubuntu, you can double check which BLAS is configured (just in case OpenBLAS got installed as a dependency of something else) with:

update-alternatives --get-selections | grep "blas\|lapack"

DengueTim commented 2 years ago

Hi, I've been playing with this on a Macbook Air M1 with 8 and 4 threads. Using the ./bin/bal executable produces the expected results. However, if I run the individual ./bin/bal_sc and ./bin/bal_qr executables, the accumulated total_time does not show as pronounced a difference: 0.312s, 0.522s and 0.344s for qr32, qr64 and sc64 respectively. Also, the total_time values are about 50 times smaller than the times from ./bin/bal. The error looks the same. Why the big difference in runtime?

NikolausDemmel commented 1 year ago

That's very curious. Are you sure you have built all the binaries with the same configuration? Beware that by default ROOTBA_DEVELOPER_MODE is ON, which means even if you have different build folders (e.g. for debug or release), all binaries end up in the same bin folder.

Can you try wiping the bin and build folders and recompiling all binaries? If you still see a difference, another thing to confirm is that you are using the same config in all cases. Could you please paste the full command line call and output for all 3 runs?
