Open varyshare opened 3 years ago
Thanks for your interest. I'll have a look shortly.
In general, in our experience the relative performance of the different methods can depend a lot on the hardware and the number of CPU cores. One aspect is that, in our experiments, rootba seems to take better advantage of parallelization. Also, which method is faster depends a lot on the actual problem. We are not claiming that rootba has better runtime in all situations.
That being said, I've tried your script with the current master and compiled with default settings on two machines, and this is what I get.
2013 Macbook (i7 with 8 virtual cores):
Ubuntu 18.04 Desktop (Xeon W-2133 with 12 virtual cores):
I'm not sure why you see something qualitatively very different. What hardware are you running on?
Two thoughts:
Set OPENBLAS_NUM_THREADS=1 in your shell before running rootba to make sure the multithreading in OpenBLAS is not interfering with the use of TBB in rootba. (See also the note about OpenBLAS in the readme, which has pointers explaining this in more detail.)
Thank you for helping me! I will try it according to your suggestion. If we run in thread=1, will the experiment result be similar to multithreading?
If we run in thread=1, will the experiment result be similar to multithreading?
No, I expect a different outcome with a different number of threads. Note that OPENBLAS_NUM_THREADS=1 is unrelated to the number of threads you configure for ceres and rootba. That is controlled with the --num-threads command line argument (or the corresponding config entry). But in your script you are already setting it to 0, meaning it should use the number of hardware threads. Maybe the detection of the number of hardware threads is faulty. You can try passing an explicit value. For example, try --num-threads 8 if you have a CPU with 8 (virtual) cores.
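As a minimal sketch of the two settings discussed above, assuming a POSIX shell: this pins OpenBLAS to one thread and looks up the hardware thread count so it can be passed explicitly. The `getconf _NPROCESSORS_ONLN` query and the `HW_THREADS` variable name are assumptions for illustration and are not part of rootba.

```shell
# Keep OpenBLAS single-threaded so it does not compete with TBB.
export OPENBLAS_NUM_THREADS=1

# Ask the OS for the number of online logical processors
# (available on most Linux and macOS systems).
HW_THREADS="$(getconf _NPROCESSORS_ONLN)"

echo "hardware threads: ${HW_THREADS}"
echo "explicit flag to try: --num-threads ${HW_THREADS}"
```

The printed value can then be used in place of 0 for --num-threads, which sidesteps any faulty auto-detection.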
Hello,
I checked my running environment. My processor is an Intel® Core™ i7-10700 CPU @ 2.90GHz × 16, so it has 16 (virtual) cores.
I also don't have OpenBLAS installed. After setting --num-threads 8, ceres is still faster than rootba (both QR and SC). I will try another machine tomorrow. My guess is that TBB may not be able to use multiple threads on my machine.
Thank you again.
That's a bit strange. Yeah, maybe it is an issue with TBB. Your ceres runtime is similar to my Linux box, but the others are much slower, which is very surprising if it does indeed use multi-threading. Ceres does not use TBB in our configuration AFAIK, so it could make sense.
Maybe you can have a look yourself, but otherwise, you could post here your OS and maybe the full output of a fresh ./scripts/build-external.sh and (after deleting the build folder) ./scripts/build-rootba.sh plus the full command line output of your script. Maybe there is something in the logs that looks odd.
If you are using Ubuntu, you can double check which BLAS is configured (just in case OpenBLAS got installed as a dependency of something) with:
update-alternatives --get-selections | grep "blas\|lapack"
Hi, I've been playing with this on a Macbook Air M1 with 8 and 4 threads. Using the ./bin/bal executable produces the expected results. However, if I run the individual ./bin/bal_sc and ./bin/bal_qr executables, the accumulated total_time doesn't show as pronounced a difference: 0.312s, 0.522s and 0.344s for qr32, qr64 and sc64 respectively. Also, the total_time values are about 50 times smaller than the times from ./bin/bal. The error looks the same. Why the big difference in runtime?
That's very curious. Are you sure you have built all the binaries with the same configuration? Beware that by default ROOTBA_DEVELOPER_MODE is ON, which means even if you have different build folders (e.g. for debug or release), all binaries end up in the same bin folder.
Can you try wiping the bin and build folders and recompiling all binaries? If you still see a difference, another thing to confirm is that you are using the same config in all cases. Could you please paste the full command line call and output for all 3 runs?
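A minimal sketch of the clean rebuild suggested above, assuming the bin and build folders sit directly under the repository root. The helper name `clean_rebuild` is hypothetical; only ./scripts/build-rootba.sh comes from this thread.

```shell
# Hypothetical helper: wipe stale binaries and build artifacts, then rebuild.
clean_rebuild() {
  repo_root="${1:-.}"
  (
    cd "$repo_root" || exit 1
    # Remove both folders so no binary from an older configuration survives.
    rm -rf bin build
    # Rebuild everything with the current configuration.
    ./scripts/build-rootba.sh
  )
}

# Usage (from the repository root):
# clean_rebuild .
```

Running it in a subshell keeps the caller's working directory unchanged, and the function returns the exit status of the build script.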
I am reading your paper "Square Root Bundle Adjustment for Large-Scale Reconstruction" (CVPR 2021). Your idea of using QR decomposition instead of the traditional Schur complement is awesome. I have run your source code, rootba. The resulting image is shown at the end of the issue. From the picture, we can see that QR-32 (single-precision QR in rootba) is slower than ceres and the Schur complement. I am puzzled by this. Could you help me figure it out?