fixstars / libSGM

Stereo Semi Global Matching by cuda
Apache License 2.0

Papers on which code is based and parameter optimization #6

Open steve3nto opened 7 years ago

steve3nto commented 7 years ago

I have managed to run the code on some stereo pair images from the Middlebury dataset and the result is quite ok.

But when I try to use it on other rectified stereo pairs from my own cameras or from other datasets the result is much different (and much worse).

Could you please provide references to the papers on which the code is based? That would help in understanding the code.

Can you also give some hints on how to optimize it to run on different stereo datasets/images? I saw right now only one parameter can be passed, the max_disparity, and it can be only 64 or 128. Why is it so?

Are there any other parameters that can be set inside the code? I am thinking about the classic parameters of SGM: window_size, cost penalties p1 and p2, number of search lines for semi-global-matching (4,8,16...) et cetera!

Thanks for your help!
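
For readers unfamiliar with the parameters named above, here is a minimal sketch of how the penalties P1 and P2 enter the classic SGM cost aggregation along a single path (one image row, left to right). The function name `aggregate_row`, the cost layout, and the parameter names are assumptions made for illustration; this is not libSGM's implementation or API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Aggregate matching costs along one left-to-right path of a single image row.
// cost is laid out as cost[x * num_disp + d]; p1 and p2 are the small and
// large smoothness penalties (hypothetical names, not libSGM parameters).
std::vector<uint32_t> aggregate_row(const std::vector<uint32_t>& cost,
                                    int width, int num_disp,
                                    uint32_t p1, uint32_t p2) {
    assert(width > 0 && num_disp > 0);
    assert(cost.size() == static_cast<std::size_t>(width) * num_disp);
    std::vector<uint32_t> agg(cost.size());

    // The first pixel on the path has no predecessor: aggregated cost = raw cost.
    std::copy(cost.begin(), cost.begin() + num_disp, agg.begin());

    for (int x = 1; x < width; ++x) {
        const uint32_t* prev = &agg[(x - 1) * num_disp];
        const uint32_t prev_min = *std::min_element(prev, prev + num_disp);
        for (int d = 0; d < num_disp; ++d) {
            // Same disparity costs nothing extra; any larger jump costs p2.
            uint32_t best = std::min(prev[d], prev_min + p2);
            // A change of one disparity level costs p1.
            if (d > 0) best = std::min(best, prev[d - 1] + p1);
            if (d + 1 < num_disp) best = std::min(best, prev[d + 1] + p1);
            // Subtracting prev_min keeps the values bounded along the path.
            agg[x * num_disp + d] = cost[x * num_disp + d] + best - prev_min;
        }
    }
    return agg;
}
```

A full SGM implementation repeats this along several path directions (the 4, 8 or 16 search lines) and sums the results before picking the disparity with the lowest total cost.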

ghost commented 7 years ago

Thank you for testing!

Could you please provide references to the papers on which the code is based?

This code is mainly based on http://www.6d-vision.com/9-literatur/hirschmueller_cvpr05 , but the matching cost is census-based.
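
To make "census-based" concrete, here is a rough sketch of a census transform and the Hamming-distance matching cost that usually goes with it. The 5x5 window, the bit ordering, and the function names are assumptions for illustration only; the thread does not say which window libSGM uses.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 5x5 census transform: each pixel becomes a 24-bit signature whose bits say
// whether a neighbour is darker than the centre pixel. The window size is an
// assumption for illustration; it is not taken from libSGM.
// Pixels closer than 2 to the image border are left as 0.
std::vector<uint32_t> census5x5(const std::vector<uint8_t>& img,
                                int width, int height) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<uint32_t> out(img.size(), 0);
    for (int y = 2; y < height - 2; ++y) {
        for (int x = 2; x < width - 2; ++x) {
            const uint8_t centre = img[y * width + x];
            uint32_t bits = 0;
            for (int dy = -2; dy <= 2; ++dy) {
                for (int dx = -2; dx <= 2; ++dx) {
                    if (dx == 0 && dy == 0) continue;  // skip the centre itself
                    const uint8_t neighbour = img[(y + dy) * width + (x + dx)];
                    bits = (bits << 1) | (neighbour < centre ? 1u : 0u);
                }
            }
            out[y * width + x] = bits;
        }
    }
    return out;
}

// Matching cost between two census signatures: the number of differing bits.
inline uint32_t hamming_cost(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>(std::bitset<32>(a ^ b).count());
}
```

Because the signature only records brightness ordering, the cost tolerates gain and exposure differences between the two cameras better than a raw intensity difference would.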

the max_disparity, and it can be only 64 or 128. Why is it so?

This is because 64 and 128 are well suited to the CUDA optimization (both are multiples of the warp size, which is 32 threads).

classic parameters of SGM: window_size, cost penalties p1 and p2, number of search lines for semi-global-matching (4,8,16...)

steve3nto commented 7 years ago

Ok thanks for the reply!

Also I have a question regarding performance. Did you check how fast the code runs and if using CUDA gives a good speedup? Cause right now if I time the clock cycles around ssgm.execute(...) it seems to take a lot of time. Like 400ms for a single stereo pair.

ghost commented 7 years ago

Hi,

Cause right now if I time the clock cycles around ssgm.execute(...) it seems to take a lot of time. Like 400ms for a single stereo pair.

The performance depends on your setup (GPU, image size, disparity range, ...). Could you tell me your setup?

I ran the image sample in my environment and the computation time was 26.7 ms (CPU: Core i7 920, GPU: GeForce GTX 680). Also, the results on the Jetson X1 are here: http://proc-cpuinfo.fixstars.com/2015/12/realizing-self-driving-cars-with-html-2/ (Table 1)

Perhaps it is because of the CUDA API: the first CUDA API call is often slow. Did you check that?

steve3nto commented 7 years ago

Interesting article, thanks!

Anyway I have quite a powerful machine here at the university

So it should take less than 100ms per frame for sure, hopefully less than 20ms!

Am I doing something wrong in the way I time the execution? Or maybe there is some problem with Cuda 8 or with the Pascal architecture?

This is how I was testing it now:

int start_s=clock();
ssgm.execute(left.data, right.data, (void**)&output.data);
int stop_s=clock();
std::cout << "time [ms]: " << (stop_s-start_s)/double(CLOCKS_PER_SEC)*1000 << std::endl;
std::cout << "FPS: " << 1000/((stop_s-start_s)/double(CLOCKS_PER_SEC)*1000) << std::endl;

and this is on a single image so it is a very bad estimate of the performance for sure, but still I think it shouldn't be more than 100ms.

Do you know how I can check the execution speed a bit better?

ghost commented 7 years ago

I tried this code. The first call to execute is a dummy (warm-up) CUDA call and is not timed; the loop then averages the time over 10 executions. Try this:

index 33c4812..b681c63 100644
--- sample/image/stereosgm_image.cpp
+++ sample/image/stereosgm_image.cpp
@@ -17,6 +17,7 @@ limitations under the License.
 #include <stdlib.h>
 #include <iostream>

+#include <time.h>
 #include <opencv2/core/core.hpp>
 #include <opencv2/highgui/highgui.hpp>
 #include <opencv2/imgproc/imgproc.hpp>
@@ -57,6 +58,18 @@ int main(int argc, char* argv[]) {
        cv::Mat output(cv::Size(left.cols, left.rows), CV_8UC1);

        ssgm.execute(left.data, right.data, (void**)&output.data);
+       int n = 10;
+       double d = 0;
+       clock_t c1, c2;
+       for (int i = 0; i < n; i++)
+       {
+               c1 = clock();
+               ssgm.execute(left.data, right.data, (void**)&output.data);
+               c2 = clock();
+               d += (double)(c2 - c1) * 1000 / CLOCKS_PER_SEC;
+       }
+       d /= (double)n;
+       std::cerr << "Elapsed: " << d << "[ms]" << std::endl;

        // show image
        cv::imshow("image", output * 256 / disp_size);

FYI: http://stackoverflow.com/questions/41098237/is-the-warmup-code-necessary-when-measuring-cuda-kernel-running-time
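
As an alternative to `clock()`, which reports processor time rather than elapsed time, the same warm-up-then-average idea can be sketched with `std::chrono::steady_clock`. The helper `mean_time_ms` below is hypothetical and not part of libSGM. It also assumes the timed call returns only after its result is available; an asynchronous GPU call would need explicit synchronization before the clock is stopped.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the mean wall-clock time in milliseconds of `work` over `runs` calls,
// after `warmup` untimed calls that absorb one-time start-up cost.
// mean_time_ms is a hypothetical helper, not part of libSGM.
double mean_time_ms(const std::function<void()>& work, int warmup, int runs) {
    assert(runs > 0 && warmup >= 0);

    // Untimed warm-up calls (e.g. the first, slow CUDA API call).
    for (int i = 0; i < warmup; ++i) work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / runs;
}
```

In the sample it could be called as `mean_time_ms([&] { ssgm.execute(left.data, right.data, (void**)&output.data); }, 1, 10)`, which mirrors the one dummy call and ten timed calls in the patch above.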

steve3nto commented 7 years ago

Ok thanks! I did not know the startup phase of the CUDA API could make such a big difference.

Now I get an estimate of less than 11ms on a stereo pair from Kitti of size 1242x376 pixels.

Performance is indeed much faster than CPU only code!

mhkabir commented 7 years ago

@ykitta-fixstars Did you guys do any testing with Tegra K1?

ghost commented 7 years ago

@mhkabir I haven't tested on Tegra K1 recently.

Diksha-Moolchandani commented 3 years ago

@steve3nto What error do you get on the KITTI dataset using this algorithm?
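
For anyone wanting to measure this themselves, one common way to express disparity error is the percentage of outlier pixels. The sketch below assumes a KITTI 2015 style rule (error above 3 px and above 5% of the ground-truth value) and assumes ground-truth values of 0 mean "no measurement". The function name, thresholds as defaults, and the float-vector input format are all illustrative assumptions, and no result for libSGM is implied.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Percentage of valid ground-truth pixels whose estimated disparity is an outlier.
// A pixel counts as an outlier when its error exceeds both abs_thresh pixels
// and rel_thresh times the ground-truth value (assumed KITTI 2015 style rule).
// Ground-truth values <= 0 are treated as "no measurement" and skipped.
double bad_pixel_percent(const std::vector<float>& est,
                         const std::vector<float>& gt,
                         float abs_thresh = 3.0f, float rel_thresh = 0.05f) {
    assert(est.size() == gt.size());
    std::size_t valid = 0, bad = 0;
    for (std::size_t i = 0; i < gt.size(); ++i) {
        if (gt[i] <= 0.0f) continue;  // no ground truth at this pixel
        ++valid;
        const float err = std::fabs(est[i] - gt[i]);
        if (err > abs_thresh && err > rel_thresh * gt[i]) ++bad;
    }
    return valid == 0 ? 0.0
                      : 100.0 * static_cast<double>(bad) / static_cast<double>(valid);
}
```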