Open funmonty opened 5 years ago
Hi, @funmonty
Thank you for your question.
Winner-Takes-All is one of the most computationally intensive parts in SGM. So, we made maximum efforts for optimization, resulting in rather complicated code. The same applies to Cost-Aggregation.
For example, in Winner-Takes-All we assign 1 warp to 1 "line", a loop [0 to MAX_DISP-1] at specific image position (x, y). By doing so, reduction(taking minimum cost) is done in a warp using effective warp-level primitives, without communication inter different warps. This leads 4 disparities per 1 thread (if MAX_DISP=128), corresponding to REDUCTION_PER_THREAD.
Regards
@atakagi-fixstars Thank you for your explanation on the WTA process! Your implementation is really awesome and I have difficulties in understanding the highly optimized kernels! Can you please also give some intuition about the cost aggregation process. For example, in the horizontal path aggregation kernel:
Thank you in advance!
thanks for explain, that is really helpful to understand the code;
Hi, @funmonty
Thank you for your question.
Winner-Takes-All is one of the most computationally intensive parts in SGM. So, we made maximum efforts for optimization, resulting in rather complicated code. The same applies to Cost-Aggregation.
For example, in Winner-Takes-All we assign 1 warp to 1 "line", a loop [0 to MAX_DISP-1] at specific image position (x, y). By doing so, reduction(taking minimum cost) is done in a warp using effective warp-level primitives, without communication inter different warps. This leads 4 disparities per 1 thread (if MAX_DISP=128), corresponding to REDUCTION_PER_THREAD.
Regards
Hi,
I was trying to understand the implementation of SGM in the code. Although the cuda stuffs are doing their job perfectly, I have the following question : Why the indexing in some kernels pretty complex? For eg, the _check_consistencykernel has neat and understandable indexing where the kernel configuration is fully dependent on image size. Whereas _winner_takes_allkernel does not have indexing with image rows, columns but they use warp-id and lane-ids. Is this done for some kind of optimizations or is it over-engineered?
I can map the theoretical concept of SGM with the implementation but the indexing part is a bit hard to understand especially wherever warp-level primitives are used.
Cheers.