Indexing of CUDA kernels

fixstars / libSGM

Stereo Semi Global Matching by cuda

Apache License 2.0

615 stars 188 forks source link

Indexing of CUDA kernels #26

Open funmonty opened 5 years ago

funmonty commented 5 years ago

Hi,

I was trying to understand the implementation of SGM in the code. Although the cuda stuffs are doing their job perfectly, I have the following question : Why the indexing in some kernels pretty complex? For eg, the _check_consistencykernel has neat and understandable indexing where the kernel configuration is fully dependent on image size. Whereas _winner_takes_allkernel does not have indexing with image rows, columns but they use warp-id and lane-ids. Is this done for some kind of optimizations or is it over-engineered?

I can map the theoretical concept of SGM with the implementation but the indexing part is a bit hard to understand especially wherever warp-level primitives are used.

Cheers.

atakagi-fixstars commented 5 years ago

Hi, @funmonty

Thank you for your question.

Winner-Takes-All is one of the most computationally intensive parts in SGM. So, we made maximum efforts for optimization, resulting in rather complicated code. The same applies to Cost-Aggregation.

For example, in Winner-Takes-All we assign 1 warp to 1 "line", a loop [0 to MAX_DISP-1] at specific image position (x, y). By doing so, reduction(taking minimum cost) is done in a warp using effective warp-level primitives, without communication inter different warps. This leads 4 disparities per 1 thread (if MAX_DISP=128), corresponding to REDUCTION_PER_THREAD.

Regards

xiayeqingfeng commented 2 years ago

@atakagi-fixstars Thank you for your explanation on the WTA process! Your implementation is really awesome and I have difficulties in understanding the highly optimized kernels! Can you please also give some intuition about the cost aggregation process. For example, in the horizontal path aggregation kernel:

It seems a block is taking responsibility for the aggregation of multiple rows of the image. If so, what job does a warp and a single thread do?
what is a subgroup and what does the DP_BLOCKS_PER_THREAD mean?
the lane_id compute here is really confusing to me since normally we directly use threadIdx.x % WARP_SIZE to get the lane id.

Thank you in advance!

ynma-hanvo commented 2 years ago

thanks for explain, that is really helpful to understand the code;

Hi, @funmonty

Thank you for your question.

Winner-Takes-All is one of the most computationally intensive parts in SGM. So, we made maximum efforts for optimization, resulting in rather complicated code. The same applies to Cost-Aggregation.

For example, in Winner-Takes-All we assign 1 warp to 1 "line", a loop [0 to MAX_DISP-1] at specific image position (x, y). By doing so, reduction(taking minimum cost) is done in a warp using effective warp-level primitives, without communication inter different warps. This leads 4 disparities per 1 thread (if MAX_DISP=128), corresponding to REDUCTION_PER_THREAD.

Regards