Closed: trinayan closed this issue 3 years ago.
That function is part of the CUDA programming API's warp match functions: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-match-functions. Not sure why you can't resolve it.
Line 25 returns, for each thread, how many threads in the warp are accessing the same cache line. Say a warp with 32 active threads accesses one cache line with the first 20 threads and another cache line with the remaining 12 threads; after the instruction at line 25, 20 threads have cnt=20 and 12 threads have cnt=12. If you add cnt to uniq_lines without scaling (1.0f/cnt), you get [20 threads × 20 + 12 threads × 12] = 544, which does not tell you much. But if you add [20 threads × 1/20 + 12 threads × 1/12] = 2, that is exactly the number of cache lines accessed by the warp.
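The counting logic described above can be sketched as a small CUDA device function. This is only a sketch of the idea, not the actual code from the paper: the function name `count_mem_divergence`, the global `uniq_lines` counter, and the 128-byte cache-line size are assumptions for illustration, and `__match_any_sync` requires compute capability 7.0 or higher.

```cuda
#include <cstdint>

// Running count of unique cache lines touched, summed over all warps.
// (In real instrumentation this would be passed in via a pointer argument.)
__device__ float uniq_lines = 0.0f;

// Sketch: called once per memory access with the accessed address.
extern "C" __device__ __noinline__ void count_mem_divergence(uint64_t addr) {
    unsigned mask = __activemask();       // threads performing this access
    uint64_t cache_addr = addr >> 7;      // 128-byte cache-line address
    // cnt = number of active threads hitting the same line as this thread
    int cnt = __popc(__match_any_sync(mask, cache_addr));
    // Each group of cnt threads contributes cnt * (1/cnt) = 1 in total,
    // so uniq_lines grows by the number of distinct lines per warp access.
    atomicAdd(&uniq_lines, 1.0f / cnt);
}
```

In the 20/12 example above, the 20 threads each add 1/20 and the 12 threads each add 1/12, so the warp contributes exactly 2 to `uniq_lines`.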
Thanks for the comment. It turned out I was using the default makefiles, which compile for SM_35, where this function is not supported. The divergence calculation makes perfect sense now, so thanks for the detailed explanation.
match_any_sync is only supported on compute capability >= 7.0.
Here are the compute capabilities for different GPUs.
For devices with sm < sm_70, we need an alternative to match_any_sync().
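One possible alternative on pre-Volta devices is to emulate the match with a warp-uniform loop over the lanes using `__shfl_sync`, which is available on all architectures supported by CUDA 9+. This is a hedged sketch (not code from the NVBit repository); the helper name `match_any_emul` is made up for illustration.

```cuda
#include <cstdint>

// Software emulation of __match_any_sync for sm < 70.
// Returns the mask of lanes in `mask` whose `val` equals this lane's `val`.
// The loop bound and the test on `mask` are warp-uniform, so every thread
// in `mask` reaches each __shfl_sync call together, as required.
__device__ unsigned match_any_emul(unsigned mask, uint64_t val) {
    unsigned peers = 0;
    for (int lane = 0; lane < 32; ++lane) {
        if (!(mask & (1u << lane))) continue;       // warp-uniform branch
        uint64_t other = __shfl_sync(mask, val, lane);  // broadcast lane's value
        if (other == val) peers |= 1u << lane;      // record matching lanes
    }
    return peers;
}
```

On cc >= 7.0 you would simply use `__match_any_sync(mask, val)` instead; the emulation costs up to 32 shuffle iterations per call.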
Hi, did you solve the issue? I am not able to get the desired output. I suppose I am not passing the correct arguments to the instrumentation function. Can you please let me know what arguments you passed?
Hi,
I was trying to implement the code for the memory divergence example shown in the paper in Listing 8. I encounter two issues.
First, the "match_any_sync" function used here, `int cnt = popc(match_any_sync(mask, cache_addr))`, doesn't seem to be a valid function in the nvbit library. I am not sure how to resolve this or what to use in its place.
Second, regarding line 29 in the example, `atomicAdd(&uniq_lines, 1.0f / cnt);`: based on my understanding of memory divergence, I feel it should be atomicAdd(&uniq_lines, cnt) instead. Not sure if I am correct.
Thanks for the help.