NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

why load repeatedly when receiving in prims_ll128 #1213

Open echobinarybytes opened 8 months ago

echobinarybytes commented 8 months ago

Hi, I am confused about this code segment: https://github.com/NVIDIA/nccl/blob/master/src/device/prims_ll128.h#L250

In the do-while loop, we already load the data into the vr[] array while checking the flag, right?

However, the for loop that follows loads the same data again. Can this reload be removed? If not, why not?

/************************ Wait first recv ********************/
    if (RECV) {
      uint64_t* ptr = recvPtr(0)+ll128Offset;
      uint64_t flag = recvFlag(0);
      bool needReload;
      int spins = 0;
      do {
        needReload = false;
        #pragma unroll
        for (int u=0; u<ELEMS_PER_THREAD; u+=2) {
          load128(ptr+u*WARP_SIZE, vr[u], vr[u+1]);
          needReload |= flagThread && (vr[u+1] != flag);
        }
        needReload &= (0 == checkAbort(spins, 0, 0));
      } while (__any_sync(WARP_MASK, needReload));

      #pragma unroll
      for (int u=0; u<ELEMS_PER_THREAD; u+=2)
        load128(ptr+u*WARP_SIZE, vr[u], vr[u+1]);
    }
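
To make the question concrete, here is a minimal host-side C++ sketch of the same two-phase structure. It is only a model, not NCCL code: `kWordsPerLine`, `loadLine`, and `waitThenReload` are hypothetical names, `std::atomic` loads stand in for `load128`, a single thread stands in for the warp, and a `maxSpins` limit stands in for `checkAbort`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: one line is 16 x 64-bit words (128 bytes),
// and the last word plays the role of the flag.
constexpr std::size_t kWordsPerLine = 16;
constexpr std::size_t kFlagIndex = kWordsPerLine - 1;

// Copies every word of the line into out (stand-in for the load128 loop).
inline void loadLine(const std::atomic<std::uint64_t>* line, std::uint64_t* out) {
  for (std::size_t i = 0; i < kWordsPerLine; ++i)
    out[i] = line[i].load(std::memory_order_acquire);
}

// Returns true once the flag matched and out holds the reloaded line.
// Returns false if maxSpins polls passed without the flag matching.
inline bool waitThenReload(const std::atomic<std::uint64_t>* line,
                           std::uint64_t flag, std::uint64_t* out, int maxSpins) {
  assert(line != nullptr && out != nullptr);
  bool needReload = true;
  for (int spins = 0; needReload && spins < maxSpins; ++spins) {
    loadLine(line, out);                     // phase 1: data and flag read together
    needReload = (out[kFlagIndex] != flag);  // keep polling until the flag matches
  }
  if (needReload) return false;              // gave up, like an abort
  loadLine(line, out);                       // phase 2: the load this issue asks about
  return true;
}
```

In this model, deleting the phase 2 call leaves `out` holding whatever the last polling iteration read, which is exactly the variant I am asking about. The sketch does not show whether that variant is safe on real hardware.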

shanleo2024 commented 6 months ago

mark