Open echobinarybytes opened 8 months ago
Hi, I am confused about this code segment: https://github.com/NVIDIA/nccl/blob/master/src/device/prims_ll128.h#L250
In the do-while loop, we already load the data into the vr[] array while checking the flag, right?
However, the for loop that follows loads the same data again. Can this reload be removed? If not, why not?

    /************************ Wait first recv ********************/
    if (RECV) {
      uint64_t* ptr = recvPtr(0)+ll128Offset;
      uint64_t flag = recvFlag(0);
      bool needReload;
      int spins = 0;
      do {
        needReload = false;
        #pragma unroll
        for (int u=0; u<ELEMS_PER_THREAD; u+=2) {
          load128(ptr+u*WARP_SIZE, vr[u], vr[u+1]);
          needReload |= flagThread && (vr[u+1] != flag);
        }
        needReload &= (0 == checkAbort(spins, 0, 0));
      } while (__any_sync(WARP_MASK, needReload));

      #pragma unroll
      for (int u=0; u<ELEMS_PER_THREAD; u+=2)
        load128(ptr+u*WARP_SIZE, vr[u], vr[u+1]);
    }
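To make the question concrete, here is a minimal single-threaded C++ sketch of the same poll-then-reload pattern. It is only a host-side model under stated assumptions: `FakeRecvBuffer`, `pollUntilFlag`, `reloadIsRedundantInModel`, and `kElemsPerThread` are hypothetical names, `load128` is a plain stand-in rather than the NCCL device function, and the warp-wide `__any_sync`, `checkAbort`, and any GPU memory-ordering or compiler effects are deliberately left out. In this simplified model the values in `vr[]` after the loop already match what a second load returns, which is why the extra loads look redundant to me; the real kernel may have reasons that the model does not capture.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ELEMS_PER_THREAD (not NCCL's real value).
constexpr int kElemsPerThread = 4;

// Simulated receive buffer for a single thread: (data, flag) word pairs.
// The "producer" fills it on the poll numbered pollsUntilReady (must be >= 1).
struct FakeRecvBuffer {
  std::vector<uint64_t> words;
  uint64_t flag;
  int pollsUntilReady;
  int polls;

  FakeRecvBuffer(uint64_t f, int ready)
      : words(kElemsPerThread, 0), flag(f), pollsUntilReady(ready), polls(0) {}

  // Advance the model by one poll; write payload and flags when due.
  void tick() {
    if (++polls == pollsUntilReady) {
      for (int u = 0; u < kElemsPerThread; u += 2) {
        words[u] = 0xD000 + u;  // payload word
        words[u + 1] = flag;    // flag word
      }
    }
  }
};

// Stand-in for load128: copies two adjacent 64-bit words.
inline void load128(const uint64_t* p, uint64_t& v0, uint64_t& v1) {
  v0 = p[0];
  v1 = p[1];
}

// Mirrors the do-while loop: load everything, retry while any flag mismatches.
// Returns the number of loop iterations executed.
int pollUntilFlag(FakeRecvBuffer& buf, uint64_t* vr) {
  bool needReload;
  int iters = 0;
  do {
    buf.tick();  // producer progress happens between polls in this model
    needReload = false;
    for (int u = 0; u < kElemsPerThread; u += 2) {
      load128(buf.words.data() + u, vr[u], vr[u + 1]);
      needReload |= (vr[u + 1] != buf.flag);
    }
    ++iters;
  } while (needReload);
  return iters;
}

// Runs the poll loop, then performs the extra loads into a second array and
// reports whether they returned exactly what the loop had already loaded.
bool reloadIsRedundantInModel(int pollsUntilReady) {
  FakeRecvBuffer buf(/*f=*/7, pollsUntilReady);
  uint64_t vr[kElemsPerThread] = {0};
  uint64_t again[kElemsPerThread] = {0};
  int iters = pollUntilFlag(buf, vr);
  assert(iters == pollsUntilReady);  // loop exits on the pass where data lands
  for (int u = 0; u < kElemsPerThread; u += 2)
    load128(buf.words.data() + u, again[u], again[u + 1]);
  for (int u = 0; u < kElemsPerThread; ++u)
    if (vr[u] != again[u]) return false;
  return true;
}
```

Calling `reloadIsRedundantInModel(3)` exercises the case where the first two polls see stale flags and the third sees valid data.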