ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License
3.77k stars 539 forks source link

__syncwarp(mask) question #3668

Closed jinz2014 closed 1 week ago

jinz2014 commented 1 week ago

What is the HIP version of __syncwarp(mask) ?

jinz2014 commented 1 week ago
__device__
loc_ht& ht_get_atomic(loc_ht* thread_ht, cstr_type kmer_key, uint32_t max_size){
    unsigned hash_val = MurmurHashAligned2(kmer_key, max_size);
    unsigned orig_hash = hash_val;

    while(true){
        int prev = atomicCAS(&thread_ht[hash_val].key.length, EMPTY, kmer_key.length);
        int mask = __match_any_sync(__activemask(), (unsigned long long)&thread_ht[hash_val]); // all the threads in the warp which have same address

        if(prev == EMPTY){
            thread_ht[hash_val].key.start_ptr = kmer_key.start_ptr;
            thread_ht[hash_val].val = {.hi_q_exts = {0}, .low_q_exts = {0}, .ext = 0, .count = 0};
        }
        __syncwarp(mask);
        if(prev != EMPTY && thread_ht[hash_val].key == kmer_key){
            //printf("key found, returning\n");// keep this for debugging
            return thread_ht[hash_val];
        }else if (prev == EMPTY){
            return thread_ht[hash_val];
        }
        hash_val = (hash_val +1 ) %max_size;//hash_val = (hash_val + 1) & (HT_SIZE -1);
        if(hash_val == orig_hash){ // loop till you reach the same starting positions and then return error
            printf("*****end reached, hashtable full*****\n"); // for debugging
            printf("*****end reached, hashtable full*****\n");
            printf("*****end reached, hashtable full*****\n");
        }
    }
}

@lindsey

jinz2014 commented 1 week ago

The original CUDA program is https://github.com/leannmlindsey/gpu_local_ht

ppanchad-amd commented 1 week ago

Hi @jinz2014. Internal ticket has been created to assist with your question. Thanks!

jamesxu2 commented 1 week ago

Hi @jinz2014, __syncwarp() is one CUDA function that HIP doesn't provide a direct equivalent for. You have a couple of options:

  1. A threadfence_block() which incurs a performance hit (As a memory fence for all threads in the block)
  2. A syncthreads() which is more expensive than (1), as it implements a barrier and memory fence https://rocm.docs.amd.com/projects/HIP/en/latest/reference/cpp_language_extensions.html#memory-fence-instructions
  3. Cooperative groups, which will not be viable as a drop in replacement but require some additional rewriting (though, is most similar, semantics-wise ) https://github.com/ROCm/HIP/issues/2798 / https://rocm.docs.amd.com/projects/HIP/en/latest/reference/cpp_language_extensions.html#synchronization-functions