cq_dequeue and move_head_cq

ZaidQureshi / bam

BSD 2-Clause "Simplified" License


cq_dequeue and move_head_cq #7

Closed lineagech closed 1 year ago

lineagech commented 2 years ago

I ran into an issue. After several threads enqueued commands, the threads with larger pos (the second parameter of cq_dequeue) entered cq_dequeue and called move_head_cq first. They could not set cq->head_mark to UNLOCKED, because the threads with smaller pos had not been scheduled yet and so had not set their cq->head_mark entries to LOCKED. As a result, the return value of move_head_cq (head_move_count) was always 0. Is this a known issue? Thank you!

uint32_t move_head_cq(nvm_queue_t* q, uint32_t cur_head, nvm_queue_t* sq) {
    uint32_t count = 0;
    (void) sq;

    bool pass = true;
    //uint32_t old_head;
    while (pass) {
        uint32_t loc = (cur_head+count++)&q->qs_minus_1;
        pass = (q->head_mark[loc].val.exchange(UNLOCKED, simt::memory_order_relaxed)) == LOCKED;

ZaidQureshi commented 2 years ago

I think the case you mentioned should be OK as long as the thread with the smaller pos eventually makes its cq->head_mark value UNLOCKED. If that thread never gets there, then that's a different problem.
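To make the scenario concrete, here is a minimal host-side sketch of the scan loop quoted above. It is not the bam implementation: `ToyQueue` and `scan_head` are hypothetical names, `std::atomic` stands in for the `simt` atomics used on the GPU, and the return value (the number of consecutive LOCKED marks found) is an assumption, since the snippet in the original post is cut off before the end of the loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the flag values used in the quoted snippet.
enum : uint32_t { UNLOCKED = 0, LOCKED = 1 };

// Hypothetical, simplified model of the completion queue's head_mark array.
struct ToyQueue {
    std::vector<std::atomic<uint32_t>> head_mark;
    uint32_t qs_minus_1;  // queue size minus one, used as a wrap-around mask

    explicit ToyQueue(uint32_t size) : head_mark(size), qs_minus_1(size - 1) {
        // The mask trick only works for power-of-two sizes.
        assert(size != 0 && (size & (size - 1)) == 0);
        for (auto& m : head_mark) m.store(UNLOCKED, std::memory_order_relaxed);
    }
};

// Walks forward from cur_head, clearing marks, and stops at the first entry
// that was not LOCKED. Assumption: the result is the count of LOCKED entries.
uint32_t scan_head(ToyQueue& q, uint32_t cur_head) {
    uint32_t count = 0;
    for (;;) {
        uint32_t loc = (cur_head + count) & q.qs_minus_1;
        bool was_locked =
            q.head_mark[loc].exchange(UNLOCKED, std::memory_order_relaxed) == LOCKED;
        if (!was_locked) break;  // gap at loc: the head cannot move past it
        ++count;
    }
    return count;
}
```

With this model, marking only slots 1 and 2 as LOCKED and scanning from head 0 yields 0, which matches the reported head_move_count. Once slot 0 is also marked LOCKED, a later scan from head 0 yields 3. In other words, the scan does not lose the later entries; it only makes progress once the thread that owns the head slot has run.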

lineagech commented 2 years ago

> I think the case you mentioned should be OK as long as the thread with the smaller pos eventually makes its cq->head_mark value UNLOCKED. If that thread never gets there, then that's a different problem.

Hmm... From the trace I got, I suspect the threads with smaller pos never get there. Is there a solution to this? One workaround I can think of is to have the thread that acquires cq->head_lock also process the previous entries (those with smaller pos). But I am not sure whether this would slow down the whole system.

ZaidQureshi commented 2 years ago

I am not sure that's a good idea. If the thread with the smaller pos never gets there, then that is the real problem. Can you share which CUDA toolkit version you are using, the program and parameters that trigger the issue, and any trace that points to the exact problem?

msharmavikram commented 1 year ago

I believe this is no longer an issue, based on the last few conversations we had. Closing this.