dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/
Other
25.8k stars 948 forks source link

BZPop bug #3276

Closed adiholden closed 3 months ago

adiholden commented 4 months ago

E20240706 06:12:17.647733 2255 zset_family.cc:1343] BUG: Callback didn't run! BLOCK//claimed=1 coord_state=6 local_res=0/ E20240706 06:12:11.367316 2255 zset_family.cc:1343] BUG: Callback didn't run! BLOCK//claimed=1 coord_state=6 local_res=0/

romange commented 4 months ago

We should not put private links in our public repo (for security considerations).

BorysTheDev commented 4 months ago

@adiholden @romange I have found only one possible issue. We have 2 steps: 1) Check key presence 2) trans->execute() During 2nd step we lock db_slice::cbmu and can preempt so if the key is removed we will execute a transaction with a deprecated iterator. This bug should be fixed by #3229

BorysTheDev commented 4 months ago

I haven't found using of cluster, replication but the save operation happened on July 10, 2024 2:59:31 PM (GMT)

BorysTheDev commented 3 months ago

I've found one more bug with KEY_MOVE error and have fixed it https://github.com/dragonflydb/dragonfly/pull/3334

But it looks like we have at least one more

BorysTheDev commented 3 months ago

I haven't found the real reason for this bug, but it looks like the execution order was next:

1) run trans->WaitOnWatch(...) 2) run trans->CancelBlocking(...) 2.1) Transaction::coordinatorstate |= COORD_CANCELLED; 2.2) Transaction::localresult = status; // where status != OK 3) Transaction::localresult somehow should be set OK 4) trans->WaitOnWatch(..) returns Transaction::localresult // that is already OK

The problem is that I don't see how the 3rd step can happen. Also, it looks like the 2nd step was done from the connection closing code:

  owner->RegisterBreakHook([res, this](uint32_t) {
    if (res->transaction)
      res->transaction->CancelBlocking(nullptr);
    this->server_family().BreakOnShutdown();
  });
romange commented 3 months ago

ok, how can you prove that 2 has happened, and it was due to RegisterBreakHook?

BorysTheDev commented 3 months ago

coord_state=6 means that the operation was canceled, I think about RegisterBreakHook because we don't have a replication and a cluster so only one option is left

dranikpg commented 3 months ago

Based on the logs... isn't the key an empty string?

BorysTheDev commented 3 months ago

Based on the logs... isn't the key an empty string?

I took it into account that is why I have written about the 3rd step

dranikpg commented 3 months ago

Fixed with #3371