dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/
Other
24.46k stars 884 forks source link

Eval command does not returns, transaction getting stuck #3232

Closed adiholden closed 1 day ago

adiholden commented 3 days ago

How to reproduce:

Run the following commands in redis cli:

  1. script load " local pattern = KEYS[1]\n local oldest = tonumber(ARGV[1])\n local cursor = ARGV[2]\n local count = tonumber(ARGV[3])\n local result = redis.call(\"SCAN\", cursor, \"MATCH\", pattern, \"COUNT\", count)\n cursor = result[1]\n local keys = result[2]\n for i, key in ipairs(keys) do\n local to_delete = {}\n for _, timestamp in ipairs(redis.call('hkeys', key)) do\n if tonumber(timestamp) <= oldest then\n table.insert(to_delete, timestamp)\n end\n end\n if #to_delete > 0 then\n redis.call('hdel', key, unpack(to_delete))\n end\n end\n return cursor\n"
  2. debug populate 1000
  3. evalsha 591f96f821898bddced228c3fe18d0192691bd05 1 key: 1 0 1000

step 3 does not return causing transaction getting stuck

adiholden commented 3 days ago

The problem is with running scan command under script. scan is not transactional and uses EngineShardSet Await to run the scanop on a shard.

If you run dragonfly with the following flags ./dragonfly --alsologtostderr --dbfilename= --proactor_threads=6 --vmodule=generic_family=1 you can see that the scanop is getting deadlocked when trying to run the scanop on the shard the script was scheduled on (the shard of the key provided for the script)

adiholden commented 3 days ago

We can reproduce the bug also with this simple script eval "return redis.call('scan', '0', 'match', 'key:', 'count', '1000')" 1 key:

romange commented 3 days ago

Should we just refuse running SCAN in the script (return error) ?

romange commented 3 days ago

Another option is to support patterns that have specific hashslot inside like {foo}. In that case if we identify that scan is single sharded we can continue running it in the script under lock_on_hashtag mode

dranikpg commented 3 days ago

Should we just refuse running SCAN in the script (return error) ?

I assume this is a very common usecase