RediSearch / RSCoordinator

RediSearch coordinator

[BUG] Version 2.2 returns double results #326

Open Skoucail opened 2 years ago

Skoucail commented 2 years ago

After updating to version 2.2 (built from the 2.2 branch), all my search queries return duplicate results.

[screenshot: example of the duplicated search results]

After running some FT.PROFILE commands, it looks like the results from both the master and the replica are returned. Maybe the de-duplication on key has gone missing?
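To illustrate what the duplicates look like (this is the same query as in the FT.PROFILE output further down, so the key name is real; the single matching key is counted and returned twice):

>> FT.SEARCH index:track:ids @TrackID:{205020000} NOCONTENT
1) (integer) 2
2) "track:trackid:205020000"
3) "track:trackid:205020000"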

Taymindis commented 2 years ago

If you are OK with restoring your data, just set up your cluster again: remove all the node configs, dump.rdb, AOF files, and so on for all 6 nodes.

Then run --cluster create again. The issue will be gone.
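Roughly what I mean, per node (file names and ports here are just examples, adjust to your own setup):

# stop the node, then remove its state files
rm nodes-6379.conf dump.rdb appendonly.aof
# restart all nodes empty, then rebuild the cluster
redis-cli --cluster create 127.0.0.1:6379 127.0.0.1:6479 127.0.0.1:6579 \
  127.0.0.1:6679 127.0.0.1:6779 127.0.0.1:6879 --cluster-replicas 1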

Skoucail commented 2 years ago

How can I restore a backup if you are asking me to delete the dump.rdb file? We are using Redis as a datastore, not purely as a cache, so the applications cannot recreate all the data.

I can remove and recreate the RediSearch indexes, but that doesn't seem to help.

But this seems to be a pure RSCoordinator issue, as my Redis cluster itself appears to be working correctly.

I can't imagine you are going to ask everyone who upgrades from 2.0 to 2.2 to delete all their data and recreate their clusters...

Taymindis commented 2 years ago

@Skoucail I thought you were reusing the old cluster data from before initializing RSCoordinator.

However, I retested, and after a while the duplicate results came back, so I think it's a bug.

And I'm using version v2.0.10.

Skoucail commented 2 years ago

@Taymindis No problem. We have been using RSCoordinator since the start of the cluster, but we are experiencing some memory issues (that's another issue, #262), which is why we are updating pretty frequently.

We deploy with Docker, so we build from the GitHub source branches. Our build from the 2.0 branch on Jul 12, 2021 (this should be version 2.0.10) worked without issues. After upgrading on Nov 23, 2021 from the 2.2 branch (this should be version 2.2.5), the duplicate results started.
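Roughly, the build/deploy steps look like this (a sketch; the module file name is an example, and the exact build command depends on the repo's build scripts):

git clone -b 2.2 --recurse-submodules https://github.com/RediSearch/RSCoordinator.git
cd RSCoordinator
make   # or whatever the repository's build instructions specify
# each node then loads the resulting module, e.g. in redis.conf:
# loadmodule /path/to/rscoordinator-module.so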

Taymindis commented 2 years ago

@Skoucail

There could be a lot of causes for this issue; one that I can think of might be...

However, I will keep looking into it to make sure the issue doesn't happen again. Hope this is not a bug.

Skoucail commented 2 years ago

@Taymindis Well, when doing an HGETALL (for example) on the data, it works fine. When running FT.PROFILE, it shows a number of shards equal to the number of Redis nodes in the cluster (6 in my case) and finds the data on 2 of them. That part is as expected: in my setup of 3 masters and 3 replicas, the data should be on 2 nodes (1 master and 1 replica).

But I think the bug happens when combining the results from all the nodes, or when deciding which nodes (only masters OR only replicas) should be queried.

>> FT.PROFILE index:track:ids SEARCH QUERY @TrackID:{205020000} NOCONTENT

1) 1) (integer) 2
   2) "track:trackid:205020000"
   3) "track:trackid:205020000"
2)  1) "Shard #1"
    2) 1) "Total profile time"
       2) "0.080000000000000002"
    3) 1) "Parsing time"
       2) "0.036999999999999998"
    4) 1) "Pipeline creation time"
       2) "0.0070000000000000001"
    5) 1) "Iterators profile"
       2) 1) "Type"
          2) "EMPTY"
          3) "Time"
          4) "0.0040000000000000001"
          5) "Counter"
          6) (integer) 0
    6) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.010999999999999999"
          5) "Counter"
          6) (integer) 0
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.0080000000000000002"
          5) "Counter"
          6) (integer) 0
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.0080000000000000002"
          5) "Counter"
          6) (integer) 0
    7) "Shard #2"
    8) 1) "Total profile time"
       2) "0.123"
    9) 1) "Parsing time"
       2) "0.037999999999999999"
   10) 1) "Pipeline creation time"
       2) "0.0089999999999999993"
   11) 1) "Iterators profile"
       2)  1) "Type"
           2) "TAG"
           3) "Term"
           4) "205020000"
           5) "Time"
           6) "0.0080000000000000002"
           7) "Counter"
           8) (integer) 1
           9) "Size"
          10) (integer) 1
   12) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.027"
          5) "Counter"
          6) (integer) 1
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.016999999999999998"
          5) "Counter"
          6) (integer) 1
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.016"
          5) "Counter"
          6) (integer) 1
   13) "Shard #3"
   14) 1) "Total profile time"
       2) "0.10199999999999999"
   15) 1) "Parsing time"
       2) "0.050000000000000003"
   16) 1) "Pipeline creation time"
       2) "0.012"
   17) 1) "Iterators profile"
       2) 1) "Type"
          2) "EMPTY"
          3) "Time"
          4) "0.0040000000000000001"
          5) "Counter"
          6) (integer) 0
   18) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.014"
          5) "Counter"
          6) (integer) 0
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.0079999999999999984"
          5) "Counter"
          6) (integer) 0
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.0080000000000000002"
          5) "Counter"
          6) (integer) 0
   19) "Shard #4"
   20) 1) "Total profile time"
       2) "0.096000000000000002"
   21) 1) "Parsing time"
       2) "0.040000000000000001"
   22) 1) "Pipeline creation time"
       2) "0.0070000000000000001"
   23) 1) "Iterators profile"
       2) 1) "Type"
          2) "EMPTY"
          3) "Time"
          4) "0.012"
          5) "Counter"
          6) (integer) 0
   24) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.02"
          5) "Counter"
          6) (integer) 0
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.0069999999999999993"
          5) "Counter"
          6) (integer) 0
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.0080000000000000036"
          5) "Counter"
          6) (integer) 0
   25) "Shard #5"
   26) 1) "Total profile time"
       2) "0.109"
   27) 1) "Parsing time"
       2) "0.062"
   28) 1) "Pipeline creation time"
       2) "0.0080000000000000002"
   29) 1) "Iterators profile"
       2) 1) "Type"
          2) "EMPTY"
          3) "Time"
          4) "0.0040000000000000001"
          5) "Counter"
          6) (integer) 0
   30) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.012"
          5) "Counter"
          6) (integer) 0
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.0080000000000000002"
          5) "Counter"
          6) (integer) 0
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.0090000000000000011"
          5) "Counter"
          6) (integer) 0
   31) "Shard #6"
   32) 1) "Total profile time"
       2) "0.126"
   33) 1) "Parsing time"
       2) "0.042000000000000003"
   34) 1) "Pipeline creation time"
       2) "0.0089999999999999993"
   35) 1) "Iterators profile"
       2)  1) "Type"
           2) "TAG"
           3) "Term"
           4) "205020000"
           5) "Time"
           6) "0.0080000000000000002"
           7) "Counter"
           8) (integer) 1
           9) "Size"
          10) (integer) 1
   36) 1) "Result processors profile"
       2) 1) "Type"
          2) "Index"
          3) "Time"
          4) "0.025999999999999999"
          5) "Counter"
          6) (integer) 1
       3) 1) "Type"
          2) "Scorer"
          3) "Time"
          4) "0.016000000000000004"
          5) "Counter"
          6) (integer) 1
       4) 1) "Type"
          2) "Sorter"
          3) "Time"
          4) "0.017999999999999995"
          5) "Counter"
          6) (integer) 1
   37) "Coordinator"
   38) 1) "Total Coordinator time"
       2) "0.81899999999999995"
   39) 1) "Post Proccessing time"
       2) "0.041000000000000002"
Taymindis commented 2 years ago

@Skoucail

Do you mind telling me the exact branch/version that you built from? Let me try on my side, as I have a very simple test to check whether it returns multiple results.
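For reference, the kind of simple check I mean (index and key names here are made up for the example; on a healthy cluster the key comes back once, on the broken one it comes back twice):

>> FT.CREATE idx ON HASH PREFIX 1 doc: SCHEMA name TAG
OK
>> HSET doc:1 name hello
(integer) 1
>> FT.SEARCH idx @name:{hello} NOCONTENT
1) (integer) 1
2) "doc:1"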

Skoucail commented 2 years ago

@Taymindis There have been no commits since yesterday, so basically the 2.2 branch:

git clone -b 2.2 --recurse-submodules https://github.com/RediSearch/RSCoordinator.git

Taymindis commented 2 years ago

[screenshot: test result on v2.2.0]

This is the result for v2.2.0; it looked OK to me, also with 3 masters and 3 replicas.

But after I restarted all 6 nodes, the results became duplicated. This also happens on v2.0. [screenshot]

Skoucail commented 2 years ago

@Taymindis

So I guess we have confirmed there is a bug. Seeing that it only happens on clusters, the chance that it is RSCoordinator related, and not Redis or RediSearch, is big.

Taymindis commented 2 years ago

@Skoucail

I just realized I had set up the wrong config for the persistence file, which caused the duplicated data. I have reconfigured it, and now everything is working fine on v2.2.

Skoucail commented 2 years ago

@Taymindis

Can you explain what I need to check/change? Because for me it only started failing after updating to v2.2.

Taymindis commented 2 years ago

@Skoucail

I have no idea how you configure the datastore, as I don't have experience with that.

My configuration is very basic:

6 nodes (3 masters, 3 slaves), redis-6379.conf to redis-6879.conf.
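Something along these lines per node (a sketch; ports assumed to step by 100 to match the file names above, and the loadmodule line depends on your build and module arguments):

for port in 6379 6479 6579 6679 6779 6879; do
  cat > redis-$port.conf <<EOF
port $port
cluster-enabled yes
cluster-config-file nodes-$port.conf
appendonly yes
# loadmodule /path/to/rscoordinator-module.so (plus whatever module arguments your setup needs)
EOF
done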

Skoucail commented 2 years ago

@Taymindis

We are running each Redis node in a separate Docker container, so the dump file, AOF, and node config are unique per node. And we are still getting the duplicates...

And again, we had no issues on v2.0, so some bug must have been introduced in v2.2.

Taymindis commented 2 years ago

@Skoucail

But now I have no issues on v2.2 (I am not running Docker).

Probably you are right; it could be an issue with running Docker.

Skoucail commented 2 years ago

@Taymindis

What version of Redis are you using? We are using 6.2.6. Did you try reboots and failovers? Maybe those are screwing things up.
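What I mean by exercising a failover, roughly (ports are just examples):

# on a replica node: promote it to master
redis-cli -p 6679 CLUSTER FAILOVER
# then check the new topology
redis-cli -p 6379 CLUSTER NODES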

Is it normal that RSCoordinator queries all nodes (masters and slaves) to get results? I would think that, for performance, it would only query the masters, or only the slaves, instead of all nodes.

Otherwise I don't know either. We were running 3 identical RSCoordinator clusters (alpha, beta, and prod). We upgraded alpha to v2.2 and got duplicate results on every FT.SEARCH. The other 2 clusters are running 2.0 (2.0.10) and have zero issues.

Downgrading also seems to be an issue: if I try to go back to v2.0, Redis just doesn't start up anymore, with some module-load errors.

Taymindis commented 2 years ago

@Skoucail

My Redis version is:

Redis server v=6.2.6 sha=00000000:0 malloc=libc bits=64 build=4642fd5b30d113c8

I have rebooted the 3 masters and 3 slaves one by one and failed over back and forth. So far so good.

The only difference is that you have a datastore. I am not sure how you configure it; maybe you have some reference setup that I can try?

Should we chat on the Redis Discord?

Skoucail commented 2 years ago

@Taymindis

I joined the Discord (nickname Skoucail). By "datastore" I just mean that we are not using Redis as a cache, so we are not really able to just delete everything and set up the cluster again, as that would mean data loss.

Skoucail commented 2 years ago

@ashtul

Can you have a look at this? Our best guess is that, for some reason, in my cluster RSCoordinator queries all nodes (masters and slaves), resulting in FT.PROFILE showing 6 shards (where 2 of them return the same result).

Taymindis has set up a clean cluster (6 nodes, 3 masters, 3 slaves) with just a few keys in it. When he runs FT.PROFILE, it shows only 3 shards (not 6), with 1 shard returning the result.

After some code digging I did find a comment in FlatSearchCommandHandler [screenshot] saying it should only be executed on the masters. But in the MR_SetCoordinationStrategy call, MRCluster_MastersOnly isn't passed: https://github.com/RediSearch/RSCoordinator/blob/2d23987cbfd266965d9b35329dc5d4252365471e/src/module.c#L1228

This is the same in v2.0 though, so I'm not sure this is the issue... A lot has changed in v2.2, like the new DistSearchCommand function with ConcurrentSearch_ThreadPoolRun. Maybe the combination of using a thread pool and the missing MastersOnly flag is the problem...
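A quick way to double-check that from a checkout of the repo (just grepping for the identifiers mentioned above):

grep -rn "MRCluster_MastersOnly" src/
grep -n "MR_SetCoordinationStrategy" src/module.c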

Skoucail commented 2 years ago

Btw, downgrading gives this error on Redis startup:

# Guru Meditation: Error loading data from RDB (short read or EOF). Read performed by module 'search' about type 'ft_index0' after reading '0' bytes of a value for key named: '(null)'. #module.c:4608