Skoucail opened this issue 3 years ago
If you are OK with restoring your data, just set up your cluster again: remove all six nodes' config files, dump.rdb, AOF and so on, then run `--cluster create` again. The issue will be gone.
How can I restore a backup if you are asking me to delete the dump.rdb file? We are using Redis as a datastore, not purely as a cache, so the applications cannot recreate all the data.
I can remove and recreate the RediSearch indexes, but that doesn't seem to help.
This seems to be a pure RSCoordinator issue, as my Redis cluster itself appears to be working correctly.
I can't imagine you are going to ask everyone upgrading from 2.0 to 2.2 to delete all their data and recreate their clusters...
@Skoucail I thought you were reusing the old clustered data from before the RSCoordinator init.
However, I retested, and after a while multiple results came back; I think it's a bug.
And I'm using version v2.0.10.
@Taymindis No problem. We have been using RSCoordinator since the start of the cluster, but we are experiencing some memory issues (that's a separate issue, #262), and because of that we update pretty frequently.
We deploy with Docker, so we build from the GitHub source branches. Our build from the 2.0 branch on Jul 12, 2021 (this should be version 2.0.10) worked without issues. After upgrading on Nov 23, 2021 to a build from the 2.2 branch (this should be version 2.2.5), the double results started.
@Skoucail
There could be a lot of consequences of this issue happening; one that I can think of might be
However, I will keep looking into it to ensure no issue happens again. Hope this is not a bug.
@Taymindis Well, when doing an HGETALL (for example) on the data, it works. When running FT.PROFILE, it shows a number of shards equal to the number of Redis nodes in the cluster (in my case 6) and finds the data on 2 of them (in my case). Again normal: in my setup of 3 masters and 3 replicas, the data should be on 2 nodes (1 master and 1 replica).
But I think the bug happens when combining the results from all the nodes, or when deciding which nodes (only masters OR slaves) it should query.
>> FT.PROFILE index:track:ids SEARCH QUERY @TrackID:{205020000} NOCONTENT
1) 1) (integer) 2
2) "track:trackid:205020000"
3) "track:trackid:205020000"
2) 1) "Shard #1"
2) 1) "Total profile time"
2) "0.080000000000000002"
3) 1) "Parsing time"
2) "0.036999999999999998"
4) 1) "Pipeline creation time"
2) "0.0070000000000000001"
5) 1) "Iterators profile"
2) 1) "Type"
2) "EMPTY"
3) "Time"
4) "0.0040000000000000001"
5) "Counter"
6) (integer) 0
6) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.010999999999999999"
5) "Counter"
6) (integer) 0
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.0080000000000000002"
5) "Counter"
6) (integer) 0
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.0080000000000000002"
5) "Counter"
6) (integer) 0
7) "Shard #2"
8) 1) "Total profile time"
2) "0.123"
9) 1) "Parsing time"
2) "0.037999999999999999"
10) 1) "Pipeline creation time"
2) "0.0089999999999999993"
11) 1) "Iterators profile"
2) 1) "Type"
2) "TAG"
3) "Term"
4) "205020000"
5) "Time"
6) "0.0080000000000000002"
7) "Counter"
8) (integer) 1
9) "Size"
10) (integer) 1
12) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.027"
5) "Counter"
6) (integer) 1
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.016999999999999998"
5) "Counter"
6) (integer) 1
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.016"
5) "Counter"
6) (integer) 1
13) "Shard #3"
14) 1) "Total profile time"
2) "0.10199999999999999"
15) 1) "Parsing time"
2) "0.050000000000000003"
16) 1) "Pipeline creation time"
2) "0.012"
17) 1) "Iterators profile"
2) 1) "Type"
2) "EMPTY"
3) "Time"
4) "0.0040000000000000001"
5) "Counter"
6) (integer) 0
18) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.014"
5) "Counter"
6) (integer) 0
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.0079999999999999984"
5) "Counter"
6) (integer) 0
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.0080000000000000002"
5) "Counter"
6) (integer) 0
19) "Shard #4"
20) 1) "Total profile time"
2) "0.096000000000000002"
21) 1) "Parsing time"
2) "0.040000000000000001"
22) 1) "Pipeline creation time"
2) "0.0070000000000000001"
23) 1) "Iterators profile"
2) 1) "Type"
2) "EMPTY"
3) "Time"
4) "0.012"
5) "Counter"
6) (integer) 0
24) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.02"
5) "Counter"
6) (integer) 0
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.0069999999999999993"
5) "Counter"
6) (integer) 0
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.0080000000000000036"
5) "Counter"
6) (integer) 0
25) "Shard #5"
26) 1) "Total profile time"
2) "0.109"
27) 1) "Parsing time"
2) "0.062"
28) 1) "Pipeline creation time"
2) "0.0080000000000000002"
29) 1) "Iterators profile"
2) 1) "Type"
2) "EMPTY"
3) "Time"
4) "0.0040000000000000001"
5) "Counter"
6) (integer) 0
30) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.012"
5) "Counter"
6) (integer) 0
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.0080000000000000002"
5) "Counter"
6) (integer) 0
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.0090000000000000011"
5) "Counter"
6) (integer) 0
31) "Shard #6"
32) 1) "Total profile time"
2) "0.126"
33) 1) "Parsing time"
2) "0.042000000000000003"
34) 1) "Pipeline creation time"
2) "0.0089999999999999993"
35) 1) "Iterators profile"
2) 1) "Type"
2) "TAG"
3) "Term"
4) "205020000"
5) "Time"
6) "0.0080000000000000002"
7) "Counter"
8) (integer) 1
9) "Size"
10) (integer) 1
36) 1) "Result processors profile"
2) 1) "Type"
2) "Index"
3) "Time"
4) "0.025999999999999999"
5) "Counter"
6) (integer) 1
3) 1) "Type"
2) "Scorer"
3) "Time"
4) "0.016000000000000004"
5) "Counter"
6) (integer) 1
4) 1) "Type"
2) "Sorter"
3) "Time"
4) "0.017999999999999995"
5) "Counter"
6) (integer) 1
37) "Coordinator"
38) 1) "Total Coordinator time"
2) "0.81899999999999995"
39) 1) "Post Proccessing time"
2) "0.041000000000000002"
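The suspected failure mode above (a master and its replica both answering, with no distinct-by-key step in the merge) can be sketched in a few lines. This is an illustrative model only, not RSCoordinator's actual merge code; `merge_shard_results` is a made-up name:

```python
def merge_shard_results(shard_results):
    """Merge the result lists coming back from each queried shard.

    If the coordinator fans out to both a master and its replica, the
    same document key arrives twice; without this de-dup step the
    caller sees doubled results.
    """
    merged = []
    seen = set()
    for results in shard_results:
        for key in results:
            if key not in seen:  # the distinct-on-key step the bug seems to skip
                seen.add(key)
                merged.append(key)
    return merged

# A master and its replica both answered with the same document key,
# the other four nodes returned nothing (as in the profile above):
shards = [[], ["track:trackid:205020000"], [], [],
          [], ["track:trackid:205020000"]]
print(merge_shard_results(shards))  # only one copy survives the merge
```

Dropping the `seen` check reproduces exactly the doubled results reported in this thread.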
@Skoucail
Do you mind telling me the exact branch/version that you built from? Let me try it on my side, as I have a very simple test to check whether it returns multiple results.
@Taymindis There have been no commits since yesterday, so basically the 2.2 branch:
git clone -b 2.2 --recurse-submodules https://github.com/RediSearch/RSCoordinator.git
This is the result for v2.2.0; it seemed OK to me, also with 3 masters and 3 replicas.
But after I restart all 6 nodes, the results become duplicated. This also happens on v2.0.
@Taymindis
So I guess we confirmed there is a bug. Seeing that it only happens on clusters, the chance that it's RSCoordinator related, and not Redis or RediSearch, is big.
@Skoucail
I just realized I had set up the wrong config for the persistence file, which caused the duplicate data. I have reconfigured it, and now everything is working fine on v2.2.
@Taymindis
Can you explain what I need to check/change? Because for me it only started failing after updating to v2.2.
@Skoucail
I have no idea how you configure the datastore, as I don't have experience with that.
My configuration is very basic:
6 nodes (3 masters, 3 slaves), redis-6379.conf to 6879.conf
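For context, a minimal per-node cluster config along those lines might look like the fragment below. The module path and file names are placeholders, not taken from either setup:

```
# redis-6379.conf — one such file per node, with its own port
port 6379
cluster-enabled yes
cluster-config-file nodes-6379.conf
appendonly yes
dbfilename dump-6379.rdb
loadmodule /path/to/rscoordinator-module.so
```

The point the thread keeps coming back to is that `cluster-config-file`, `dbfilename` and the AOF must be unique per node; sharing any of them between nodes can corrupt the cluster state.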
@Taymindis
We are running each Redis node in a separate Docker container, so the dump.rdb file, AOF and node config are unique per node. And we are still getting the doubles...
And again, we had no issues on v2.0, so some bug must have been introduced in v2.2.
@Skoucail
But now I have no issue on v2.2 (I am not running Docker).
You are probably right; it could be an issue with running Docker.
@Taymindis
What version of Redis are you using? We are using 6.2.6. Did you try reboots and failovers? Maybe those are screwing things up.
Is it normal that RSCoordinator goes to all nodes (masters and slaves) to get results? I would think that, for performance, it would only go to the masters or the slaves instead of all nodes.
Otherwise I don't know either. We were running 3 identical RSCoordinator clusters (alpha, beta and prod). We upgraded alpha to v2.2 => issues with doubles on all FT.SEARCH queries. The other 2 clusters are running 2.0 (2.0.10) and have zero issues.
Also, downgrading seems to be an issue: if I try to go back to v2.0, Redis just doesn't start up anymore, with some module-failed errors.
@Skoucail
My Redis Version is
Redis server v=6.2.6 sha=00000000:0 malloc=libc bits=64 build=4642fd5b30d113c8
I had 3 masters and 3 slaves, rebooted them one by one, and failed over back and forth. So far so good.
The only difference is that you have a datastore. I am not sure how you configure it; maybe you have some reference setup that I can reproduce?
Should we chat on the Redis Discord?
@Taymindis
I joined the Discord (nickname Skoucail). With "datastore" I just mean that we are not using Redis as a cache, so we are not really able to just delete everything and re-set up the cluster, as that would mean data loss.
@ashtul
Can you have a look at this? Our best guess is that, for some reason, in my cluster RSCoordinator queries all nodes (slaves and masters), resulting in FT.PROFILE showing 6 shards (where 2 of them give the same result).
Taymindis has set up a clean cluster (6 nodes: 3 masters, 3 slaves) with just a few keys in it. When he does an FT.PROFILE, it shows only 3 shards (not 6), with 1 shard giving the result.
After some code digging I found a comment in FlatSearchCommandHandler saying it should only be executed on the masters, but MRCluster_MastersOnly isn't passed in the MR_SetCoordinationStrategy call: https://github.com/RediSearch/RSCoordinator/blob/2d23987cbfd266965d9b35329dc5d4252365471e/src/module.c#L1228 This is the same in v2.0 though, so I'm not sure this is the issue... A lot has changed in v2.2, like the new function DistSearchCommand with ConcurrentSearch_ThreadPoolRun. Maybe the combination of the thread pool and the missing MastersOnly flag is the problem...
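The effect of that missing flag can be modelled abstractly. This is a toy sketch, not the actual RSCoordinator API: `MASTERS_ONLY`, `Node` and `pick_targets` are invented names standing in for the coordination-strategy logic around MRCluster_MastersOnly:

```python
from dataclasses import dataclass

MASTERS_ONLY = 0x1  # toy stand-in for the MRCluster_MastersOnly flag


@dataclass
class Node:
    name: str
    is_master: bool


def pick_targets(nodes, strategy_flags):
    """Choose which cluster nodes a search command is fanned out to."""
    if strategy_flags & MASTERS_ONLY:
        return [n for n in nodes if n.is_master]
    return list(nodes)  # without the flag, replicas are queried too


cluster = [Node("m1", True), Node("s1", False),
           Node("m2", True), Node("s2", False),
           Node("m3", True), Node("s3", False)]

print(len(pick_targets(cluster, MASTERS_ONLY)))  # 3 shards
print(len(pick_targets(cluster, 0)))             # 6 shards
```

Fanning out without the flag matches the observed FT.PROFILE output: 6 shards answering, with each document found on both its master and its replica.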
Btw, downgrading gives this error on Redis startup:
# Guru Meditation: Error loading data from RDB (short read or EOF). Read performed by module 'search' about type 'ft_index0' after reading '0' bytes of a value for key named: '(null)'. #module.c:4608
After updating to version 2.2 (built from the 2.2 branch), all my search queries give double results.
Example:
After running some FT.PROFILE commands, it looks like the results from both the master and the replica are returned. Maybe the distinct-on-key step has gone missing?