antirez / disque

Disque is a distributed message broker
BSD 3-Clause "New" or "Revised" License

Jobs not replicated across all nodes. #166

Open nadarkathiresh opened 8 years ago

nadarkathiresh commented 8 years ago

These are the steps I took.

  1. Start two disque-server instances and join them with CLUSTER MEET.
  2. Add jobs to one server with REPLICATE 2, using a 20 KB payload.
  3. I added 100 thousand jobs to the first server in 30 seconds.
  4. When I check INFO on both servers, the server I pushed to shows 100 thousand jobs, but the other server always shows far fewer, varying from 3K to 40K.

Is this a bug, or does it require some config tweaking of Disque?

There is a network latency of around 250ms between both nodes.

Thanks Kathiresh

antirez commented 8 years ago

Hello, thanks for reporting, what command do you use to add jobs exactly? What field do you check to see the number of jobs? The INFO field or the queue length? Thanks.

nadarkathiresh commented 8 years ago

Hi Antirez,

I am using a Perl library Disque (http://search.cpan.org/~lovelle/Disque-0.01/lib/Disque.pm).

The command is below; $str holds the 20 KB payload.

$disque->add_job("list_job_queue",$str,0,"REPLICATE","2", "ASYNC");

Thanks

antirez commented 8 years ago

OK, the problem is that, with ASYNC, you are asking for best-effort replication. For example, if a node runs out of memory (the default maxmemory is 1GB), it will discard the job silently and you'll end up with just one copy. Similarly, if the link between the nodes disconnects and reconnects, you lose a copy of all the jobs added during that time.

nadarkathiresh commented 8 years ago

I have given 10 GB of memory in both nodes' configs. Also, both nodes are in good data centers with 1 Gbps connectivity. I will try to reproduce the same thing in a single data center and check. Sometimes I also get this error.

[ADDJOB] NOREPL Not enough reachable nodes for the requested replication level, at /usr/local/share/perl5/Disque.pm line 219. at disq_push.pl line 23.

antirez commented 8 years ago

@nadarkathiresh that's a good hint. If from time to time Disque gives you this error, it means that there are moments when the two nodes cannot talk for an extended amount of time (check the node timeout setting you are using). This is likely closely related to the other issue.

Another thing you can try is to remove the ASYNC just for testing, and set a timeout of 2000 (2 seconds), so that you can check if there are jobs that fail to be replicated within two seconds.
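As a rough sketch of the two variants being compared, the snippet below builds the ADDJOB argument list for the original ASYNC call and for the suggested synchronous call with a 2000 ms timeout. The helper name `addjob_args` is hypothetical and only assembles arguments in the order `ADDJOB queue job ms-timeout REPLICATE count [ASYNC]`; actually sending them needs a Disque client, which is not shown here.

```python
from typing import List


def addjob_args(queue: str, body: str, timeout_ms: int,
                replicate: int, asynchronous: bool = False) -> List[str]:
    """Build the argument list for an ADDJOB call (hypothetical helper)."""
    args = ["ADDJOB", queue, body, str(timeout_ms), "REPLICATE", str(replicate)]
    if asynchronous:
        # ASYNC asks for best-effort replication: the server does not wait
        # for the other node to acknowledge the job.
        args.append("ASYNC")
    return args


# The call used so far in this thread: timeout 0, REPLICATE 2, ASYNC.
async_call = addjob_args("list_job_queue", "<payload>", 0, 2, asynchronous=True)

# The suggested test variant: no ASYNC, 2000 ms timeout.
sync_call = addjob_args("list_job_queue", "<payload>", 2000, 2)

print(" ".join(async_call))
print(" ".join(sync_call))
```

With the second form, a job that cannot reach the requested replication level within the timeout should surface as an error to the client instead of being dropped silently.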

nadarkathiresh commented 8 years ago

OK, I will try that.

nadarkathiresh commented 8 years ago

I removed ASYNC and set the timeout to 2000 (2 seconds). The jobs were replicated properly, and it took 4 minutes for 1000 jobs.

Next I added ASYNC back and tested again with the 2-second timeout, but only 7K jobs were replicated and no error was received. On the master node, 100,000 requests are added in 30 seconds.

Is there any option to retry failed job replication?

antirez commented 8 years ago

4 minutes for 1000 jobs is expected given that your RTT is 250 milliseconds:

(0.25*1000)/60 = 4.1666666666

So this looks OK. However, without ASYNC, Disque will retry even if the connection goes down. With ASYNC the timeout is meaningless: the server will not wait at all for replies, it will attempt a best-effort replication. It sends a copy to the specified number of nodes, without ever caring whether the target node received the job or not. So there is no way to retry. When you want to be sure, normally you use synchronous replication, which provides the desired level of safety.
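The 4-minute estimate can be reproduced with a few lines. This is only a back-of-the-envelope sketch: it assumes a single client that adds jobs one at a time and waits one full round trip (0.25 s) per job, which is an assumption about the test script rather than something stated in the thread.

```python
rtt_seconds = 0.25   # round-trip time between the two nodes
jobs = 1000          # jobs added in the synchronous test

# One round trip per job, sent one after another.
total_seconds = rtt_seconds * jobs
total_minutes = total_seconds / 60

# Upper bound on throughput for one sequential client.
jobs_per_second = 1 / rtt_seconds

print(f"{total_minutes:.2f} minutes for {jobs} jobs")
print(f"{jobs_per_second:.0f} jobs per second")
```

Under these assumptions the result is roughly 4.17 minutes, or about 4 jobs per second for a single sequential client.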

However, what you are observing here is a lot of failed jobs in normal conditions, without network partitions AFAIK. So perhaps the Disque nodes are connecting and disconnecting continuously? This could be the case if, for example, the node timeout setting is wrong. You could run the CLUSTER NODES command multiple times to check whether the nodes ever go into the disconnected state.

nadarkathiresh commented 8 years ago

A speed of 4 req/sec will not serve my purpose, as I will be getting thousands of requests per second and the data can only be processed on the server at the other end. Currently we cannot move it closer to the server that accepts the requests.

Is there any way to achieve this using Redis/Disque?

Thanks.

antirez commented 8 years ago

I don't mean that you must switch to synchronous replication. What I mean is that ASYNC plus serious node-to-node link issues will result in lost jobs. Today I'll try a setup similar to yours across a WAN to see what my results are, but I expect all the jobs to be replicated most of the time. More news ASAP.

antirez commented 8 years ago

@nadarkathiresh I tried to reproduce the issue, and indeed, while not with the same severity, I experienced a 10% job loss for some reason: either on WANs with high traffic the nodes disconnect because of a failure in the failure detector (the link is busy transferring other stuff, so pings cannot be exchanged), or something like that. Investigating what the problem could be...

antirez commented 8 years ago

Hello @nadarkathiresh, please could you check, when this happens, what is the outgoing bandwidth you are using between the two nodes, and if actually the bandwidth reaches the max you can use? Thanks.

nadarkathiresh commented 8 years ago

Hello @antirez, yes, I think the bandwidth is reaching the maximum allowed. I get about 10 MB/s when I download between the servers, so bandwidth will be a limitation between the two servers. How do we tackle this problem? My total payload comes to 2 GB for 100 thousand requests, which complete in 30 seconds, so I would need a line of roughly 70 MB/s for instant sync across the data centers.

Thanks Kathiresh
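The bandwidth figures in the comment above can be checked with a short calculation. This is a sketch under stated assumptions: decimal units (1 KB = 1000 bytes), the measured 10 MB/s treated as the usable link rate, and protocol overhead ignored.

```python
jobs = 100_000
payload_bytes = 20 * 1000                 # 20 KB per job (decimal units assumed)
window_seconds = 30                       # time taken to add all jobs
link_bytes_per_second = 10 * 1000 * 1000  # ~10 MB/s measured between servers

# Total data that has to cross the link for one extra copy of every job.
total_bytes = jobs * payload_bytes

# Rate needed to replicate as fast as jobs are added.
required_rate = total_bytes / window_seconds

# Time the measured link would need to carry the same data.
drain_seconds = total_bytes / link_bytes_per_second

print(f"total payload: {total_bytes / 1e9:.1f} GB")
print(f"required rate: {required_rate / 1e6:.1f} MB/s")
print(f"time at measured link rate: {drain_seconds:.0f} s")
```

Under these assumptions the required rate is about 67 MB/s, in line with the roughly 70 MB/s quoted, while the measured link would need about 200 seconds to carry what is produced in 30.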

nadarkathiresh commented 8 years ago

Hello @antirez, any update on this bug?

stevenross commented 8 years ago

@antirez What was your setup to reproduce this bug? I'm looking at using Disque but this issue worries me. If I can reproduce this bug locally hopefully I can contribute to the project with a fix.