Possible data loss following node restart

aphyr commented 9 years ago

Let me preface this with: my test could be wrong. I may be misunderstanding Disque's persistence guarantees, or misusing the API, or maybe the test code is wrong somehow... but I think I've ruled out the most obvious mistakes at this point, so I'd like y'all's feedback!

Config file: https://github.com/aphyr/jepsen/blob/efa09187d710e6937bf4b494bc114b9093c42d93/disque/resources/disque.conf Test code: https://github.com/aphyr/jepsen/blob/9a164a18aec97a56012cb1464a6438bcec327634/disque/src/jepsen/disque.clj

This test runs on a five-node disque cluster: inter-node latency is effectively zero, and the network is perfectly reliable. There are no partitions in this test. I've lowered the cluster node timeout to 100ms to speed up cluster convergence.

In this test, five clients perform a mix of enqueues and dequeues (GETJOB followed by ACKJOB) with unique integers as payloads. Clients use a 100-millisecond timeout for all operations, a retry interval of 1 second, and replication factor 3. They perform a random mixture of enqueues and dequeues for 30 seconds, while we restart a single node at a time--with 10 seconds of downtime and 10 seconds of uptime in between. After those failures, we bring every node up, give them 5 seconds to heal and proceed with normal operations, then drain the queues by issuing dequeues on every node until no more jobs come.

I've enabled AOF and set appendfsync to everysec; regardless, this test only kills two nodes, so a replication factor of three should allow every job to remain resident in RAM.

This test appears to lose enqueued jobs.

{:valid? false,
 :queue {:valid? true, :final-queue {:pending #{13 25 35 19}}},
 :total-queue
 {:valid? false,
  :lost #{13 19},
  :unexpected #{},
  :recovered #{},
  :ok-frac 35/37,
  :unexpected-frac 0,
  :lost-frac 1/37,
  :recovered-frac 0},
 :latency {:valid? true, :file "report//latency.png"}}

As you can see in the full history, job 13 is successfully enqueued but never appears in any dequeue.

latency

The latency plot shows a few enqueue/dequeue crashes during the first node's downtime, and a couple after the second, but since at least three nodes were online continuously through this full test, and all clocks are synchronized, and the retry interval is only 100ms, I expect that every enqueued job should arrive.

You can replicate this by cloning Jepsen 9a164a18aec97a56012cb1464a6438bcec327634 and running lein test in the disque directory.

Any idea what might be going on here?

antirez commented 9 years ago

Hello @aphyr. Let me preface this: thank you a lot for testing Disque. It is the best thing we can get at this stage, since there is time to fix stuff. We can catch logical and implementation errors at this point where they do no harm to actual users, in order to ship something more battle tested than the usual first release.

That said, before to dig further, reading the test code I believe you may be facing the lame defaults that I never actually fixed, and I would like to have some hint from you about this. So here are our defaults:

Replication default is 3 nodes if there are 3 or more nodes. However if there are 2 or 1 nodes, it will try to replicate to all the nodes. (this default should be sane).
TTL is arbitrarily fixed to 1 day if not specified. Probably NOT a sane default. Better to be more conservative?
And finally, the RETRY time which may be the cause of this bug, is set by default if not specified to TTL/10. It means that failed deliveries (and restarting a node is equivalent to failed deliveries, so the messages will be enqueued again later) will be retried again in 2.4 hours by default. So the test needs to run forever... in order for the eventual property of message successful delivery to materialize.

I think those are silly default but I want to get some feedback from messaging system users about what are sane ones. Probably the RETRY value should not be bound to TTL but just some fixed smaller time, and TTL should be ways larger, to have a safer default.

If this is not the case and actually your code already uses a small retry time, I'll investigate further the next possible cause. Otherwise you may try a RETRY small enough for Jepsen to wait at least N times the retry time set before to sense a message as lost forever.

Thanks!

Moreover, there are (probably unrelated issues to this problem you signaled) bugs that are fixed in pending PRs, I'll fix them in the next hours so that maybe in the next test you can already use less buggy code.

I'm ready yo provide any support needed for your testing. Please feel free to ping me anything via email or other means when you have any kind of doubt, since our documentation is extremely laking at this point.

antirez commented 9 years ago

Hi again @aphyr, in the previous hours I merged the fix for an unrelated bug, and added a new test in unit 07 which is very similar to the test you reported in this issue. The test is introduced by commit f00dd07 but the gist of it is:

Add a job with replication 3.
Mass restart all the nodes (with AOF one-sec enabled).
Check that the job is still delivered.

Note that even if AOF has a weak fsync policy, in the test only the processes are crashed, not the whole computer, so the AOF must be able to guarantee that all the data is there, since it only issues the write on the file before sending a positive reply to the client. However for semantical simplicity we will publicize that AOF retains all data only when fsync is set to always, and there are otherwise windows for data loss. But for the test I prefer to use everysec in order to test that writes are always performed before re-entering the event loop and sending replies to clients (that are otherwise just kept in the buffers).

aphyr commented 9 years ago

Replication default is 3 nodes if there are 3 or more nodes. However if there are 2 or 1 nodes, it will try to replicate to all the nodes. (this default should be sane).

I agree--probably best to return an error if someone requests a replication factor greater than the cluster size.

TTL is arbitrarily fixed to 1 day if not specified. Probably NOT a sane default. Better to be more conservative?

Yeah, I'd probably default to infinite TTL. I chose a finite one for Riemann and have been kinda regretting it ever since.

And finally, the RETRY time which may be the cause of this bug, is set by default if not specified to TTL/10. It means that failed deliveries (and restarting a node is equivalent to failed deliveries, so the messages will be enqueued again later) will be retried again in 2.4 hours by default. So the test needs to run forever... in order for the eventual property of message successful delivery to materialize.

So I set RETRY to 1 second: https://github.com/aphyr/jepsen/blob/master/disque/src/jepsen/disque.clj#L256, but the Jedisque client could be broken. I'll pull out wireshark and look at the actual packets when I get home tonight.

antirez commented 9 years ago

Thanks for the reply, given that you already set it to 1 and the obvious issue is probably ruled out, I'll try to reproduce with Jepsen so that I can help in trying to figure out where is the bug. For sure it is in some place, and given that Disque is the biggest amount of code it may be in the server itself very likely. I'll post new updates tonight (my TZ) once I run the test.

antirez commented 9 years ago

While trying to setup Jepsen and reproduce, replying to your prev message in a better way:

I agree--probably best to return an error if someone requests a replication factor greater than the cluster size.

Yes this is what happens. Actually it is stricter than that: even if failure detection in this kind of system is not strictly needed, Disque nodes perform failure detection in order to make more interesting choices during synchronous replication (if a node is not available, don't try to replicate to it, in order to minimize latency). So when an ADDJOB command is received with a given replication level, if there are apparently less nodes sensed as available compared to the requested replication level, an error is issued.

Yeah, I'd probably default to infinite TTL. I chose a finite one for Riemann and have been kinda regretting it ever since.

Yes... looks like a good idea. There is no concept of infinity in Disque but at least I can get as far as a 32bit time_t allows me to get, which is, year 2036.

Thanks for the feedbacks.

antirez commented 9 years ago

Hey @aphyr, thanks to the Docker support I was able to get Jepsen up and running on a Linux box in a small amount of time, however the Disque test starts correctly but fails as soon as it enters the drain stage with this stack trace.

I tried to add a few info calls in order to debug the issue, and apparently for some reason dequeue! which usually gets as op argument just a map, during the failure gets instead a list of maps, so assoc fails with an error because of type mismatch. Please take this with a grain of salt since my Clojure is fragile, just reminiscences of my past Scheme experiences.

So I was not able to investigate further since I can't fix this problem. Any help would be appreciated, thanks.

antirez commented 9 years ago

UPDATE for readers from the future, about the stack trace when trying to run the test: we solved this via Twitter. I needed to upgrade the Jepsen core since the Disque test relayed on changes on the Jepsen core. So when installing Jepsen and upgrading it, it is a good idea to run:

cd jepsen/jepsen; lein install

So now I can finally run the Jepsen test and I get the same failure :tada:. Tonight I'll investigate the cause of this issue. Updates ASAP.

antirez commented 9 years ago

Hello @aphyr, finally some news. I found a few things with the test, however even after fixing them, I can't get a clean run for some reason.

So the problems with the test are:

Retry is set to 1000, but this is supposed to be in seconds, so it's 1000 seconds. Just set it to 1.
The drain stage assumes that as soon as GETJOB returns NULL, the queue is empty, this is not the case since the queue may be empty at T1 but the job could be active, and scheduled again for delivery at T2. The fix for this is just to wait at least RETRY*SOME_CONSTANT during drain.

Note that issue 2 happens rarely in practice (but I verified that some failed runs are actually up to that) after fixing 1.

Even after fixing the above problems, I still get error reports like this:

:valid? false,
 :queue {:valid? false, :error "can't dequeue 16"},
 :total-queue
 {:valid? false,
  :lost #{},
  :unexpected #{69 69 75 32 22 22 66 82 78 16 16 18 37},
  :recovered #{},
  :ok-frac 120/127,
  :unexpected-frac 13/127,
  :lost-frac 0,
  :recovered-frac 0},
 :latency {:valid? true, :file "report//latency.png"}}

Even if in the history I see things like this:

 {:value 16, :time 12221118319, :process 4, :type :ok, :f :dequeue}

However the jobs are no longer listed in the lost list, but in the unexpected which I'm not sure what it means. Maybe that a successive delivery of the same job happened?

I hope that the above provides some clue for you, in order to better understand what happens. I appreciate the help you provided during this days, I used extensively the tips you provided me on Twitter, and the ability to put a "wait any key" pause into the test code was very useful.

aphyr commented 9 years ago

Oh, right! Thanks for catching that milliseconds thing. I'll see about introducing a longer waiting period. The unexpected list is just telling us about duplicate deliveries: it thought 13 had already been delivered and so it didn't expect to see it again.

aphyr commented 9 years ago

[duplicate delivery is totally normal behavior in this case; I'll adjust the test to allow it]

aphyr commented 9 years ago

Looks solid for single-node restarts. :)

latency

antirez commented 9 years ago

Great, thank you @aphyr. Let's see what happens with delays and partitions... :-)

antirez / disque

Possible data loss following node restart #92