Closed by aphyr 8 years ago.
Potentially related: the worst data-loss incidents seem related to long-lasting split-brain, where after a 120-second partition isolating a single primary is resolved, the original node and a newly elected node both believe they're the primary--even after exchanging traffic.
Here's a particularly rough case, losing 25% of documents with only primary-isolating partitions:
{:valid? false,
:lost
"#{50..53 125..128 130..149 151..170 172..194 196..217 219..239 241..263 265..286 288..309 311..331 333..355 357..378 380..392 394}",
:recovered
"#{400 448..449 451 453..454 456..459 461..464 466..469 471..474 476..484 486..489 491..494 496..498 516 619 624 629 634 639 644 647..649 651..654 656..659 661..665 667..669 671..675 677..680 682..684 686..689 691 877..879 882..884 887..889 892..894 897..899 902..904 907..909 912..914 917..919}",
:ok
"#{0..49 400 448..449 451 453..454 456..459 461..464 466..469 471..474 476..484 486..489 491..494 496..499 501..537 539..561 563..583 585..606 608..617 619 624 629 634 639 644 647..649 651..654 656..659 661..665 667..669 671..675 677..680 682..684 686..689 691..694 696..713 715..737 739..758 760..783 785..804 806..829 831..848 850 877..879 882..884 887..889 892..894 897..899 902..904 907..909 912..914 917..919 922..924 927..938 940..942 944..956 958..960 962..975 977..979 981..993 995..998 1000..1012 1014..1016 1018..1030 1032..1035 1037..1039}",
:recovered-frac 7/65,
:unexpected-frac 0,
:unexpected "#{}",
:lost-frac 261/1040,
:ok-frac 53/104}
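For readers unfamiliar with Jepsen's set checker, the fields above can be read roughly as follows. This is a minimal Python sketch of the semantics (an assumption based on the general set-checker model: lost = acknowledged but absent from the final read, recovered = unacknowledged/indeterminate but present), not code from this test:

```python
# Hypothetical mini-version of a Jepsen-style set analysis. All names here
# are illustrative; the real checker is Clojure inside Jepsen itself.
from fractions import Fraction

def analyze(attempted, acknowledged, final_read):
    lost = acknowledged - final_read                      # confirmed writes that vanished
    ok = acknowledged & final_read                        # confirmed writes that survived
    recovered = (attempted - acknowledged) & final_read   # unconfirmed writes that showed up anyway
    n = len(attempted)
    return {
        "valid?": len(lost) == 0,
        "lost-frac": Fraction(len(lost), n),
        "ok-frac": Fraction(len(ok), n),
        "recovered-frac": Fraction(len(recovered), n),
    }

# Toy history: 10 attempted adds, 8 acknowledged, final read sees 0-5 and 8.
r = analyze(set(range(10)), set(range(8)), {0, 1, 2, 3, 4, 5, 8})
```

In the toy history, writes 6 and 7 were acknowledged but never read back, so the result is invalid with a lost fraction of 1/5.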
This test runs for 600 seconds: the network is stable for 10 seconds, then every node that considers itself a primary is isolated for 120 seconds, then the network heals and the cycle repeats. At the end, we give the cluster 20 seconds of guaranteed-healthy network, wait for the health endpoint to return green on all nodes, wait a while longer just in case, and perform a final read.
:generator (gen/phases
(->> (range)
(map (fn [x] {:type :invoke
:f :add
:value x}))
gen/seq
(gen/stagger 1/10)
(gen/delay 1)
(gen/nemesis
(gen/seq (cycle
[(gen/sleep 10)
{:type :info :f :start}
(gen/sleep 120)
{:type :info :f :stop}])))
(gen/time-limit 600))
(gen/nemesis
(gen/phases
(gen/once {:type :info :f :stop})
(gen/sleep 20)))
(gen/clients
(gen/once {:type :invoke :f :read})))}))
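As a sanity check on the schedule above, plain arithmetic (sketched in Python rather than the test's Clojure) shows how many partition/heal cycles fit in one run:

```python
# Each nemesis cycle in the generator above is 10 s of healthy network
# followed by a 120 s primary-isolating partition.
time_limit = 600    # gen/time-limit
healthy = 10        # gen/sleep before :start
partitioned = 120   # gen/sleep before :stop
cycle = healthy + partitioned

full_cycles = time_limit // cycle
print(full_cycles)  # → 4 complete partition/heal cycles per run
```

That matches the narrative below: n4, then n5, then n4, then one more cycle in which n2 and n4 disagree.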
In this test, we isolate [n4], then [n5], then [n4], and then something interesting happens: both n5 and n4 consider themselves primaries, even after 120 seconds of n4 being isolated. On the next cycle round, n2 and n4 both consider themselves primaries!
In an attempt to speed up fault detection (this network is virtual with negligible latency and no contention except during simulated partitions), I've lowered the ping timeouts to the order of 2-3 seconds--the cluster seems to detect failures within ~10 seconds normally, but sometimes gets stuck!
discovery.zen.ping.timeout: 3s
# After joining, send a join request with this timeout? Docs are kinda unclear
# here.
discovery.zen.join_timeout: 5s
# Fault detection timeouts. Our network is super reliable and uncongested
# except when we explicitly simulate failure, so I'm gonna lower these to speed
# up cluster convergence.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 1s
discovery.zen.fd.ping_retries: 2
# Publish cluster state updates more frequently
discovery.zen.publish_timeout: 5s
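A back-of-the-envelope estimate of what these fault-detection settings imply (this is a simplification, assuming retries fire sequentially and each waits a full ping_timeout, which glosses over zen fd's actual scheduling):

```python
# Rough worst-case time to declare a node dead under the config above.
ping_interval = 1.0   # discovery.zen.fd.ping_interval
ping_timeout = 1.0    # discovery.zen.fd.ping_timeout
ping_retries = 2      # discovery.zen.fd.ping_retries

worst_case = ping_retries * (ping_interval + ping_timeout)
print(worst_case)  # ≈ 4 seconds of consecutive ping failures
```

A few seconds of detection plus election and state publication is consistent with the ~10-second convergence observed above; it doesn't explain the runs that get stuck.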
I've uploaded the Jepsen history, set analysis, and elasticsearch logs (timestamps are perfectly coordinated--they're LXC containers) to https://aphyr.com/media/es-create-loss.tar.gz, if you'd like to take a look. :)
Thx @aphyr
> Elasticsearch can still lose inserted documents when a single primary node is isolated by a simple network partition
This might be #7572 , which is also documented here: http://www.elastic.co/guide/en/elasticsearch/resiliency/current/#_loss_of_documents_during_network_partition_status_ongoing
> Other times, the cluster seems to get really confused
This one feels like something else, but it will require some research. Can you share the exact commit of Jepsen you used? That will help understand what exactly happened. I'm also curious to see what effect @dakrone 's suggestion to use scan scroll will have.
Hi @bleskes! This is from Jepsen 4fdf509d82620ac59b4daf4e6a21f9495c73e68a--just cd elasticsearch; lein test. I've been breaking up the test suites in hopes of actually having Jepsen As A Library instead of a big monolithic hairball, haha. Slow going so far. ;-)
I've also switched to using scan scroll, but it doesn't seem to have any effect on observed data loss--I can reproduce it with both a size: n query and a scrolling one.
Hi @aphyr, I've been looking through the https://aphyr.com/media/es-create-loss.tar.gz logs you provided; however, it looks like the history.edn timestamps are nanoseconds relative to the time the test was started.
Could you add or modify the timestamps to be absolute instead of relative so we can correlate the logs from the ES instances to the Jepsen history?
Sorry, I've got a huge backlog at the moment and have to get slides prepped, but you can just flip https://github.com/aphyr/jepsen/blob/master/jepsen/src/jepsen/core.clj#L114 to use (System/currentTimeMillis) if you want unix timestamps. Be advised that Jepsen has facilities for lying to the database about what time it is--those aren't currently in play on the Elasticsearch tests, but I wouldn't rely on the log timestamps going forward. ;-)
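Alternatively, the existing history can be post-processed rather than re-run. A one-off sketch (in Python, not Jepsen's Clojure; the start time here is a hypothetical placeholder you'd take from the first ES log line or the test's recorded start):

```python
# Convert Jepsen's relative nanosecond timestamps to absolute unix
# milliseconds so they can be lined up with the Elasticsearch logs.
test_start_ms = 1_430_000_000_000  # hypothetical test start, unix ms

def to_unix_ms(relative_ns):
    return test_start_ms + relative_ns // 1_000_000

print(to_unix_ms(5_000_000_000))  # an event 5 seconds into the test
```

Per the caveat above, treat the log timestamps as a convenience for this one run, not something to rely on in general.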
A quick update on this one - we have put some time into reproducing this and it looks more and more like #7572 . I will post an update once this becomes definite (or if anything else comes up).
Closed as duplicate of #7572
This is tough for me to reliably reproduce, but I've seen it a half-dozen times and think it needs an issue. Elasticsearch can still lose inserted documents when a single primary node is isolated by a simple network partition. Nothing fancy, no overlapping partitions required. Sometimes it's only a few documents, from immediately following the partition:
Other times, the cluster seems to get really confused, and loses over half of the acknowledged records, distributed throughout the history:
The logs aren't particularly informative here, but I have noticed some interesting messages that seem correlated with data loss; for instance:
and
which I think is a hardcoded message.
Again, these are tough to reproduce--only 1 in 10 runs or so--so I'll keep working on finding a failure schedule that reliably triggers the fault.