Recovery from data centre failure in multi-dc not working

mikenorgate commented 4 years ago

I am doing some tests to see if antidote would be a suitable solution for us. I am running some tests with a setup based on this docker compose example https://github.com/AntidoteDB/docker-antidote/blob/master/compose-files/dc3n1/docker-compose.yml

I am testing the scenario where a single data centre becomes unavailable and I want to see what the recovery behaviour is. However I have found the upon restarting the failed DC it does not catch up with changes that it has missed.

Here is how I'm doing my test:

docker stop dc2n1
Increment a counter in dc1n1
docker start dc2n1

After restarting dc2n1 the following is output in the logs and the line Buffering txn in {{'antidote@dc1n1', is repeated forever

2020-04-24T13:05:00.728870+00:00 notice:
disk_log: repairing "/antidote-data/data/data_antidote/0--0.LOG" ...

2020-04-24T13:05:00.756855+00:00 notice:
disk_log: repairing "/antidote-data/data/data_antidote/730750818665451459101842416358141509827966271488--730750818665451459101842416358141509827966271488.LOG" ...

2020-04-24T13:05:01.790063+00:00 info:
Starting heartbeat sender timers
2020-04-24T13:05:01.791126+00:00 notice:
Forgetting DC {'antidote@dc3n1',
                  {1587,733399,89049}}
2020-04-24T13:05:01.791356+00:00 notice:
Forgetting DC {'antidote@dc1n1',
                  {1587,733359,875651}}
2020-04-24T13:05:01.791527+00:00 info:
Observing DC {'antidote@dc3n1',
                 {1587,733399,89049}}
2020-04-24T13:05:07.827192+00:00 info:
Observing DC {'antidote@dc1n1',
                 {1587,733359,875651}}
2020-04-24T13:05:10.434336+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434412+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434554+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434580+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434671+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434691+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:13.522205+00:00 info:
Connected to DC {'antidote@dc3n1',
                    {1587,733399,89049}}
2020-04-24T13:05:13.553140+00:00 info:
Waiting for DC {'antidote@dc1n1',
                   {1587,733359,875651}}
2020-04-24T13:05:14.473092+00:00 info:
Whoops, lost message. New is 42, last was 27. Asking the remote DC {{'antidote@dc1n1',
                                                                     {1587,
                                                                      733359,
                                                                      875651}},
                                                                    730750818665451459101842416358141509827966271488}
2020-04-24T13:05:14.474678+00:00 critical:
Received unexpected log_reader_resp messages in state normal Message []. State {inter_dc_sub_buf,
                                                                                normal,
                                                                                {{'antidote@dc2n1',
                                                                                  {1587,
                                                                                   733385,
                                                                                   215012}},
                                                                                 730750818665451459101842416358141509827966271488},
                                                                                init,
                                                                                {[],
                                                                                 []},
                                                                                true}
2020-04-24T13:05:14.553899+00:00 info:
Waiting for DC {'antidote@dc1n1',
                   {1587,733359,875651}}
2020-04-24T13:05:15.474089+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:15.554915+00:00 info:
Connected to DC {'antidote@dc1n1',
                    {1587,733359,875651}}
2020-04-24T13:05:15.615894+00:00 info:
Starting heartbeat sender timers
2020-04-24T13:05:15.616401+00:00 info:
    application: antidote
    started_at: 'antidote@dc2n1'
2020-04-24T13:05:15.686080+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686282+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686460+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686635+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686875+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686993+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:16.474943+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:17.476105+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:18.476976+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:19.477951+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:20.479036+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}

peterzeller commented 4 years ago

@dajenet This might be related to what you are working on. Can you check if you can reproduce this problem and whether it is fixed on your branch?

dajenet commented 4 years ago

Looks like the problem I encountered. I will take a look at it.

AntidoteDB / antidote

Recovery from data centre failure in multi-dc not working #420