AntidoteDB / antidote

A planet scale, highly available, transactional database built on CRDT technology
https://www.antidotedb.eu
Apache License 2.0

Problems with inter_dc when the first address in the list leads to connection timeout. #470

Open define-null opened 2 years ago

define-null commented 2 years ago

While experimenting with a multi-DC setup on a Mac, I ran into the following problems:

1) Antidote picks up every inet interface on the machine when it obtains its addresses via inter_dc_pub:getting_addresses/1. On a Mac this includes the addresses of the special utun interfaces, which are VPN virtual interfaces. Connecting to the port on such an address from inter_dc_sub:add_dc then ends in a zeromq connection timeout.
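To illustrate the first problem, here is a minimal, language-neutral sketch in Python (Antidote itself is written in Erlang, so this is not its code). It assumes the interfaces arrive as `(name, address)` pairs in the order the operating system reports them. The function name `candidate_addresses`, the `skip_prefixes` parameter, and the sample addresses are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical input shape: (interface name, IPv4 address) pairs, in the
# order the operating system reports them.
Interface = Tuple[str, str]


def candidate_addresses(
    interfaces: Iterable[Interface],
    skip_prefixes: Tuple[str, ...] = ("utun",),
) -> List[str]:
    """Return the addresses in their original order, dropping every interface
    whose name starts with one of skip_prefixes (e.g. macOS VPN tunnels)."""
    return [
        addr
        for name, addr in interfaces
        if not name.startswith(skip_prefixes)
    ]


# Made-up example data: the VPN tunnel happens to be listed first.
interfaces = [
    ("utun0", "10.8.0.2"),    # VPN virtual interface, not reachable by the peer
    ("en0", "192.168.1.20"),  # regular LAN interface
    ("lo0", "127.0.0.1"),
]

print(candidate_addresses(interfaces))  # prints ['192.168.1.20', '127.0.0.1']
```

Without the filter, the utun address would be the first entry in the list and therefore the first one a remote DC tries.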

2) The timeout for the gen_server:call in inter_dc_sub:add_dc and the zeromq connection timeout are both set to 5 seconds. So even when the list contains another address to try, inter_dc_sub fails on the first faulty one, because the call has already timed out by the time the first connection attempt gives up. The retry logic in inter_dc_manager:connect_nodes, in turn, does not determine the reason for the failure; it simply retries the same node with the same address list, so the first faulty address is picked again and the same failure follows.
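The interaction of the two timeouts can be modelled with a small simulation. This is a simplified sketch in Python, not Antidote code. It assumes that a faulty address costs one full connection timeout, that a successful connection is instantaneous, and that the caller gives up as soon as the elapsed time reaches the call timeout. The function name and the sample addresses are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def attempts_before_deadline(
    addresses: List[str],
    is_reachable: Callable[[str], bool],
    connect_timeout: float,
    call_timeout: float,
) -> Tuple[Optional[str], List[str]]:
    """Return (connected address or None, addresses actually attempted)."""
    elapsed = 0.0
    attempted: List[str] = []
    for addr in addresses:
        if elapsed >= call_timeout:
            break  # the caller has already given up on the whole call
        attempted.append(addr)
        if is_reachable(addr):
            return addr, attempted
        elapsed += connect_timeout  # a faulty address burns a full timeout
    return None, attempted


addresses = ["10.8.0.2", "192.168.1.20"]        # faulty address listed first
reachable = lambda addr: addr == "192.168.1.20"

# Both timeouts at 5 seconds, as described above: the second address is never tried.
print(attempts_before_deadline(addresses, reachable, 5.0, 5.0))
# prints (None, ['10.8.0.2'])

# A shorter per-address timeout leaves room for the second attempt.
print(attempts_before_deadline(addresses, reachable, 2.0, 5.0))
# prints ('192.168.1.20', ['10.8.0.2', '192.168.1.20'])
```

Under these assumptions, the call timeout has to exceed the sum of the connection timeouts of all addresses that may need to be tried before the working one.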

The first issue should be easy to fix by making it possible to configure the addresses Antidote should listen on. The second one calls for a more careful decision about how to handle reconnects.
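One possible direction for the reconnect handling, sketched in Python purely as an illustration: remember which addresses have already timed out and move them to the back of the list before the next retry. This is an assumption about what a fix could look like, not the maintainers' approach. `reorder_after_failure`, `connect_with_retries`, `try_connect`, and `max_retries` are hypothetical names, and the model assumes each retry only gets as far as the first address in the list, as in the behaviour reported above.

```python
from typing import Callable, List, Optional, Set


def reorder_after_failure(addresses: List[str], failed: Set[str]) -> List[str]:
    """Keep relative order, but move addresses that already failed to the back."""
    fresh = [a for a in addresses if a not in failed]
    stale = [a for a in addresses if a in failed]
    return fresh + stale


def connect_with_retries(
    addresses: List[str],
    try_connect: Callable[[str], bool],
    max_retries: int = 3,
) -> Optional[str]:
    """Try the first address on each retry, demoting addresses that failed."""
    failed: Set[str] = set()
    order = list(addresses)
    for _ in range(max_retries):
        if not order:
            return None
        first = order[0]
        if try_connect(first):
            return first
        failed.add(first)  # remember the failure instead of repeating it
        order = reorder_after_failure(order, failed)
    return None


addresses = ["10.8.0.2", "192.168.1.20"]
print(connect_with_retries(addresses, lambda a: a == "192.168.1.20"))
# prints 192.168.1.20
```

With an unchanged list, every retry would hit `10.8.0.2` again; with the reordering, the second retry reaches the working address.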