emqx / mria

Asynchronously replicated Mnesia-like database for Erlang/Elixir
https://www.emqx.com
Apache License 2.0

fix(bootstrap): wait core tables are ready before copying #183

Closed: keynslug closed this 18 hours ago

keynslug commented 3 days ago

Under specific circumstances, mria_mnesia:copy_table/2 may fail with the error {system_limit, '$mria_rlog_sync', {Node, none_active}}, which crashes the node.

Consider the following scenario:

  1. Node N1 starts up and bootstraps Mria.
  2. Node N2 starts up and bootstraps Mria.
  3. Node N2 joins the cluster consisting of node N1.
  4. Node N2 runs mria_mnesia:join_cluster/1 and starts Mria again.
  5. At that exact moment, node N1 restarts for some reason.
  6. During bootstrap, node N2 tries to copy the $mria_rlog_sync table.
  7. Mnesia finds no active replica to copy from (hence none_active) and aborts the operation.
  8. Mria fails to start.
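The fix named in the title (wait until core tables are ready, then copy) can be sketched with plain Mnesia calls. This is a minimal sketch, not Mria's actual code: the module wait_then_copy and the function copy_when_ready/4 are hypothetical. It assumes that mnesia:wait_for_tables/2 blocks until the table has somewhere to be read from, which is exactly what was missing in step 7, and that mnesia:add_table_copy/3 is an acceptable stand-in for the copy performed during bootstrap.

```erlang
-module(wait_then_copy).
-export([copy_when_ready/4]).

%% Hypothetical helper, not Mria's implementation.
%% Step 1: block (up to Timeout ms) until Tab is readable, i.e. some
%% replica is active, instead of copying immediately and aborting.
%% Step 2: add a replica of Tab of the given Type on Node.
copy_when_ready(Tab, Node, Type, Timeout) ->
    case mnesia:wait_for_tables([Tab], Timeout) of
        ok ->
            case mnesia:add_table_copy(Tab, Node, Type) of
                {atomic, ok} ->
                    ok;
                %% A replica already exists on Node: nothing left to do.
                {aborted, {already_exists, Tab, Node}} ->
                    ok;
                {aborted, Reason} ->
                    {error, Reason}
            end;
        %% No replica became active in time: report instead of crashing,
        %% so the caller can decide whether to retry.
        {timeout, BadTabs} ->
            {error, {timeout, BadTabs}};
        {error, Reason} ->
            {error, Reason}
    end.
```

With this ordering, a concurrent restart of N1 turns step 7 into a bounded wait rather than an immediate abort; the timeout keeps a permanently unavailable table from hanging the bootstrap forever.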

While unlikely, this can happen in practice when an operator performs unusual maintenance operations, e.g. requests a version upgrade and scales the cluster up at the same time.


Fixes EMQX-13309.