erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0

Mnesia is unable to merge the schema for tables using external storage backends #7423

Status: Open. ieQu1 opened this issue 1 year ago.

ieQu1 commented 1 year ago

Describe the bug

mnesia_schema.erl contains this code: https://github.com/erlang/otp/blob/f820bad7bed34cc4365bb9cac56eaa84c9a5bddc/lib/mnesia/src/mnesia_schema.erl#L3667

This function is called during schema merging. However, it has no clause for external storage backends registered via mnesia:add_backend_type (e.g. https://github.com/aeternity/mnesia_rocksdb), which causes the following error:

(<0.2909.0>) call mnesia_schema:change_storage_type('emqx2@127.0.0.1',{ext,rocksdb_copies,mnesia_rocksdb},{cstruct,emqx_ee_schema_registry_protobuf_cache_tab,set,[],[],[],
         [{{rocksdb_copies,mnesia_rocksdb},['emqx1@127.0.0.1']}],
         0,read_write,false,[],[],false,protobuf_cache,
         [fingerprint,module,module_binary],
         [],[],[],
         {{1686330881043734966,-576460752303423474,1},'emqx1@127.0.0.1'},
         {{4,1},{'emqx1@127.0.0.1',{1687,264314,407434}}}}) ({mnesia_schema,
                                                              merge_storage_type,
                                                              5})

To Reproduce

  1. Add an external storage backend (e.g. via mnesia_rocksdb:register) in a cluster of two nodes (A and B).
  2. Create a table with this backend and ensure it has copies on both nodes.
  3. Trigger a schema merge. We did it by shutting down B, removing the remote table copy on the surviving node A, and restarting B, but there may be an easier way.
  4. Mnesia on B fails to start with this error.

Expected behavior

The schema is merged.

Affected versions

Probably all OTP versions that support third-party backends.

Additional context

dgud commented 10 months ago

I have trouble reproducing this with the instructions you gave. Can you write a test case for mnesia? The table types ext_ets and ext_dets are configured by default, so you can use them to remove the mnesia_rocksdb dependency.

ieQu1 commented 10 months ago

Hello,

I think we found a reliable way to reproduce it in our own test suite. Porting to the OTP test suite may take time, since I am not familiar with it, but I might attempt it.

The steps are:

dgud commented 8 months ago

I would still like to have a test case, or some code that I can run, that reproduces this.

IngelaAndin commented 7 months ago

ping @ieQu1

axpxp commented 1 month ago

@dgud I have a test case. How should I send it to you?

dgud commented 1 month ago

Post it here or add a gist. The test case should not use mnesia_rocksdb; I don't want to debug that.

axpxp commented 1 month ago

@dgud I have re-uploaded the Erlang/OTP 27 version. Please check it, thank you.

8045

Mikaka27 commented 1 month ago

I don't think I'm seeing the correct problem when running this:

mnesia_bug_stacktrace.txt

Please verify; I'm not familiar with rocksdb at all. Would it be possible to have a reproduction without mnesia_rocksdb?

ieQu1 commented 1 month ago

Hello,

Sorry for not answering; I was deep in other work. This problem is pretty rare (thankfully), so I don't know the precise conditions that trigger it. From the stacktrace I pinpointed the function with the missing clause, and I have a preliminary fix, but no reliable way to test it.

Maybe OTP experts can suggest which scenarios trigger the various types of schema merge. Edit: apparently the answer is right there: https://github.com/erlang/otp/issues/7423#issuecomment-1717553324

Mikaka27 commented 1 month ago

> Sorry for no answer, I was head deep in other stuff. This problem is pretty rare (thankfully), so I don't know the precise conditions to trigger it. I pinpointed the function with a missing clause from the stacktrace, and have a preliminary fix, but no reliable way to test it.
>
> Maybe OTP experts can suggest what scenarios can trigger various types of schema merge.

So you mean there is an error in the reproduction, and that's why I'm seeing the wrong stacktrace?

Otherwise, should this reproduction trigger the error, perhaps after running it a few times?

ieQu1 commented 1 month ago

Your stacktrace looks different. You have likely stumbled on a different issue, one that looks specific to mnesia_rocksdb:

{noproc,
    {gen_server, call,
        [mnesia_rocksdb_admin,
         {rdb, {get_ref, t}},
         infinity]}}

The missing process, mnesia_rocksdb_admin, doesn't look like a Mnesia process.

In our case, no backup (BUP) file was involved. The error happened after a regular node restart.

axpxp commented 1 month ago

> I don't think I'm seeing the correct problem when running this:
>
> mnesia_bug_stacktrace.txt
>
> Please verify, not familiar with rocksdb at all. Would it be possible to have a reproduction without mnesia_rocksdb?

After mnesia:add_backend_type(Alias, Module), if the table is empty, everything is fine, because Module:init_backend is called first. If the table contains data, however, calling mnesia:backup("bk.BUP"), then mnesia:install_fallback("bk.BUP"), then mnesia:start() triggers the bug, because Module:init_backend is not called.