basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0
3.93k stars 534 forks source link

Exchanges continually report as not_responsible #1011

Open martincox opened 4 years ago

martincox commented 4 years ago

2020-03-20 10:08:14.611 [info] <0.1670.0>@riak_kv_entropy_manager:maybe_exchange:986 Attempt to start exchange between {1096126227998177188652763624537212264741949407232,3} and 1141798154164767904846628775559596109106197299200 resulted in not_responsible

Must be the case that the exchange_queue in entropy_manager is being populated errorneously. I'll dig into this a bit more, doesn't seem like expected behaviour, and I've not noticed it in previous builds. Have you seen this befgore @martinsumner?

martinsumner commented 4 years ago

Is this during a cluster change operation, or ongoing in steady-state running?

martincox commented 4 years ago

Ongoing, after the cluster has been initialised and transfers have completed etc.

martinsumner commented 4 years ago

Hmm, looked back at the last time I tested with this form of AAE enabled - and I got this constantly in the background as well.

I've touched the code here, so let me have a look - this isn't right.

I have it in Riak 2.9.1, which version have you found it in?

martincox commented 4 years ago

Merged bet branches into a 3.1 and saw it there, so assumed it was something from my merge. But looked on 3.0 and it's there also.

martinsumner commented 4 years ago

Ah - my change was to add the log!

martincox commented 4 years ago

Hmmm, so it's possibly something that's been there for a while, but silently?

martinsumner commented 4 years ago

Perhaps. I may go back to when the log was added to make sure it occurs in that version. The log got introduced in 2.9.0

martinsumner commented 4 years ago

I have a working theory, that this is just wrong:

https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_entropy_manager.erl#L958-L964

an Exchange is {LocalIdx, RemoteIdx, IndexN} - and prune_exchanges/1 in some cases switches the local and remote index!!?? Why?

The RemoteIdx must be on another node (if target_n__val has been met by the cluster logic). So when this happens, and the RemoteIdx is used as a LocalIdx, that partition cannot be a primary when he hit this point - https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_entropy_manager.erl#L895.

I just don't understand what prune_exchanges/1 is trying to achieve. It has been there since the introduction of AAE.

martinsumner commented 4 years ago

The only theory is that prune_exchanges/1 is to make sure that each exchange happens only in 1 direction. So by screwing it by switching the indexes in the other direction all exchanges occur, but once not twice?

martinsumner commented 4 years ago

So I think that does make sense. With prune_exchanges/1 doing the index switcheroo whenever LocalIdx > RemoteIdx, then 50% of the exchanges result in not_responsible.

So Two nodes will generate the exchange between A and B for IndexN: one as {A, B, IndexN}, the other as {B, A, IndexN}. The first node will run the exchange for that combination, the second will not. So the exchange will happen, but only once per cycle.

If I remove prune_exchanges/1 (just make it lists:usort(Exchanges)) - then all exchanges result in ok - but for each cycle every exchange happens twice (once from each node involved in the exchange). So prune_exchanges/1 is deliberately hitting an unexpected response, to behave in an expected way.

So I think the not_responsible is safe to ignore for now. Whether the code should do this or not ... I'm not sure.

martincox commented 4 years ago

Yep, you're right there I think. It seems like it is deliberately switched to prevent us from uneccessarily doing the same exchange twice, instigated from both the owner and the fallback, which makes sense. Just seems a crappy way of going about it - unless there's more to it that I'm not seeing.

Changing the list comp in prune_exchanges produces the same thing but without the incorrect exchanges. Works as expected, with no not_responsible exchanges being handled.

L = [Exchange || {A, B, _IndexN} = Exchange <- Exchanges, A < B]

I'll make this change PR anyway and let it sit for review for a bit / mull it over.