Seagate / halon

High availability solution
Apache License 2.0

RFC for replicated log improvements #100

Closed 1468ca0b-2a64-4fb4-8e52-ea5806644b4c closed 5 years ago

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: klao-tweag

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: mboes

Close?

Random back-off: this is already implemented in consensus-paxos. Create ticket if not suitable somehow.
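For readers unfamiliar with the term, here is a minimal sketch of randomized exponential back-off. It is not the consensus-paxos implementation; the function name `backoff_delays` and the parameters `base` and `cap` are made up for illustration.

```python
import random

def backoff_delays(attempts, base=0.05, cap=2.0, rng=random):
    """Return one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt up to `cap`. The actual delay is
    drawn uniformly below that bound, so competing proposers are unlikely to
    retry in lockstep. All parameter values here are assumptions.
    """
    delays = []
    for attempt in range(attempts):
        bound = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, bound))
    return delays

# A seeded generator keeps the example reproducible.
delays = backoff_delays(8, rng=random.Random(0))
```

The randomization is the point: with a fixed delay, two proposers that collide once would tend to keep colliding on every retry.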

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: klao-tweag

> from the list so far, fixing the queuing situation with channels sounds like a good one to start with. What do you think?

We'd need to add channel support to the scheduler first. (Which is a simple and self-contained task I wouldn't mind working on.)

Otherwise, I don't think it's the most pressing issue. Especially with batching in place.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: klao-tweag

> All of the content so far is highly appropriate for Asana, there is no content yet that is appropriate for an RFC, so let's do that and not merge this.

Sure. You suggested compiling a list, and I agree that it's a good idea to have a complete overview of things that need to be done. But yeah, a different format, like an Asana tag, is probably much more appropriate for this. I'll take a look at how usable it is.

> Many of the issues you bring up are complexities that were introduced with leader leases, which you advocated we merge as-is. Reading your analyses, I feel we merged that too hastily. Especially given that batching was orthogonal (not completely, but as I said in that PR, it should be given the design you chose).

Two of the issues are specifically about leader leases, but both of them are simple and directly addressable. Other than that, leader leases made the "unclean state" issue slightly worse on the one hand, but on the other hand they also improved it (and solved other issues). Everything else is orthogonal to them.

I suggested merging and improving on it, and I still stand by this decision. I think it's a better model than having protracted code reviews, which make working on other improvements much harder.

> About using the mailbox as a queue. I disagree with your analysis. Serving as queues is precisely what mailboxes are for. The real issue is that it's not good for performance to use a single queue. That's where channels come in: they allow a process to have multiple mailboxes, i.e. multiple incoming queues. Channels are probably a good idea anyway: they provide stronger static guarantees.

I was commenting on the current use of process mailboxes, in their specific Cloud Haskell meaning. I agree that channels might be a good solution for this type of problem.

> About teleportation. I thought you agreed that it was needed for pipelining. I don't see a point in removing it only to add it back pretty much as-is once we actually do pipelining.

It was about legislatures. Without pipelining even legislatures are not needed, but we will need them if we want pipelining. Teleportation is not needed for pipelining.

> About random back-off, we did have that at some point. Has it been removed?

Sorry, I don't know. :) I looked at the history, and there was no change like that (no significant change at all) after the initial import. But before that...

> There are more issues than these in replicated-log and consensus-paxos. We should use Asana tags to easily find and group them with the ones you list.

Right, yes. Facundo pointed out a few.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: mboes

@klao-tweag from the list so far, fixing the queuing situation with channels sounds like a good one to start with. What do you think?

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: mboes

Thank you for compiling this list.

All of the issues listed here are already known, save perhaps for the bug that you mention at the beginning (though it rings a bell; it was probably part of the ParSci backlog, but all those tickets were lost). More crucially, RFCs are for setting forth and motivating a plan, which I don't see here. What I do see is good descriptions and analyses of most of these known issues. You should therefore make sure that there is an Asana ticket for each point that you make (as I said, many were lost), and copy/paste the description into each. They will invariably be more detailed and carefully worded than whatever is in Asana currently.

All of the content so far is highly appropriate for Asana, there is no content yet that is appropriate for an RFC, so let's do that and not merge this. :-1:

Many of the issues you bring up are complexities that were introduced with leader leases, which you advocated we merge as-is. Reading your analyses, I feel we merged that too hastily. Especially given that batching was orthogonal (not completely, but as I said in that PR, it should be given the design you chose).

About using the mailbox as a queue. I disagree with your analysis. Serving as queues is precisely what mailboxes are for. The real issue is that it's not good for performance to use a single queue. That's where channels come in: they allow a process to have multiple mailboxes, i.e. multiple incoming queues. Channels are probably a good idea anyway: they provide stronger static guarantees.
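To make the single-queue cost concrete, here is a toy model, not the Cloud Haskell API. The classes `Mailbox` and `Channels` and the message tags are hypothetical. It assumes that a selective receive on a single mailbox has to look past every queued message that does not match, while a per-kind channel only ever sees its own messages.

```python
from collections import deque

class Mailbox:
    """One untyped queue; a selective receive scans past non-matching messages."""
    def __init__(self):
        self.queue = deque()
        self.inspected = 0  # total messages looked at by receives

    def send(self, tag, payload):
        self.queue.append((tag, payload))

    def receive(self, tag):
        for i, (t, payload) in enumerate(self.queue):
            self.inspected += 1
            if t == tag:
                del self.queue[i]  # safe: we return immediately afterwards
                return payload
        return None

class Channels:
    """One queue per message kind; a receive only touches its own queue."""
    def __init__(self, tags):
        self.queues = {tag: deque() for tag in tags}
        self.inspected = 0

    def send(self, tag, payload):
        self.queues[tag].append(payload)

    def receive(self, tag):
        queue = self.queues[tag]
        if not queue:
            return None
        self.inspected += 1
        return queue.popleft()

def demo(endpoint):
    for i in range(1000):
        endpoint.send("request", i)  # backlog of client requests
    endpoint.send("ack", "ok")       # the protocol message needed right now
    return endpoint.receive("ack"), endpoint.inspected

mailbox_result = demo(Mailbox())                     # scans the whole backlog first
channel_result = demo(Channels(["request", "ack"]))  # goes straight to the ack
```

In this model the mailbox inspects 1001 messages to find the ack, and the channel version inspects one. The static-guarantee argument is separate: a typed channel carries one message type, which a Python sketch cannot show.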

About teleportation. I thought you agreed that it was needed for pipelining. I don't see a point in removing it only to add it back pretty much as-is once we actually do pipelining.

About random back-off, we did have that at some point. Has it been removed?

There are more issues than these in replicated-log and consensus-paxos. We should use Asana tags to easily find and group them with the ones you list.

The good news is that, all in all, I see nothing here so far that warrants a rewrite. I'm sure there is some amount of restructuring that ought to be done, even at constant feature set, but none of the listed points motivate that, except perhaps the "underspecified and unhelpful state".

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: facundominguez

This bug has apparently been addressed already but has no tests: https://app.asana.com/0/12314345447678/10803042979172

This bug still exists and, I think, is not listed: https://app.asana.com/0/12314345447678/10803042979170

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: klao-tweag

> Idempotent operations would also suffer because the operations might not happen one immediately after the other, but they may be interspersed with other operations.

Yes, correct, thanks for clarifying this! We need a precise way to talk about operations that must not be applied willy-nilly multiple times.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 10 years ago

Created by: facundominguez

> And we would be applying a (non-idempotent) operation twice when we

Idempotent operations would also suffer because the operations might not happen one immediately after the other, but they may be interspersed with other operations.
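A small illustration of the point, using a hypothetical state machine where the state is a dict and `set_x` is a made-up idempotent operation. Applying a duplicate back to back is harmless, but a duplicate that lands after another operation changes the outcome.

```python
def set_x(value):
    """Build an idempotent operation: applying it twice in a row equals applying it once."""
    def op(state):
        new_state = dict(state)
        new_state["x"] = value
        return new_state
    return op

def replay(ops, state=None):
    """Apply operations in log order and return the final state."""
    state = dict(state or {})
    for op in ops:
        state = op(state)
    return state

set_1, set_2 = set_x(1), set_x(2)

intended = replay([set_1, set_2])              # x ends as 2
back_to_back = replay([set_1, set_1, set_2])   # duplicate is harmless, x ends as 2
interleaved = replay([set_1, set_2, set_1])    # duplicate lands later, x ends as 1
```

So idempotence alone is not enough. The duplicate also has to be adjacent to the original, or it has to be detected and dropped.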