TritonDataCenter / rfd

Requests for Discussion

RFD 103: Discussion #40

Open princesspretzel opened 7 years ago

princesspretzel commented 7 years ago

This issue represents an opportunity for discussion of RFD 103 Operationalize Resharding while it remains in a pre-published state.

chudley commented 7 years ago

Thanks for writing this up! I'm still working through the document to properly understand it, but I've got a few things I'd appreciate help understanding, as well as a few minor nits.

What decision making goes into the new vnode/pnode mapping? In the guide it looks like we pick the ones we do because they're the biggest. Is that representative of real usage?

Will there be tooling to make a new vnode/pnode mapping for us? I guess this is why the document has the section "Manual Resharding Step-By-Step", or will the automatic process require us to feed it with this new mapping?

And some nits:

It seems the word "shard" is used when referring to a new database in a cluster. My understanding of this in Manta is that we're actually working with a new peer in a shard, where the shard here refers to (for example) 1.moray. I don't think your usage here is wrong, as I guess it could technically be called a shard, but it might be easier for an operator of Manta to follow if we referred to them as peers. There might be some prior Manta documentation that refers to them as something else, however.

At a few points we refer back to "step 1", or where we deployed the new peer, but I think this doesn't happen until step 3.

In step 3, we document how to provision a new postgres instance by incrementing the number for the applicable shard. This document seems to be written for lab usage, or in a deployment where there is a single node. I think this is probably fine for now, but this exact procedure wouldn't be applicable to a production deployment as there would be only 1 postgres zone on each physical server, and incrementing this number would mean we're co-hosting postgres zones. In this case the operator would need to find a new home for the additional postgres zones, which today I think is done manually.
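As a sketch of what that investigation might look like (the field names passed to -o are my assumption based on manta-adm's listing columns, not something the RFD specifies):

```sh
# Sketch: see where the shard's postgres peers currently run before choosing
# a home for the new one. The -o field names here are assumptions; adjust to
# whatever columns manta-adm supports in your deployment.
manta-adm show -o service,shard,zonename,gz_host postgres
```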

In step 7, I think there's a typo at `sapiadm update <zone_uuid> metadata.SHARD=4`. Should this be 3 instead?
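For clarity, the corrected command would presumably read:

```sh
# Assumes the rest of the command in the guide is correct and only the shard
# number needs to change (4 -> 3).
sapiadm update <zone_uuid> metadata.SHARD=3
```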

In step 7, there are some links to specific source code lines on GitHub, but the links are to master. Can we expand these to their canonical URLs for when this guide is used in the future? The easiest way I've found to do this is to hit "y" on the appropriate page, which is a GitHub keybinding to expand the URL.

I don't think I understand the diagrams in step 12. My understanding of the guide is that we're creating a new Manta shard called "3.moray", but "2.moray" is in the diagram. I also see "SHARD 4" as a header. I think I understand what we're trying to convey, which is that each peer has 1.moray data and (soon to be) 3.moray data that is replicated across all peers, but we haven't split the data yet and the diagram is walking us through that process. I can't say I have a better idea here, but perhaps what's confusing me is the labels.

KodyKantor commented 7 years ago

Thanks for writing this up! I have a few questions, a couple comments, and a few formatting things to point out.

A couple typo/formatting things:

Since documentation is a popular topic now, here are a couple notes/questions:

Thanks again! This is interesting. I'm excited to see this process in action!

davepacheco commented 7 years ago

Thanks for writing this up! This is really great progress.

Does "WLSLT" mean "will look something like this"? Can we expand that, even if only the first time?

In step 2, I think we should call this a "canary directory", since it's not a file (or even an object), as far as I can tell. Instead of creating a canary directory, can we provide steps for identifying an existing directory under "/poseidon/stor" that's assigned to one of the vnodes that's being moved? The approach that's here now seems to assume that the operator will be moving whichever vnode the newly-created canary directory winds up on, but I think in practice we'll want to select the vnodes to move based on other factors (e.g., distribution of data on them). In that case, we don't have a way to just create a canary directory on one of those vnodes. Sorry if I'm misunderstanding something about this.
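As a sketch of the kind of selection I mean, run against the shard's primary (the manta table's _vnode and type columns, and the moray user/database names, are assumptions on my part):

```sh
# Sketch: pick an existing directory that already lives on one of the vnodes
# being moved, rather than creating a new canary directory and hoping it
# lands on the right vnode. Vnode numbers below are examples only.
psql -U moray moray -c "
    SELECT _key
      FROM manta
     WHERE _vnode IN (123, 456)   -- vnodes selected for the move
       AND type = 'directory'
     LIMIT 5;"
```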

I think we mentioned this in a Manta call: we'll want to create two async peers in step 4 so that when we split the shard in step 7, we don't remain down for writes for the duration of an entire sync rebuild. Then in step 7, we'll want to move both new peers to the new shard.

There are a few items I think it would be worth calling out explicitly in separate sections (i.e., outside the procedure itself):

I think Richard's point about provisioning the new async in step 3 is important. We likely won't be just bumping a counter of Postgres zones. But I think it's okay to leave some of these details out of this procedure, saying something like: "use a combination of manta-adm show -sj and manta-adm update to deploy new peers on the CNs where you want them to run. See the Manta Operator Guide for details about recommended placement of services."
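Roughly the workflow I have in mind (a sketch, following the usual manta-adm pattern from the operator guide):

```sh
# Sketch: dump the current deployment, hand-edit it to add the new postgres
# peers under the CNs where they should run, then apply the change.
manta-adm show -sj > /var/tmp/deployment.json

# (edit /var/tmp/deployment.json, bumping the postgres count for the target
#  shard under the chosen CN entries)

manta-adm update /var/tmp/deployment.json
```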

At the end of step 3, we talk about comparing the SENT and REPLAY values from manatee-adm pg-status. Should we just have people use the LAG field instead?

There seems to be some complexity and impact in step 4 around the fact that we need to modify fash behind electric-moray's back, which means we have to turn off electric-moray. But won't this bring down reads and writes for all of Manta? How can we mitigate that? It seems like we could do this one electric-moray zone at a time. Should we also consider having the remapping executed by a (new) electric-moray API call? Also, presumably we need to do this for all electric-moray instances in the zone, but the procedure doesn't mention that.
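Something like the following is what I have in mind for the one-zone-at-a-time approach (a sketch only: the SMF service pattern, the -o column name, and the remap step itself are all assumptions, since the RFD doesn't specify them):

```sh
# Roll through the electric-moray zones one at a time so the remaining zones
# keep serving reads and writes while each one is remapped.
for zone in $(manta-adm show -H -o zonename electric-moray); do
    # Temporarily stop this zone's electric-moray processes.
    manta-oneach -z "$zone" 'svcadm disable -st "*electric-moray*"'

    # ... remap the affected vnodes in this zone's fash ring here ...

    manta-oneach -z "$zone" 'svcadm enable -s "*electric-moray*"'
done
```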

In step 5, "zookeeper will just check" should be "Manatee will just check".

In step 7: I don't understand how we ensure that we have the right Manatee cluster state in the new shard. I didn't quite follow the steps involving changing the ZK state and then using state-backfill.

Later in step 7, we use a query to find all names on the old vnode number. I think that's going to be untenable on our production databases, since I think that will return millions of rows and potentially take quite a long time. Is it missing part of the WHERE clause where we select the key for our canary directory?
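For example, restricting the check to the canary's key (a sketch; the table and column names, and the key shown, are placeholders/assumptions):

```sh
# Sketch: verify only the canary entry instead of selecting every name on the
# old vnode. The key below is a placeholder for the canary directory's key.
psql -U moray moray -c "
    SELECT _key, _vnode
      FROM manta
     WHERE _vnode = 123   -- old vnode number (example)
       AND _key = '/<account_uuid>/stor/<canary-directory>';"
```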

Step 8 has a pretty big TODO.

In step 8, substep 5, it says we're downloading the new SAPI manifest into electric-moray, but then it reprovisions electric-moray onto the new image? I'm not sure what's supposed to be going on there. Is the database somehow built into the electric-moray image itself? I had thought it was either a delegated dataset or just a file that we download into the zone.

Similar to what was said earlier, in step 12, where it says "the primary in a new peer", I think that should be "the primary in a new cluster" or "the primary in a new shard".

In step 12, I think we're going to want to do the deletes in batches (e.g., with limits) to limit the length and size of transactions. I also wonder if we want to do these (and as many of the other queries as we can) through Moray.
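If the deletes do stay at the Postgres level, I'd picture something like this (a sketch; table/column names are assumptions, and doing it through Moray would presumably use its own limit options instead):

```sh
# Sketch: delete the moved rows in bounded batches so no single transaction
# grows too large; rerun until it reports "DELETE 0".
psql -U moray moray -c "
    DELETE FROM manta
     WHERE _id IN (
        SELECT _id FROM manta
         WHERE _vnode IN (123, 456)   -- vnodes that now live on the new shard
         LIMIT 1000);"
```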

It would be great to flesh out more how we're going to automate this. One of the major concerns I have is how this is going to work in a multi-DC environment. (I think this might greatly affect where the tools end up living and running.) Also, how we deal with failure of the tools affects where they store state (e.g., in memory, on the filesystem, in a Moray bucket in some other shard, etc.).

princesspretzel commented 7 years ago

Thank you all for your thoughtful feedback! I have pushed an update, which hopefully addresses many of these points. There were a few concerns that I will have to think through and test further in order to say something definitive.

@chudley, you are absolutely right about the diagram; it is much less clear than I had remembered it being, after some time away from it. I will rethink how I illustrate this process, but for now I have removed some of the labelling that might have been especially confusing.

@KodyKantor, I'm not sure whether the updated documentation on node-fash addresses your concerns about documentation for LevelDB, but please let me know if not!

@davepacheco, as we discussed, I need to reason about a multiple-electric-moray setup in greater detail, since I have been executing these steps in my lab environment. I will also update again with thoughts about your later points. Step 8's TODO is now a WIP, and every step after that is something I still have to get through once the last pieces of outstanding code (MANTA-3371, MANTA-3388) are merged.

Thank you again for all your feedback; I will be back with more updates.