RFD 103: Discussion - Githubissues

This issue represents an opportunity for discussion of RFD 103 Operationalize Resharding while it remains in a pre-published state.

Thanks for writing this up! I'm still working through the document to properly understand it, but I've got a few things I'd appreciate help understanding, as well as a few minor nits.

What decision making goes into the new vnode/pnode mapping? In the guide it looks like we pick the ones we do because they're the biggest. Is that representative of real usage?

Will there be tooling to make a new vnode/pnode mapping for us? I guess this is why the document has the section "Manual Resharding Step-By-Step", or will the automatic process require us to feed it with this new mapping?

And some nits:

It seems the use of the word "shard" is used when referring to a new database in a cluster. My understanding of this in Manta is that we're actually working with a new peer in a shard, where the shard here refers to (for example) 1.moray. I don't think your usage here is wrong as I guess it could technically be called a shard, but it might be easier for an operator of Manta to follow if we referred to them as peers. There might be some prior Manta documentation that refers to them as something else, however.

At a few points we refer back to "step 1", or where we deployed the new peer, but I think this doesn't happen until step 3.

In step 3, we document how to provision a new postgres instance by incrementing the number for the applicable shard. This document seems to be written for lab usage, or in a deployment where there is a single node. I think this is probably fine for now, but this exact procedure wouldn't be applicable to a production deployment as there would be only 1 postgres zone on each physical server, and incrementing this number would mean we're co-hosting postgres zones. In this case the operator would need to find a new home for the additional postgres zones, which today I think is done manually.

In step 7, I think there's a typo @ sapiadm update <zone_uuid> metadata.SHARD=4. Should this be 3 instead?

In step 7, there are some links to specific source code lines on GitHub, but the links are to master. Can we expand these to their canonical URLs for when this guide is used in the future? The easiest way I've found to do this is to hit "y" on the appropriate page, which is a GitHub keybinding to expand the URL.

I don't think I understand the diagrams in step 12. My understanding of the guide is that we're creating a new Manta shard called "3.moray", but "2.moray" is in the diagram. I also see "SHARD 4" as a header. I think I understand what we're trying to convey, which is that each peer has 1.moray data and (soon to be) 3.moray data that is replicated across all peers, but we haven't split the data yet and the diagram is walking us through that process. I can't say I have a better idea here, but perhaps what's confusing me is the labels.

Thanks for writing this up! I have a few questions, a couple comments, and a few formatting things to point out.

In the below quoted sentence, is the 'manta zone' the zone that's called 'manta0'? I've seen other docs call that the 'manta deployment zone' if that's the case.

We can find out which ones these are by running manta-shardadm list in our manta zone and noting[...]
In the 'Create a canary file' section, 'coal.joyent.us' snuck in there, and the rest of the document is referring to an emy machine.
Maybe instead of suggesting 'manta-adm show -sj, and copy that to a new file', we can suggest to do something like manta-adm show -sj > my_config.json. That way we can refer to my_config.json. It might help those sections require a little less thought on the part of the reader :).
The document mentions that ongoing PG dumps may interfere with resharding. I'm curious to know what might happen if a PG dump is taking place while this is being done. I thought a PG dump was mostly a read-only operation.
In the section 7 nameservice code blocks it's a little unclear which commands are being run inside the zkCli.sh shell and which are just normal bash commands.
When we use node-fash to remap vnodes, is 'new.pnode' actually the name of the key, or are we supposed to replace that with something?
In section 12 rather than immediately removing the old vnode mappings from the previous pnode that held them, should we advise doing something to temporarily 'hide' the old vnode mappings? That step seems very frightening, so it would be nice if we could provide a backout plan if something goes wrong here. I'm not sure how this would be possible, but it would be analogous to doing a mv <config.json> <config.json.bak> before overwriting a config.json file.

A couple typo/formatting things:

I think the rendering of this paragraph may be missing a space:

NOTE: I'm not so sure about this as a test with the current topology. Check that the _id field increments on the new shard rather than the old one. You can see this if you run this query on the primaries of the old and the new shard and make sure that the _id field is larger on the new shard:
Possible typo in the section 7 code block:

sapiadm update } [...]

Since documentation is a popular topic now, here are a couple notes/questions:

I'm curious how we decide how many vnodes to use. I'm guessing that a larger number means more even distribution of objects over pnodes, but I don't really understand where the 'sweet spot' is.
It may also be helpful to write up some documentation somewhere for the leveldb file. I really have no idea what is in those files. I just know it as a magical thing used to do consistent hashing. Maybe there is already documentation somewhere that we could point consumers to, but I couldn't find anything after doing a quick search in the electric-moray, node-fash repos, or operator guide.
If the manatee troubleshooting guide is out of date, we can create a ticket to update this. After a quick skim, it looks like a couple commands have had their output/arguments changed.

Thanks again! This is interesting. I'm excited to see this process in action!

Thanks for writing this up! This is really great progress.

Does "WLSLT" mean "will look something like this"? Can we expand that, even if only the first time?

In step 2, I think we should call this a "canary directory", since it's not a file (or even an object), as far as I can tell. Instead of creating a canary directory, can we provide steps for identifying an existing directory under "/poseidon/stor" that's assigned to one of the vnodes that's being moved? The approach that's here now seems to assume that the operator will be moving whichever vnode the newly-created canary directory winds up on, but I think in practice we'll want to select the vnodes to move based on other factors (e.g., distribution of data on them). In that case, we don't have a way to just create a canary directory on one of those vnodes. Sorry if I'm misunderstanding something about this.

I think we mentioned this in a Manta call, but I think we'll want to create two async peers in step 4 so that when we split the shard in step 7, we don't remain down for writes for the duration of an entire sync rebuild. Then in step 7, we'll want to move both new peers to the new shard.

There are a few items I think it would be call out explicitly in separate sections (i.e., outside the procedure itself):

details about the impact to service, including what's affected (presumably mostly writes, but also a few small read windows), for how long (i.e., during which steps), and how they will fail (e.g., time out? explicit error?). I think this was already on your mind because you mention it in the Automation section.
information about rollback options. There's obviously going to be a point of no return, but it would be good to make that clear and explain how to rollback this whole process if we choose to do that before that point.

I think Richard's point about provisioning the new async in step 3 is important. We likely won't be just bumping a counter of Postgres zones. But I think it's okay to leave some of these details out of this procedure, saying something like: "use a combination of manta-adm show -sj and manta-adm update to deploy new peers on the CNs where you want them to run. See the Manta Operator Guide for details about recommended placement of services."

At the end of step 3, we talk about comparing the SENT and REPLAY values from manatee-adm pg-status. Should we just have people use the LAG field instead?

There seems to be some complexity and impact in step 4 around the fact that we need to modify fash behind electric-moray's back, which means we have to turn off electric-moray. But won't this bring down reads and writes for all of Manta? How can we mitigate that? It seems like we could do this one e-moray zone at a time. Should we also consider making the remapping executed by a (new) electric moray API call? Also, presumably we need to do this for all e-moray instances in the zone, but the procedure doesn't mention that.

In step 5, "zookeeper will just check" should be "Manatee will just check".

In step 7: I don't understand how we ensure that we have the right Manatee cluster state in the new shard. I didn't quite follow the steps involving changing the ZK state and then using state-backfill.

Later in step 7, we use a query to find all names on the old vnode number. I think that's going to be untenable on our production databases, since I think that will return millions of rows and potentially take quite a long time. Is it missing part of the WHERE clause where we select the key for our canary directory?

Step 8 has a pretty big TODO.

In step 8, substep 5, it says we're downloading the new SAPI manifest into electric-moray, but then it reprovisions electric-moray onto the new image? I'm not sure what's supposed to be going on there. Is the database somehow built into the electric-moray image itself? I had thought it was either a delegated dataset or just a file that we download into the zone.

Similar to what was said earlier, in step 12, where it says "the primary in a new peer", I think that should be "the primary in a new cluster" or "the primary in a new shard".

In step 12, I think we're going to want to do the deletes in batches (e.g., with limits) to limit the length and size of transactions. I also wonder if we want to do these (and as many of the other queries as we can) through Moray.

It would be great to flesh out more how we're going to automate this. One of the major concerns I have is how this is going to work in a multi-DC environment. (I think this might greatly affect where the tools end up living and running.) Also: dealing with failure of the tools affects where they store state (e.g., in memory, on the filesystem, in a Moray bucket in some other shard, etc.).

Thank you all for your thoughtful feedback! I have pushed an update, which hopefully addresses many of these points. There were a few concerns that I will have to think through and test further in order to say something definitive. @chudley, you are absolutely right about the diagram, it is much less clear than I had remembered it being, after some time away from it. I will rethink how I illustrate this process, but for now I have removed some of the labelling that might have been especially confusing. @KodyKantor I'm not sure if the updated documentation on node-fash addresses your concerns about documentation for LevelDB, but please let me know if not! @davepacheco, as we discussed, I need to reason about a multiple-electric-moray setup in greater detail, since I have been executing these steps on my lab environment. I will also update again with thoughts about your later points. Step 8's TODO is now a WIP, and every step after that is something I still have to get through once the last pieces of outstanding code (MANTA-3371, MANTA-3388) are merged. Thank you again for all your feedback, I will be back with more updates.

TritonDataCenter / rfd

RFD 103: Discussion #40