
RFD 174 Manta storage efficiency discussion #142

Open jjelinek opened 5 years ago

jjelinek commented 5 years ago

https://github.com/joyent/rfd/tree/master/rfd/0174

askfongjojo commented 5 years ago

Thank you for writing up this RFD. I just want to share what I know about some of the open questions about mako service discovery and TTL:

Is having the same IP all that is necessary?

Probably not. The storage endpoint resolution uses nameservice. This is what a storage record looks like in zk within the nameservice zone:

[zk: localhost:2181(CONNECTED) 5] get /us/joyent/staging/stor/3
{"type":"host","address":"172.27.5.12","ttl":7200,"host":{"address":"172.27.5.12"}}
...

In a failover situation, we can update the record to point to the new IP address of the passive server.
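
For illustration, a flip could then be as little as rewriting that record from within the nameservice zone (the new address below is made up):

[zk: localhost:2181(CONNECTED) 6] set /us/joyent/staging/stor/3 {"type":"host","address":"172.27.5.20","ttl":7200,"host":{"address":"172.27.5.20"}}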

Is there some way to configure DNS so that records have a short TTL and we quickly flip?

The TTL defaults to 1800 seconds and is written to the zk record as shown above. I'd defer to others who understand this area better to advise how to force muskie to expire the record and do a lookup again.

Is there some other approach, perhaps using cueball or other upstack services which we can use to quickly flip the mako over when the passive machine becomes the active machine?

We can probably model after metadata failover mechanism. It also involves updating the corresponding zk record in nameservice.

mgerdts commented 5 years ago

Because all of these machines must have some basic local storage and availability, we could run them off of mirrored disks for the local "zones" zpool.

This adds 2 disks of overhead on each machine. Because the iscsi target machines are able to serve multiple upstack active/passive storage servers, we can amortize the 2 disk overhead on each target and only count these 2 disks once.

This is 5.5% overhead (with 36 disks), or worse for systems with fewer disks. Perhaps it would be better to partition the disks and use a small amount (less than 5 GiB) from each and construct a pool across them. I don't understand the justification for setting WCE=disabled when not in whole disk mode, so that would need to be explored. It seems as though, if the consumers of each partition issue flushes at the times critical to each of them, WCE=enabled is safe.
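
A rough sketch of that alternative, assuming a small slice (s0, under 5 GiB) has been carved from four of the data disks (disk names are placeholders):

    # build the local 'zones' pool as a stripe of mirrors across small slices
    # instead of dedicating two whole disks to it
    zpool create zones \
        mirror c0t0d0s0 c0t1d0s0 \
        mirror c0t2d0s0 c0t3d0s0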

If it weren't for needing a place for dump files, I'd be inclined to persist networking and iscsi configuration (both of which should be quite static) as boot modules and give all local storage to iscsi LUNs.

mgerdts commented 5 years ago

Raw disks

If we want each iscsi target disk to map directly to a physical disk, then our current shrimps (repurposed as iscsi target boxes) have 36 disks, 34 of which could be used for iscsi target storage, leaving 2 which would be mirrored system disks (the 'zones' zpool). Having 34 disks is a bad number in a 20-wide configuration, so one option is to remove 14 disks and use them elsewhere.

Can you clarify what layout is problematic? The diagram makes it look like there will be 20 shrimp, each one mapping each disk to one stripe of a 17+3 raidz3 pool. If the shrimp have N data disks, I'd expect that there are N pools in a matching number of active servers.
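
For concreteness, the layout I'm reading from the diagram would amount to something like this on each active server (device paths below are placeholders for the 20 iscsi LUNs, one per shrimp):

    # sketch: one 17+3 raidz3 vdev built from 20 iscsi LUNs, one per shrimp
    luns=$(for i in $(seq -w 1 20); do echo /dev/dsk/shrimp${i}-lun0; done)
    zpool create manta raidz3 $luns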

mgerdts commented 5 years ago

ZVOLS

There are a number of things to consider with this approach, chief among them are ZFS on ZFS issues. Previous exploration of an architecture that would layer ZFS on ZFS with iscsi in the mix turned up a number of issues.

mgerdts commented 5 years ago

It is important to note that the active/passive storage zpool cannot be configured with a slog, since there is no good solution for preventing data loss from the slog in the event of a failover.

Could you not slice up SSDs in a couple boxes and mirror slogs across them?
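
Roughly, assuming two SSD-backed LUNs exported from different iscsi target boxes (device names are placeholders):

    # mirror the slog across SSD slices served by two separate target servers
    zpool add manta log mirror c0t600144F0AAAA0001d0 c0t600144F0BBBB0001d0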

mgerdts commented 5 years ago

We will have one machine which assumes the active role (it has the manta data zpool imported and is providing the mako storage zone service), and one machine in the passive role (it has iscsi configured to see all 20 disks, but does not have the zpool imported and the mako storage zone is not booted until the active machine fails).

It seems rather expensive to have 2N servers, with half of them as standbys. It would seem better to have a fewer number of standby servers with the ability to float iscsi initiators around the pool of servers. Such an architecture would probably need to be able to cope with more failures than expected by doubling up pools on servers, which has implications that may or may not be hard to deal with.

mgerdts commented 5 years ago

Mako storage Zone Manta Visibility

If the active/passive mako zones share the same IP, when we have a flip, how quickly will this propagate through the routing tables? Is there anything we could actively do to make it faster? Is having the same IP all that is necessary?

Routing tables shouldn't need to be updated, just ARP caches. I think that when an address is plumbed, a gratuitous ARP is broadcast. Presuming the systems that need to know about the change accept gratuitous ARPs and are not so busy that it gets dropped, the delay should be quite small.

By the time that the failover happens, TCP sessions may have backed off quite a bit. This could cause an extended delay (10s of seconds or worse?) as the healthy end of the connection will not know that it needs to reestablish the connection until it sends a packet that is received by the standby IP stack and that stack sends a RST because of the unknown connection. This problem exists in pretty much any failover scenario, regardless of IP failover strategy.
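
Back on the ARP point, the takeover-side IP move that should trigger the gratuitous ARP is just plumbing the address on the new active machine; a sketch, with the interface name made up:

    # on the machine taking over the active role; plumbing the address
    # should broadcast a gratuitous ARP
    ifconfig net0 addif 172.27.5.12/24 up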

mgerdts commented 5 years ago

Active/Passive Failover

If the active server loses its network connection, it will no longer be able to ack heartbeats, but it will also lose access to the iscsi target disks, so it should be possible for the passive machine to import the zpool.

Earlier you talked about unique networks (and maybe switches) for iscsi traffic. Failure of a heartbeat network does not necessarily imply iscsi is dead.

Failover is much easier when failure of one or more nodes leaves a machine capable of taking over that is part of a majority (quorum). Pairs of servers, each failing over to the other in the pair may be harder to get right than having all servers trying to form a quorum or having an odd number of management nodes that are responsible for maintaining a quorum and orchestrating management operations from a member of that quorum.

My gut says we should not be writing clustering software from scratch.

KodyKantor commented 5 years ago

@askfongjojo

Is having the same IP all that is necessary?

Probably not. The storage endpoint resolution uses nameservice. This is what a storage record looks like in zk within the nameservice zone:

[zk: localhost:2181(CONNECTED) 5] get /us/joyent/staging/stor/3
{"type":"host","address":"172.27.5.12","ttl":7200,"host":{"address":"172.27.5.12"}}
...

In a failover situation, we can update the record to point to the new IP address of the passive server.

Is there some way to configure DNS so that records have a short TTL and we quickly flip?

The TTL defaults to 1800 seconds and is written to the zk record as shown above. I'd defer to others who understand this area better to advise how to force muskie to expire the record and do a lookup again.

This is an open question. I think @bahamat mentioned that we have had experiences where small TTLs for storage instances caused pain in zookeeper. I would prefer this approach over updating ARP caches on switches due to the mysterious stale ARP cache bugs we've seen in production in the last few years.

As a note to my future self, the registrar README has a section describing how TTLs behave. Relying on TTLs in the way described in this RFD goes against one of the assumptions outlined in the README ('However, the TTLs on the "A" resolutions can be much longer, because it's almost unheard of for the IP address to change for a specific Triton or Manta zone').


From the RFD, in regard to dense shrimp (4u60+ boxes vs the 4u36 we use today):

No matter what, it seems clear that the "dense shrimp" model is not a good fit for an iscsi target in this proposed distributed storage world. The dense shrimp has so much storage that the 20 zvols would be very large and incur very long resilver times after a failure.

Could we allow one shrimp to be a target for multiple storage groups? For example, instead of creating 20 volumes (whether these are zvols or bare disks over iSCSI), create 60 volumes with 20 volumes each consumed by one storage group. This would mean a shrimp with 60 or more disks is a member of three storage groups, possibly with three different local raidz1 pools and 20 zvols on each pool.

Maybe this is a 'walk before you run' problem that needs to be solved though.
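
A sketch of what that three-group split could look like on a 60-disk shrimp (pool name, disk names, and zvol sizes are all placeholders):

    # one of three storage groups: a local raidz1 pool over 20 disks, carved
    # into 20 zvols that get exported as iscsi LUNs
    zpool create sg0 raidz1 $(for d in $(seq 0 19); do echo c0t${d}d0; done)
    for v in $(seq -w 0 19); do
        zfs create -V 8T sg0/vol$v
        stmfadm create-lu /dev/zvol/rdsk/sg0/vol$v
    done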


A relatively small open question is how we handle configuration for the mako service. I think we would need to duplicate the SAPI instance metadata for each of the active/passive pairs if we intend for the mako zones to look the same to the outside world. Either that or we would need to assign the same SAPI instance uuid to both of the active/passive instances so they look up the same metadata in SAPI.


ZVOLS

Do we need to worry about putting a pool on top of zvols when we think about object deletion activity?

For example, say we write 100G of data to the upper raidz3 pool. Now we delete 50G of data from the upper raidz3 pool. I imagine the capacity used by the sum of the raidz1 pools remains at 100G (5G each for the 20 raidz1 pools) because delete operations aren't immediately sent down to the underlying zvols, but the capacity used by the raidz3 pool has become 50G. If we keep writing and deleting data, would we run into a situation where the raidz1 pools fill up before the upper raidz3 pool has reached capacity?

TRIM can prevent us from hitting this problem, right? I expect we might see a similar problem today with KVM/bhyve machines running on top of the zvols we provide as data disks.
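
One way to watch for this divergence on a test rig (pool and zvol names are hypothetical):

    # on the mako (upper raidz3 pool): allocated space drops after deletes
    zpool list manta
    # on a shrimp: the backing zvol keeps referencing those blocks until they
    # are trimmed/unmapped from above
    zfs get used,referenced,volsize sg0/vol00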


@mgerdts

We will have one machine which assumes the active role (it has the manta data zpool imported and is providing the mako storage zone service), and one machine in the passive role (it has iscsi configured to see all 20 disks, but does not have the zpool imported and the mako storage zone is not booted until the active machine fails).

It seems rather expensive to have 2N servers, with half of them as standbys. It would seem better to have a fewer number of standby servers with the ability to float iscsi initiators around the pool of servers. Such an architecture would probably need to be able to cope with more failures than expected by doubling up pools on servers, which has implications that may or may not be hard to deal with.

Yeah, this does seem expensive. Assuming that each storage group has an active/passive pair, we need (num_storage_servers / 20 * 2) machines for active/passive pairs. That seems like a lot, but perhaps we can colocate these active/passive instances with the existing Manta metadata tier services.


An additional point of comparison with our current manta storage tier is that the basic building block is 2 shrimps in 2 different AZs (because we store 2 copies of an object). In this new approach, the basic building block is the storage group with 20 iscsi targets (or whatever raidz3 width is chosen instead), 20 active and 20 passive servers. All 60 of these machines should be close together on the same network.

One of the tradeoffs that you outline here is that we move from a very flexible design where storage nodes are independent of one another in almost every way to a design where storage nodes are very interconnected and may have specific network requirements. It would be good to hear from the operations team(s) to know if this would make pre-flight checks, DC expansions, etc. prohibitively difficult in the long term. If there are some anticipated problems maybe we can work out a solution or design change before we face the problem in reality.

jjelinek commented 5 years ago

Raw disks

If we want each iscsi target disk to map directly to a physical disk, then our current shrimps (repurposed as iscsi target boxes) have 36 disks, 34 of which could be used for iscsi target storage, leaving 2 which would be mirrored system disks (the 'zones' zpool). Having 34 disks is a bad number in a 20-wide configuration, so one option is to remove 14 disks and use them elsewhere.

Can you clarify what layout is problematic? The diagram makes it look like there will be 20 shrimp, each one mapping each disk to one stripe of a 17+3 raidz3 pool. If the shrimp have N data disks, I'd expect that there are N pools in a matching number of active servers.

Yes, I think that is correct. The thing that is problematic is having an iscsi target with 36 disks when we can only make use of the 20 data disks and the 2 system disks, leaving 14 disks we cannot use.

jjelinek commented 5 years ago

It is important to note that the active/passive storage zpool cannot be configured with a slog, since there is no good solution for preventing data loss from the slog in the event of a failover.

Could you not slice up SSDs in a couple boxes and mirror slogs across them?

Yes, that might be an option if the slog over iscsi performance is acceptable, along with the added cost. I'll make a note of this as an alternative to investigate.

jjelinek commented 5 years ago

We will have one machine which assumes the active role (it has the manta data zpool imported and is providing the mako storage zone service), and one machine in the passive role (it has iscsi configured to see all 20 disks, but does not have the zpool imported and the mako storage zone is not booted until the active machine fails).

It seems rather expensive to have 2N servers, with half of them as standbys. It would seem better to have a fewer number of standby servers with the ability to float iscsi initiators around the pool of servers. Such an architecture would probably need to be able to cope with more failures than expected by doubling up pools on servers, which has implications that may or may not be hard to deal with.

Having distinct pairs of active/passive machines is much easier to install, manage and reason about when it comes to failures. It is simple and clear to understand what happens when an active machine dies. Once we lose that pairing, there has to be some other HA thing that can manage the failover, configure the passive machine appropriately, etc. I think this should be considered under a complete storage unit cost analysis, although the solution you're proposing will add new failure modes and cost in other ways, along with additional time before the solution could be ready.

jjelinek commented 5 years ago

Active/Passive Failover

If the active server loses its network connection, it will no longer be able to ack heartbeats, but it will also lose access to the iscsi target disks, so it should be possible for the passive machine to import the zpool.

Earlier you talked about unique networks (and maybe switches) for iscsi traffic. Failure of a heartbeat network does not necessarily imply iscsi is dead.

Failover is much easier when failure of one or more nodes leaves a machine capable of taking over that is part of a majority (quorum). Pairs of servers, each failing over to the other in the pair may be harder to get right than having all servers trying to form a quorum or having an odd number of management nodes that are responsible for maintaining a quorum and orchestrating management operations from a member of that quorum.

My gut says we should not be writing clustering software from scratch.

Sorry I wasn't clear. There is no heartbeat network. My assumption is that the network communication within the storage group is all on the same vlan or switch which is isolated and separate from the network connections to the rest of Manta. Thus, when a primary cannot ack heartbeats because its network is dead, it also cannot talk to its iscsi targets. Likewise, when a passive machine cannot see heartbeat acks because its network is dead, it also cannot successfully import the zpool. I'll try to make that clearer in the RFD.

I do not understand your suggestion that we should use a more complex quorum solution vs active/passive. I'm not sure what that would be or how it would be better. I also do not know what "clustering software" you're referencing. Is that the whole concept presented in the RFD, the heartbeater, how the mako IP is flipped, something else?

askfongjojo commented 5 years ago

use a more complex quorum solution

I think what @mgerdts meant is that the 1:1 active/passive model may have the drawback of unnecessary churn when there is a single connectivity issue between them. Having more members to form a quorum can reduce the likelihood of that, but it's a more complex thing to do (cluster management). I think it's going to be more expensive too (more than one passive server per cluster?).

jjelinek commented 5 years ago

use a more complex quorum solution

I think what @mgerdts meant is that the 1:1 active/passive model may have the drawback of unnecessary churn when there is a single connectivity issue between them. Having more members to form a quorum can reduce the likelihood of that, but it's a more complex thing to do (cluster management). I think it's going to be more expensive too (more than one passive server per cluster?).

OK, I hope I addressed that by clarifying that a network issue for the heartbeat also implies a network issue for the iscsi traffic, so it depends on exactly what failed. I don't see the active/passive constantly flipping as a significant risk. Maybe I just can't think of a failure mode that could cause that. A bigger issue is the failure of the entire network within the storage group since that takes out 20 makos at once. I have that listed under the failure testing section but I will highlight it more in the layout discussion.

jjelinek commented 5 years ago

We will have one machine which assumes the active role (it has the manta data zpool imported and is providing the mako storage zone service), and one machine in the passive role (it has iscsi configured to see all 20 disks, but does not have the zpool imported and the mako storage zone is not booted until the active machine fails).

It seems rather expensive to have 2N servers, with half of them as standbys. It would seem better to have a fewer number of standby servers with the ability to float iscsi initiators around the pool of servers. Such an architecture would probably need to be able to cope with more failures than expected by doubling up pools on servers, which has implications that may or may not be hard to deal with.

Having distinct pairs of active/passive machines is much easier to install, manage and reason about when it comes to failures. It is simple and clear to understand what happens when an active machine dies. Once we lose that pairing, there has to be some other HA thing that can manage the failover, configure the passive machine appropriately, etc. I think this should be considered under a complete storage unit cost analysis, although the solution you're proposing will add new failure modes and cost in other ways, along with additional time before the solution could be ready.

Raw disks

If we want each iscsi target disk to map directly to a physical disk, then our current shrimps (repurposed as iscsi target boxes) have 36 disks, 34 of which could be used for iscsi target storage, leaving 2 which would be mirrored system disks (the 'zones' zpool). Having 34 disks is a bad number in a 20-wide configuration, so one option is to remove 14 disks and use them elsewhere.

Can you clarify what layout is problematic? The diagram makes it look like there will be 20 shrimp, each one mapping each disk to one stripe of a 17+3 raidz3 pool. If the shrimp have N data disks, I'd expect that there are N pools in a matching number of active servers.

Yes, I think that is correct. The thing that is problematic is having an iscsi target with 36 disks when we can only make use of the 20 data disks and the 2 system disks, leaving 14 disks we cannot use.

Just adding a note for posterity that I had a conversation with Mike on this and I now understand his point and my confusion. I'll be updating the RFD to explain things better, along the lines of what Mike is describing.

jjelinek commented 5 years ago

Closed by mistake.

jlevon commented 5 years ago

As you mention if a shrimp goes down all zpools will be resilvering. Do we need to be worried about all of this happening at once? Worth explicitly calling out in testing?

jlevon commented 5 years ago

How would we do rolling shrimp maintenance? Would this involve 20 "update; wait for resilver across the storage group;" cycles?

jlevon commented 5 years ago

With ZFS sitting on top of iSCSI we lose FMA in both directions (faults, blinkenlights etc). Would we need to consider a transport between the shrimps and the makos?

jlevon commented 5 years ago

Heartbeat: other systems have kept the heart-beat on disk on the basis that it's a closer reflection of the health of the system (it's not much use responding to ping if your storage stack is piled up on some iscsi CV). Worth consideration?

jlevon commented 5 years ago

MMP: admittedly I'm going off old slides, so I have no idea of the current status, but it sounds like this essentially snoops on the uberblock and such to decide on forced takeovers. And more importantly, there's no disk-level IO fencing on takeover. Given how catastrophic a dual import would now be, I'm wondering if it's worth considering a sideband way to properly fence off the storage.

jlevon commented 5 years ago

(To expand on that last comment a little: it's entirely feasible that a mako's iscsi stack gets completely stuck for multiple minutes, thus appearing to be dormant. Post takeover, we'd want something for the loser to lose all access to storage, in case it decides to unglue itself and start doing I/O again. AIUI from the MMP slides, there is no active post-import multihost detection.)

jlevon commented 5 years ago

Mako maintenance: just for clarification, we are stating that when we need to maintain makos, users can expect an outage of 1 minute (or whatever we end up with)? Presumably this is predominantly import time. Do we have numbers for what this looks like currently?

mgerdts commented 5 years ago

(To expand on that last comment a little: it's entirely feasible that a mako's iscsi stack gets completely stuck for multiple minutes, thus appearing to be dormant. Post takeover, we'd want something for the loser to lose all access to storage, in case it decides to unglue itself and start doing I/O again. AIUI from the MMP slides, there is no active post-import multihost detection.)

There seem to be several options here.

  • Have zpool import <pool> take a SCSI reservation on all pool disks such that anyone else that has visibility cannot write to them. Alternatively, this could be done by some third-party orchestrator.
  • On Mako CNs, have an iscsi initiator per pool. During a failover, orchestration will modify a host group or view on each iscsi target server to remove the per-pool initiator associated with the dead CN and add the per-pool initiator associated with the CN that is taking over. See stmfadm(1M).
  • Figure out a STONITH mechanism that allows sending (and verifying reception of) an NMI or cutting power prior to forced import.

FWIW, Veritas cluster uses a mechanism where a kernel module heartbeats on a dedicated (LLT - Low Latency Transport) network with a protocol called gab. If a node stops hearing from other members for a set number of seconds (30 by default, I think), the llt or gab module will call panic(). I believe there's a way where the master can also send a message over the LLT network to eject members.

We could build a similar mechanism using etcd leases and watches. Surely there's a way to do something quite similar with zookeeper.

  • Each Mako server would continually update a per-Mako-server lease (say, with duration 30s). If a Mako server can't update its lease in that time it would make the appropriate uadmin system call to induce a panic.
  • A daemon running on the iscsi target servers would watch for variables associated with the lease to become unset. That's a trigger to each iscsi target server that it should remove access from the initiators associated with the Mako server that just lost its lease and allow access by the failover server. Upon making this switch, each iscsi target server will update something in etcd to indicate that the transition is complete.
  • A daemon running on the Mako failover server (which may have one or more active Mako zones already) will see the failure of its peer. It will wait for indications from the iscsi target servers that their actions are complete and then initiate the forced import and zone bringup.
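
At the CLI level that could look roughly like the following; the key names, lease ID, host group, and initiator names here are all made up:

    # on each Mako server: hold a lease and keep a key alive under it
    etcdctl lease grant 30                    # prints a lease ID, e.g. 694d77aa
    etcdctl put --lease=694d77aa /mako/sg0/active mako-sg0-a
    etcdctl lease keep-alive 694d77aa         # if this stops, the key expires

    # on each iscsi target server: watch for the key to disappear, then swap
    # the host group membership so only the failover initiator has access
    etcdctl watch /mako/sg0/active
    stmfadm remove-hg-member -g sg0-hg iqn.2010-08.org.illumos:mako-sg0-a
    stmfadm add-hg-member -g sg0-hg iqn.2010-08.org.illumos:mako-sg0-b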

danmcd commented 5 years ago

Is there a typo in the paragraph starting with, "An additional point of comparison"? I think you meant to say 34, but 54 is in that paragraph.

rejohnst commented 5 years ago

With ZFS sitting on top of iSCSI we lose FMA in both directions (faults, blinkenlights etc). Would we need to consider a transport between the shrimps and the makos?

It's somewhat complicated by the fact that there are basically three separate illumos entities performing some form of FMA for disks. There's ZFS FMA, which specifically detects faults related to ZFS constructs (vdevs, pools). Then there's the scsi target driver, which produces FMA telemetry as part of handling SCSI transactions. And finally there's an fmd module that periodically polls the SMART status on disks to check for various conditions that indicate a bad disk (predicted failure asserted, overtemp, or drive POST failure). In fact it's not uncommon for a drive failure to result in two or three FMA faults as the different entities each independently diagnose the failure from their vantage point. On the makos, the latter two cases should continue to work as before. I honestly don't know how well the ZFS FMA stuff would work on the initiator hosts in this scenario.

What is also different from the current situation is that a drive failure in this scenario will result in fault events spread across multiple machines - potentially ZFS and SCSI diagnosed faults on the initiator hosts and SMART and SCSI faults diagnosed on the target host. So definitely some implications for OPS and for our monitoring software.

danmcd commented 5 years ago

Apologies if I don't have the Manta fundamentals down properly. If anything here is invalidated by me not understanding Manta, I withdraw. (Maybe pointers to how manta works?)

So after pass 1, I think I can't say MUCH more beyond what's been said above, except for two things.

First, I think you need two more pictures. The first would show the current situation (with its 38% efficiency) showing two datacenters, and each shrimp (and its matching mako server) having an evil twin in another DC^H^HAZ. And the second would show the new world order of having a shrimp, showing that it services N virtual mako servers, and how the disks are assigned to the different mako servers.

The second thing: An in-one-place breakdown of resilience and the costs incurred. If I were to guess:

Old way: Survives disk failures per raidz2. Survives shrimp failure by having whole other copy in the other AZ. High performing, simple to diagnose, and has fewer failure modes (good) but fewer failure points (bad).

New way: Survives disk failures AND shrimp failures per raidz3. Has no other-AZ backup, but one could be added. Adds iSCSI failure modes to existing ones, but gains more points of failure that need to fail to take down the virtualized mako server.

I'd be very curious to see the probabilities of failures (as a function of what things CAN fail and their cost) in both ways.

jjelinek commented 5 years ago

As you mention if a shrimp goes down all zpools will be resilvering. Do we need to be worried about all of this happening at once? Worth explicitly calling out in testing?

Yes, we do need to be sure this is fine. I'll add it to the testing section.

jjelinek commented 5 years ago

How would we do rolling shrimp maintenance? Would this involve 20 "update; wait for resilver across the storage group;" cycles?

I think that would be right. I'll add this to the maintenance section.

jjelinek commented 5 years ago

With ZFS sitting on top of iSCSI we lose FMA in both directions (faults, blinkenlights etc). Would we need to consider a transport between the shrimps and the makos?

This is a good point. I'll call it out as an explicit area for complete testing and potential future project work. I'm also wondering if any of the Nexenta iscsi work might have improved this situation? I need to explore that work fully.

jjelinek commented 5 years ago

Heartbeat: other systems have kept the heart-beat on disk on the basis that it's a closer reflection of the health of the system (it's not much use responding to ping if your storage stack is piled up on some iscsi CV). Worth consideration?

This is exactly how mmp works so we do have that heartbeat in the zpool already. It seems simpler to use a network connection heartbeat to trigger the attempt to import on a flip instead of just trying to import on the passive every few seconds.

jjelinek commented 5 years ago

MMP: admittedly I'm going off old slides, so I have no idea of the current status, but it sounds like this essentially snoops on the uberblock and such to decide on forced takeovers. And more importantly, there's no disk-level IO fencing on takeover. Given how catastrophic a dual import would now be, I'm wondering if it's worth considering a sideband way to properly fence off the storage.

So far I don't see this as necessary. mmp does the right thing here and suspends the zpool on the active machine if it hasn't been able to update the uberblock within the heartbeat window. I have tested this and it works as expected, allowing us to import on the old passive machine even though the old active still has the zpool imported. There are things we'll want to do to make this work seamlessly, but mmp makes sure there is no dual-writer problem.
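
In CLI terms (pool name hypothetical):

    # on the active: mmp heartbeats are enabled via the pool property
    zpool set multihost=on manta
    # on the passive, once the active's uberblock updates have stopped for the
    # full mmp window, a forced import is permitted
    zpool import -f manta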

jjelinek commented 5 years ago

(To expand on that last comment a little: it's entirely feasible that a mako's iscsi stack gets completely stuck for multiple minutes, thus appearing to be dormant. Post takeover, we'd want something for the loser to lose all access to storage, in case it decides to unglue itself and start doing I/O again. AIUI from the MMP slides, there is no active post-import multihost detection.)

No, mmp detects this if the IO comes back and suspends the zpool. We will probably want to reboot the old active machine so it starts fresh as a passive mako. I'll add more details in the RFD to make this clear.

jjelinek commented 5 years ago

Mako maintenance: just for clarification, we are stating that when we need to maintain makos, users can expect an outage of 1 minute (or whatever we end up with) ? Presumably this is predominantly import time. Do we have numbers for what this looks like currently?

We do not have numbers yet. The import time is longer, but on the order of 10 seconds (this is tunable). The zone boot and svc startup is also on the order of a few seconds. I am mostly worried about network propagation for the new mako IP or MAC address transition so muskie starts talking to the correct server.
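
For reference, the import delay mentioned above is bounded by the mmp multihost tunables; assuming the tunable names from the upstream MMP work carry over, the knobs would be set in /etc/system (values illustrative only):

    set zfs:zfs_multihost_interval = 1000
    set zfs:zfs_multihost_import_intervals = 10
    * a forced import waits roughly zfs_multihost_interval * zfs_multihost_import_intervals ms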

jjelinek commented 5 years ago

(To expand on that last comment a little: it's entirely feasible that a mako's iscsi stack gets completely stuck for multiple minutes, thus appearing to be dormant. Post takeover, we'd want something for the loser to lose all access to storage, in case it decides to unglue itself and start doing I/O again. AIUI from the MMP slides, there is no active post-import multihost detection.)

There seem to be several options here.

  • Have zpool import <pool> take a SCSI reservation on all pool disks such that anyone else that has visibility cannot write to them. Alternatively, this could be done by some third-party orchestrator.
  • On Mako CNs, have an iscsi initiator per pool. During a failover, orchestration will modify a host group or view on each iscsi target server to remove the per-pool initiator associated with the dead CN and add the per-pool initiator associated with the CN that is taking over. See stmfadm(1M).
  • Figure out a STONITH mechanism that allows sending (and verifying reception of) an NMI or cutting power prior to forced import.

FWIW, Veritas cluster uses a mechanism where a kernel module heartbeats on a dedicated (LLT - Low Latency Transport) network with a protocol called gab. If a node stops hearing from other members for a set number of seconds (30 by default, I think), the llt or gab module will call panic(). I believe there's a way where the master can also send a message over the LLT network to eject members.

We could build a similar mechanism using etcd leases and watches. Surely there's a way to do something quite similar with zookeeper.

  • Each Mako server would continually update a per-Mako-server lease (say, with duration 30s). If a Mako server can't update its lease in that time it would make the appropriate uadmin system call to induce a panic.
  • A daemon running on the iscsi target servers would watch for variables associated with the lease to become unset. That's a trigger to each iscsi target server that it should remove access from the initiators associated with Mako server that just lost its lease and allow access by the failover server. Upon making this switch, each iscsi target server will update something in etcd to indicate that the transition is complete.
  • A daemon running on the Mako failover server (which may have one or more active Mako zones already) will see the failure of its peer. It will wait for indications from the iscsi target servers that their actions are complete and then initiate the forced import and zone bringup.

So far, I haven't seen any reason for this added complexity since mmp is doing the "right thing" but if we find situations which need more handling, we'll have to revisit this.

jjelinek commented 5 years ago

Is there a typo in the paragraph starting with, "An additional point of comparison"? I think you meant to say 34, but 54 is in that paragraph.

The typo is actually in the previous paragraph which should say 17. I'll fix this.

jjelinek commented 5 years ago

Apologies if I don't have the Manta fundamentals down properly. If anything here is invalidated by me not understanding Manta, I withdraw. (Maybe pointers to how manta works?)

So after pass 1, I think I can't say MUCH more beyond what's been said above, except for two things.

First, I think you need two more pictures. The first would show the current situation (with its 38% efficiency) showing two datacenters, and each shrimp (and its matching mako server) having an evil twin in another DC^H^HAZ. And the second would show the new world order of having a shrimp, showing that it services N virtual mako servers, and how the disks are assigned to the different mako servers.

The second thing: An in-one-place breakdown of resilience and the costs incurred. If I were to guess:

Old way: Survives disk failures per raidz2. Survives shrimp failure by having whole other copy in the other AZ. High performing, simple to diagnose, and has fewer failure modes (good) but fewer failure points (bad).

New way: Survives disk failures AND shrimp failures per raidz3. Has no other-AZ backup, but one could be added. Adds iSCSI failure modes to existing ones, but gains more points of failure that need to fail to take down the virtualized mako server.

I'd be very curious to see the probabilities of failures (as a function of what things CAN fail and their cost) in both ways.

I'll work on a new section which details the various failure modes we have thought of so far and compares them to the old/new approach.

jlevon commented 5 years ago

Heartbeat: other systems have kept the heart-beat on disk on the basis that it's a closer reflection of the health of the system (it's not much use responding to ping if your storage stack is piled up on some iscsi CV). Worth consideration?

This is exactly how mmp works so we do have that heartbeat in the zpool already. It seems simpler to use a network connection heartbeat to trigger the attempt to import on a flip instead of just trying to import on the passive every few seconds.

This is probably fine if we tie the heartbeat into the storage stack somehow (i.e. we only beat if we're actively able to r/w storage in some manner). Otherwise we end up with the healthy-heartbeat, dead-I/O path situation.

jlevon commented 5 years ago

Re: MMP, I guess maybe it's bulletproof? I'm not sure quite how MMP could prevent corruption in this case - it has zero control over the recovered other node without STONITH or fencing - but perhaps the nature of how zfs writes work means that this is in fact totally safe and we can rely on this never breaking in the future.

(I think it has to be "never" as this is a massive-data-loss scenario, right?)

jjelinek commented 5 years ago

Heartbeat: other systems have kept the heart-beat on disk on the basis that it's a closer reflection of the health of the system (it's not much use responding to ping if your storage stack is piled up on some iscsi CV). Worth consideration?

This is exactly how mmp works so we do have that heartbeat in the zpool already. It seems simpler to use a network connection heartbeat to trigger the attempt to import on a flip instead of just trying to import on the passive every few seconds.

This is probably fine if we tie the heartbeat into the storage stack somehow (i.e. we only beat if we're actively able to r/w storage in some manner). Otherwise we end up with the healthy-heartbeat, dead-I/O path situation.

Yes, I think we want to make sure the heartbeat responder is periodically doing zfs status checks too, so that if the zpool is suspended we stop acking the heartbeat. I'll make this explicit.
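
A rough sketch of what the responder's check could be (pool name and the ack mechanism are placeholders):

    # only ack the heartbeat while the pool is still writable
    if [ "$(zpool list -H -o health manta)" = "ONLINE" ]; then
        echo ack        # respond to the heartbeat
    else
        exit 1          # go silent; let the passive side take over
    fi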

jjelinek commented 5 years ago

Re: MMP, I guess maybe it's bulletproof? I'm not sure quite how MMP could prevent corruption in this case - it has zero control over the recovered other node without STONITH or fencing - but perhaps the nature of how zfs writes work means that this is in fact totally safe and we can rely on this never breaking in the future.

(I think it has to be "never" as this is a massive-data-loss scenario, right?)

I'd hesitate to call any piece of SW bulletproof, but so far it seems to do exactly what it should. It's actually not that complex. If the txg sync can't write within the mmp timeout, it suspends the zpool in memory and the pool is no longer able to accept any write activity. This zpool state is now valid for a zpool forced import on another machine. If the txg's are being written when multihost is enabled, forced import is not allowed on another machine. Of course, if something goes around zfs and writes to the raw iscsi device, mmp cannot help us.

mgerdts commented 5 years ago

Re: MMP, I guess maybe it's bulletproof? I'm not sure quite how MMP could prevent corruption in this case - it has zero control over the recovered other node without STONITH or fencing - but perhaps the nature of how zfs writes work means that this is in fact totally safe and we can rely on this never breaking in the future. (I think it has to be "never" as this is a massive-data-loss scenario, right?)

I'd hesitate to call any piece of SW bulletproof, but so far it seems to do exactly what it should. It's actually not that complex. If the txg sync can't write within the mmp timeout, it suspends the zpool in memory and the pool is no longer able to accept any write activity. This zpool state is now valid for a zpool forced import on another machine. If the txg's are being written when multihost is enabled, forced import is not allowed on another machine. Of course, if something goes around zfs and writes to the raw iscsi device, mmp cannot help us.

Per mmp.c, once enough time has passed without being able to update uberblocks, zio_suspend(spa, NULL, ZIO_SUSPEND_MMP) is called. This makes it so that spa_suspended() returns B_TRUE. However, spa_suspended() seems to not be in every write path. I don't think new transactions will be added to a transaction group when it is suspended. I am not convinced that a txg that is already in the syncing context when a pause occurs will check spa_suspended() before issuing every write. Regardless of whether it does, it could be that a small number of writes were paused in the kernel outside of the zfs module's control. I've not looked at resilver, which could also be doing a lot of I/O without frequent calls to spa_suspended().

failmode=panic may help. As seen in zpool(1M):

     failmode=wait|continue|panic
             Controls the system behavior in the event of catastrophic pool
             failure.  This condition is typically a result of a loss of
             connectivity to the underlying storage device(s) or a failure of
             all devices within the pool.  The behavior of such an event is
             determined as follows:
             ...
             panic     Prints out a message to the console and generates a
                       system crash dump.

With that, when MMP detects trouble the panic should happen quickly enough that a syncing txg will be quickly interrupted and writes are prevented before the failover machine takes over. However, if a kernel-wide pause that triggered MMP to suspend the pool was longer than double the timeout, there may be a race between threads that are performing normal writes anywhere on any disk and the thread that is updating MMP stamps.
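
Concretely, that setting would be (pool name hypothetical):

    # a suspended pool then panics the box instead of lingering half-alive
    zpool set failmode=panic manta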

jjelinek commented 5 years ago

Re: MMP, I guess maybe it's bulletproof? I'm not sure quite how MMP could prevent corruption in this case - it has zero control over the recovered other node without STONITH or fencing - but perhaps the nature of how zfs writes work means that this is in fact totally safe and we can rely on this never breaking in the future. (I think it has to be "never" as this is a massive-data-loss scenario, right?)

I'd hesitate to call any piece of SW bulletproof, but so far it seems to do exactly what it should. It's actually not that complex. If the txg sync can't write within the mmp timeout, it suspends the zpool in memory and the pool is no longer able to accept any write activity. This zpool state is now valid for a zpool forced import on another machine. If the txg's are being written when multihost is enabled, forced import is not allowed on another machine. Of course, if something goes around zfs and writes to the raw iscsi device, mmp cannot help us.

Per mmp.c, once enough time has passed without being able to update uberblocks, zio_suspend(spa, NULL, ZIO_SUSPEND_MMP) is called. This makes it so that spa_suspended() returns B_TRUE. However, spa_suspended() seems to not be in every write path. I don't think new transactions will be added to a transaction group when it is suspended. I am not convinced that a txg that is already in the syncing context when a pause occurs will check spa_suspended() before issuing every write. Regardless of whether it does, it could be that a small number of writes were paused in the kernel outside of the zfs module's control. I've not looked at resilver, which could also be doing a lot of I/O without frequent calls to spa_suspended().

failmode=panic may help. As seen in zpool(1M):

     failmode=wait|continue|panic
             Controls the system behavior in the event of catastrophic pool
             failure.  This condition is typically a result of a loss of
             connectivity to the underlying storage device(s) or a failure of
             all devices within the pool.  The behavior of such an event is
             determined as follows:
             ...
             panic     Prints out a message to the console and generates a
                       system crash dump.

With that, when MMP detects trouble the panic should happen quickly enough that a syncing txg will be quickly interrupted and writes are prevented before the failover machine takes over. However, if a kernel-wide pause that triggered MMP to suspend the pool was longer than double the timeout, there may be a race between threads that are performing normal writes anywhere on any disk and the thread that is updating MMP stamps.

Setting up to panic like this sounds like a good idea. We can't export the zpool since it is suspended, so the only recovery (as zpool status even tells us) is to reboot. One thing in the comment above that I am confused about is the resilvering one. I am not sure how zfs would be able to initiate resilver writes outside of a txg. As far as I know, the only two write paths are txg sync and zil commit, but maybe there is something I am missing.

mgerdts commented 5 years ago

I am not sure how zfs would be able to initiate resilver writes outside of a txg. As far as I know, the only two write paths are txg sync and zil commit, but maybe there is something I am missing.

I know nothing about the resilver I/O path. There may be nothing special to worry about there at all.

ehocdet commented 5 years ago

Thanks for the RFD, it's very interesting.

  • Manta compute: will the temporary storage fit on the same zpool as the Manta data, or could it fit on the "zones" zpool?

  • (no) slog: to limit the possible impact of the absence of a slog, it could be interesting to consider logbias=throughput (a one-liner follows below).

  • iscsi? Using shared JBODs with a SAS target might be an alternative. It could fit well for this usage. We have been using such an infrastructure for 9 years with success: 8 active heads with 1 (manual) failover per group of JBODs (built on top of open-source Nexenta).
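
For the slog point, that would just be (dataset name hypothetical):

    # bias sync writes toward throughput since the pool has no slog
    zfs set logbias=throughput manta/mako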