Testing Storage Server mode for risk-free testing of experimental storage engines in a production environment.

satherton commented 4 years ago

An upcoming FDB feature called Testing Storage Server (TSS) mode enables risk-free performance and verification testing of an unproven storage engine in a live FDB cluster. It would be safe to use in a QA or production environment without any risk of data loss or incorrect behavior.

To enable TSS mode, the cluster is configured with a testing Storage Engine type and a count of processes to use for testing. The cluster will respond by excluding and removing some recruited SS processes and recruiting TSS processes to replace them (see Perpetual Wiggle design for more details on when/how storage servers will be excluded). For each recruited TSS process, the cluster will assign to it a normal Storage Server (SS) partner, which is also initially empty. Shards are moved to both the TSS and SS in the pair at the same time. Clients will be aware of the TSS processes, and any requests sent to a paired SS will also be sent to its TSS partner. Clients will still choose randomly from the same SS replicas for any given shard, and then only send the request to a TSS if the SS chosen happens to have a TSS partner. The SS response is returned to the user when available, and the TSS result is compared to it when available.
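The client-side behavior described above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the actual FDB client code: the names `StorageServer`, `read_with_tss`, and `mismatches` are hypothetical, and a real client would send both requests concurrently and return the SS response without waiting for the TSS.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class StorageServer:
    """Hypothetical stand-in for a storage server process (SS or TSS)."""
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    tss_partner: Optional["StorageServer"] = None  # set only on a paired SS

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


def read_with_tss(
    replicas: List[StorageServer],
    key: str,
    mismatches: List[Tuple[str, str, str, Optional[str], Optional[str]]],
    rng=random,
) -> Optional[str]:
    """Read from a randomly chosen SS replica; mirror the read to its TSS, if any."""
    # The replica choice is made among normal SS processes only.
    ss = rng.choice(replicas)
    ss_result = ss.read(key)

    # Only a paired SS causes a duplicate request to be sent to a TSS.
    tss = ss.tss_partner
    if tss is not None:
        tss_result = tss.read(key)
        if tss_result != ss_result:
            # The TSS result is only ever used to report a difference.
            mismatches.append((ss.name, tss.name, key, ss_result, tss_result))

    # The caller always receives the SS result, never the TSS result.
    return ss_result
```

In this sketch a misbehaving TSS can only add entries to `mismatches`; it has no way to change the value returned to the caller.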

This design results in the TSS having exactly the same workload as its paired SS while not being responsible for durability or availability. Specifically:

The TSS processes the same mutations in the same order as the SS.
The TSS serves the same read requests (from users or from Data Distribution) as the SS and from the same peers.
The TSS responses are never used by any peer other than to detect and report differences between it and the SS response.
The shard replicas held by TSS processes are in excess of the cluster’s configured replication factor, so the TSS shards are neither required for nor used to provide durability.

The safety and correctness of the TSS feature will be proven in simulation by intentionally injecting bad behavior (for example, randomly skipping mutations) into the Storage Engines that the TSS processes are using, to verify that this does not cause any incorrect cluster behavior.
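As a toy illustration of this kind of fault-injection test (not the actual FDB simulation framework), the sketch below wraps an in-memory engine so that it randomly drops mutations, feeds the same mutation stream to an SS and a TSS, and checks that user-visible reads stay correct while the discrepancies are detected. `MemoryEngine`, `FaultyEngine`, and `run_simulation` are hypothetical names introduced only for this example.

```python
import random
from typing import Dict, Optional, Tuple


class MemoryEngine:
    """Hypothetical well-behaved storage engine, used by the SS."""

    def __init__(self) -> None:
        self.kv: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.kv[key] = value

    def read(self, key: str) -> Optional[str]:
        return self.kv.get(key)


class FaultyEngine(MemoryEngine):
    """Engine under test that randomly skips mutations (injected bad behavior)."""

    def __init__(self, skip_probability: float, rng: random.Random) -> None:
        super().__init__()
        self.skip_probability = skip_probability
        self.rng = rng

    def apply(self, key: str, value: str) -> None:
        if self.rng.random() < self.skip_probability:
            return  # silently drop the mutation
        super().apply(key, value)


def run_simulation(
    num_mutations: int = 1000, skip_probability: float = 0.1, seed: int = 1
) -> Tuple[int, int]:
    """Return (wrong_user_reads, detected_mismatches) for one SS/TSS pair."""
    rng = random.Random(seed)
    ss_engine = MemoryEngine()
    tss_engine = FaultyEngine(skip_probability, rng)
    expected: Dict[str, str] = {}

    # Both servers in the pair receive the same mutations in the same order.
    for i in range(num_mutations):
        key, value = f"key{i % 50}", f"value{i}"
        expected[key] = value
        ss_engine.apply(key, value)
        tss_engine.apply(key, value)

    wrong_user_reads = 0
    detected_mismatches = 0
    for key, value in expected.items():
        user_result = ss_engine.read(key)  # users only ever see the SS result
        if user_result != value:
            wrong_user_reads += 1
        if tss_engine.read(key) != user_result:
            detected_mismatches += 1  # reported, never returned to the user
    return wrong_user_reads, detected_mismatches
```

Under these assumptions, `wrong_user_reads` should stay at zero regardless of `skip_probability`, since the faulty engine only ever backs the TSS.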

xumengpanda commented 4 years ago

The description above covers the correctness of the rollout. I think we also need to make sure the live cluster's performance does not degrade (too much) during storage server testing and rollout.

The simple solution above will obviously affect read throughput, so I think the improved solution is definitely better.

A simpler approach for the improved version could just pair the new storage engine's SS with an SS running the old engine, and "plug" it into the old SS's (~15) teams. Handling the dynamics (creation and removal) of an SS and its teams may involve tricky corner cases.

sfc-gh-satherton commented 3 years ago

I've updated this PR description with the final design of the TSS feature.

dongxinEric commented 3 years ago

A totally unscientific performance test of the TSS would be to compare not only the correctness of the data returned from an SS and its paired TSS, but also the latency of the requests as observed from the client side.

sfc-gh-satherton commented 3 years ago

@dongxinEric Indeed, performance comparison is very much in scope here. This is one of the main reasons that TSS+SS pairs start empty together. From that point forward, they will do all of the same work as close to identically as possible, so clients will be able to produce very useful metrics on how they behave, in addition to the cluster's storage-related metrics.
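As a rough illustration of the kind of client-side metric this enables, the sketch below compares latency percentiles for one SS/TSS pair. It assumes the client has already collected per-request latency samples for each side; `percentile` and `compare_pair` are hypothetical helpers, not part of any FDB API, and the nearest-rank percentile method is just one simple choice.

```python
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def compare_pair(
    ss_latencies: List[float], tss_latencies: List[float]
) -> Dict[str, float]:
    """Summarize p50/p99 latency for an SS and its TSS, plus the TSS/SS ratio."""
    report: Dict[str, float] = {}
    for p in (50, 99):
        ss_p = percentile(ss_latencies, p)
        tss_p = percentile(tss_latencies, p)
        report[f"p{p}_ss"] = ss_p
        report[f"p{p}_tss"] = tss_p
        # A ratio above 1.0 means the TSS (new engine) was slower at this percentile.
        report[f"p{p}_ratio"] = tss_p / ss_p if ss_p > 0 else float("inf")
    return report
```

Because both servers in a pair start empty and see the same requests, a ratio like this is comparing like with like, which is what makes such client-side numbers meaningful.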

apple / foundationdb

Testing Storage Server mode for risk-free testing of experimental storage engines in a production environment. #3155