
Chorio - Datalayer as a service

thoughts on scaling out the router part in a multitenant environment #1

Open 0xgeert opened 11 years ago

0xgeert commented 11 years ago

Heroku is now facing problems with this.

They do random routing instead of least-connection routing, because least-connection routing would not scale.

So why would that be the case? Probably because of the CAP theorem: you have to make trade-offs between accuracy and scalability.

A good discussion on HN; a link with part of that discussion is here: https://news.ycombinator.com/item?id=5490690

0xgeert commented 11 years ago

C - Consistency: nodes agree on the system's state.
A - Availability: the system accepts requests.
P - Partition tolerance: the system runs even when the network delays or drops some messages.

Explained with a metaphor; a nice read:

http://ksat.me/a-plain-english-introduction-to-cap-theorem/

0xgeert commented 11 years ago

Great piece:

http://aphyr.com/posts/277-timelike-a-network-simulator
http://aphyr.com/posts/278-timelike-2-everything-fails-all-the-time

Discussion: http://www.reddit.com/r/programming/comments/18tox4/timelike_a_parallel_system_network_simulator/

BASICALLY BUILDS CP-ROUTERS (no state communicated) OVER AP PARTITIONS OF WORKERS.

GREAT ON TESTING NODES WITH FAILURES AND SEEING WHAT HAPPENS. THIS IS USED AS A BACKDROP TO EXPLAIN WHY DISTRIBUTED SYSTEMS ARE HARD TO BUILD.

GREAT STARTING POINT FOR STRESS TESTING. (IN OUR SITUATION A DYNO IS ONE NODE PROCESS; MULTIPLE NODE PROCESSES MAY RUN ON ONE BOX.)

0xgeert commented 11 years ago

Referenced in the comments above:

http://brooker.co.za/blog/2012/01/17/two-random.html

About how eventually consistent state (cached least-connection counts) leads to herding behavior: load balancers flock/herd requests to the node with the least amount of work for much longer than it takes that node to become fully saturated, resulting in a busy -> free -> busy pattern. Not good.

BTW: this applies when sharing state between load balancers. The Timelike 2 piece doesn't do state sharing at all, so there is no herding there.

Instead it proposes (also possible with cached state data, i.e. connections per node) to choose 2 random nodes and assign the request to the node with the smallest number of connections. This approach keeps working better than purely random routing for a long time (offset against the staleness of the data); see the sketch below.

REMEMBER: TOTALLY DIFFERENT APPROACH FROM THE COMMENT ABOVE. THE COMMENT ABOVE IS STATELESS LOAD BALANCING; THIS COMMENT IS (CACHED) STATE-AWARE LOAD BALANCING.
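A minimal sketch of the two-random-choices idea, assuming each balancer keeps a locally cached (and possibly stale) connection count per node; the names (NodeStats, pickNode) are made up for illustration, not part of any real API:

```ts
// "Choose two random nodes, send the request to the less loaded one",
// working off a possibly stale, locally cached connection count per node.

interface NodeStats {
  id: string;
  cachedConnections: number; // refreshed asynchronously, may be stale
}

function pickNode(nodes: NodeStats[]): NodeStats {
  if (nodes.length === 0) throw new Error("no nodes available");
  if (nodes.length === 1) return nodes[0];

  // Pick two distinct random indices.
  const i = Math.floor(Math.random() * nodes.length);
  let j = Math.floor(Math.random() * (nodes.length - 1));
  if (j >= i) j += 1;

  const a = nodes[i];
  const b = nodes[j];

  // Because only two candidates are compared, stale counts cause far less
  // herding than always picking the global minimum.
  return a.cachedConnections <= b.cachedConnections ? a : b;
}
```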

0xgeert commented 11 years ago

Least-connection LBs might do better in our case:

From http://aphyr.com/posts/277-timelike-a-network-simulator: "An optimal algorithm would predict the future and figure out how long the request will take before allocating it, so it could avoid stacking two long-running requests in the same queue."

SO LET'S DO THAT :)

Since all requests are done through a DSL, PERHAPS better estimates can be given beforehand of how costly a particular request is, based on its DSL footprint.

Then a request could be divided into a number of 'ticks' indicating its complexity. (E.g. each 'tick' assumes 25ms of processing time, and a hard request may be 10 ticks = 250ms.)

This allows 'least tick' LBs; see the sketch below.
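A rough sketch of what a least-tick balancer could look like. The tick table and the estimateTicks heuristic are invented for illustration; the real estimates would come from the DSL footprint and the offline measurements described below.

```ts
// Estimate request cost in ticks up front (1 tick ~ 25ms of assumed
// processing time) and assign the request to the worker with the fewest
// outstanding ticks. All names and numbers here are illustrative.

type RequestKind = "get" | "query" | "aggregate";

interface DslRequest {
  kind: RequestKind;
  docSizeKb: number;
}

interface Worker {
  id: string;
  outstandingTicks: number;
}

// Hypothetical base cost per request type.
const BASE_TICKS: Record<RequestKind, number> = {
  get: 1,        // ~25ms
  query: 4,      // ~100ms
  aggregate: 10, // ~250ms
};

// Hypothetical heuristic: base cost plus a penalty for document size.
function estimateTicks(req: DslRequest): number {
  return BASE_TICKS[req.kind] + Math.ceil(req.docSizeKb / 50);
}

// Least-tick assignment: the worker with the least outstanding work,
// measured in ticks rather than raw connection counts, gets the request.
function assign(req: DslRequest, workers: Worker[]): Worker {
  const ticks = estimateTicks(req);
  const target = workers.reduce((least, w) =>
    w.outstandingTicks < least.outstandingTicks ? w : least);
  target.outstandingTicks += ticks;
  return target;
}
```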

THIS COULD POTENTIALLY EVEN BE A PRICING MODEL, SINCE REQUEST TYPE * DOC SIZE MAY IMPLY TICKS.

A calculator could then exist for consumers to do some math on cost, given request types and doc size. This would all be based on heuristics, and the system would feed itself: each request would actually be measured and stored to some offline processing DB (Hadoop or something) to measure the actual cost in ticks, aggregated over requests.

Of course, a particular user would not be billed on the actual cost of its requests in ticks but on the estimated cost in ticks. This smooths out any discrepancies due to noisy neighbors, etc. (although we're minimizing the impact of that, see XXX).
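A hypothetical sketch of that billing loop under the same assumptions: bill on estimated ticks, log measured ticks for offline recalibration of the heuristics. The price per tick and all names are made up.

```ts
// Bill on estimated ticks, but record measured ticks so the estimation
// heuristics can be recalibrated offline. Price per tick is a made-up number.

const DOLLARS_PER_TICK = 0.00001;

interface TickRecord {
  tenantId: string;
  estimatedTicks: number; // what the tenant is billed on
  measuredTicks: number;  // what the request actually cost
}

// Stand-in for shipping records to an offline store (Hadoop or similar).
const offlineLog: TickRecord[] = [];

function recordRequest(rec: TickRecord): number {
  offlineLog.push(rec);                          // feeds offline recalibration
  return rec.estimatedTicks * DOLLARS_PER_TICK;  // billing ignores measuredTicks
}
```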

0xgeert commented 11 years ago

ON WHY SHARING STATE BETWEEN ALL LBs IS LIKELY IMPOSSIBLE (A COMMENT BY THE AUTHOR ON http://aphyr.com/posts/277-timelike-a-network-simulator):

In theory, the top-level routing system is globally distributed, which means you could face inter-dc latencies for consensus; at least one round trip and likely more than that. There's a reason folks don't split Dynamo over multiple datacenters without special rack-aware reasoning; on inhomogenous networks you can get spectacular tail latencies. The problem is that the dynamics of the system–like dyno queue depths–varies on a shorter timescale than an inter-DC system can reach consensus, which renders any consensus-based approach useless. The only solutions I can think of are

1.) Go stateless
2.) Be stateful, but only over dynamics that change slowly, like overall DC load
3.) Localize state to a system with tight latency bounds, like a single DC or machine.

In practice, your DNS system is already balancing traffic on a geographic basis, which helps you choose 3); a hybrid stateless/short-latency CP system is what I describe in the second post.

0xgeert commented 11 years ago

Cloudant has a great commercial model.

Based on GB transferred per month ($1 per GB).

E.g. a tenant doing 20 req/minute (1/3 req/sec) with 10 kB per response would be:

reqs: 1/3 req/s x 10 kB x ~2.6M s/month ≈ 9 GB → ~$9

puts: X

≈ $40 / month / tenant

A $1/hour box on EC2 should be able to sustain at least 50 req/sec nonstop (my guess is it's at least 100, though).

This costs $720 / month.

We need 3x redundancy, however => ~$2,200 / month.

At optimum capacity (with 50 req/s per box and 1/3 req/s per tenant) we could have 150 tenants / box = $6,000 / month.

At 75% capacity (which we should be able to auto-plan) => $4,500 / month.
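The same back-of-the-envelope numbers as a small script, so the assumptions ($1/GB, a $1/hour box, 3x redundancy, 1/3 req/s and $40/month per tenant) are explicit and easy to tweak; all inputs are the guesses from this comment, not measured values.

```ts
// Back-of-the-envelope capacity/revenue model using the guesses above.

const SECONDS_PER_MONTH = 30 * 24 * 3600;                      // ~2.6M

// Per-tenant traffic (Cloudant-style $1/GB pricing for reads).
const reqPerSecPerTenant = 1 / 3;                              // 20 req/min
const responseKb = 10;
const gbPerTenantPerMonth =
  (reqPerSecPerTenant * responseKb * SECONDS_PER_MONTH) / 1e6; // ~8.6 GB => ~$9

// Per-box cost.
const boxDollarsPerMonth = 1 * 24 * 30;                        // $1/hour => $720
const redundancy = 3;                                          // => ~$2,160/month
const costPerMonth = boxDollarsPerMonth * redundancy;

// Per-box revenue.
const reqPerSecPerBox = 50;                                    // conservative guess
const tenantsPerBox = reqPerSecPerBox / reqPerSecPerTenant;    // 150
const revenuePerTenant = 40;                                   // $/month, from above
const revenueAtFull = tenantsPerBox * revenuePerTenant;        // $6,000
const revenueAt75 = 0.75 * revenueAtFull;                      // $4,500

console.log({ gbPerTenantPerMonth, costPerMonth, tenantsPerBox,
              revenueAtFull, revenueAt75 });
```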

0xgeert commented 11 years ago

A more realistic one is:

Divide the requests: 25% through ES (80 req/sec), 50% through Couchbase (1,000 req/sec), 25% through Redis / cache (10,000 req/sec).

For argument's sake, say 200 req/sec per box. At 75% capacity that would give $18,000 / box / month (600 tenants at full capacity). See the sketch below for the blended-capacity math.

Of course, the bottleneck is now probably RAM. (Serving ES and Redis on separate boxes makes it easier to fine-tune their characteristics.)

Still, at scale this should all be possible.
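To sanity-check the 200 req/sec figure, the blended per-box throughput for that mix is the weighted harmonic mean of the three backend rates (the shares and rates are the rough guesses above):

```ts
// Blended capacity for a 25% ES / 50% Couchbase / 25% Redis request mix,
// using the rough per-backend rates above.
const mix = [
  { share: 0.25, reqPerSec: 80 },    // ES
  { share: 0.50, reqPerSec: 1000 },  // Couchbase
  { share: 0.25, reqPerSec: 10000 }, // Redis / cache
];

// Weighted harmonic mean: time spent per "average" request.
const secondsPerReq = mix.reduce((t, m) => t + m.share / m.reqPerSec, 0);
const blendedReqPerSec = 1 / secondsPerReq; // ~274 req/s, so 200 is conservative

console.log(blendedReqPerSec);
```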

0xgeert commented 11 years ago

Have an incentive for users to specify a cache time (halving the price per request for all requests that hit the cache?).

Bad for the bottom margin.

INSTEAD: MAKE THE CASE THAT SETTING CACHE SIGNALS IS GOOD FOR LATENCY / THROUGHPUT, WHICH IT IS. THAT SHOULD BE ENOUGH.

0xgeert commented 11 years ago

From: http://www.dbms2.com/2012/06/03/introduction-to-cloudant/

Cloudant’s clustering scheme is much as you’d expect: Consistent hashing. RYW quorum consistency, with a default of 3 copies (across 2 data centers), 2 reads, and 2 writes.

RYW -> http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/
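The reason N=3, R=2, W=2 gives read-your-writes is simply that R + W > N, so every read quorum overlaps every write quorum. A tiny check with the numbers from the Cloudant description (the function name is just for illustration):

```ts
// Read-your-writes holds when every read quorum overlaps every write
// quorum, i.e. R + W > N.
function readYourWrites(n: number, r: number, w: number): boolean {
  return r + w > n;
}

console.log(readYourWrites(3, 2, 2)); // true: 2 + 2 > 3
```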

Multi-tenant servers still use disks (as opposed to solid-state storage). Single-tenant customers can choose among various different configurations.