appc / spec

App Container Specification and Tooling (archived, see https://github.com/rkt/rkt/issues/4024)
Apache License 2.0

Isolators: wording around request and limit is tricky #361

Open thockin opened 9 years ago

thockin commented 9 years ago

CPU: Google defines request as "you are granted access to this much" and limit as "you might be able to get this much, at reduced QoS up to and including throttling". It really has to be scary for a user to enter that territory.

Memory: You say "over this limit will be reclaimed", but that's neither right nor useful. Page reclaim is running all the time, which is exactly what you want; if you never reclaim, you are guaranteed to have poor efficiency. Google defines request as "you are granted access to this much" and limit as "you might be able to get this much, at reduced QoS up to and including sudden death". It really has to be scary for a user to enter that territory.

For all resources that have dual limits, we should think hard about the words that convey the main idea - QoS bands - without detailing the implementation (like reclaim). Once people depend on the implementation, we're locked into whatever decisions we already made. Trust me - it sucks.
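
For reference, the dual-level shape under discussion currently surfaces in a manifest's isolators roughly like this (isolator names and units follow my reading of the current draft and are only illustrative):

```json
{
  "isolators": [
    {
      "name": "resource/cpu",
      "value": { "request": "250", "limit": "500" }
    },
    {
      "name": "resource/memory",
      "value": { "request": "256M", "limit": "512M" }
    }
  ]
}
```

The wording question is what a consumer should expect to happen between request and limit: reduced QoS (throttling for CPU, up to and including OOM kill for memory), without the spec committing to implementation details such as reclaim.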

globalcitizen commented 9 years ago

It's not the wording, it's the whole damn concept.

The point of #390 (rather aggressively closed, though perhaps poorly worded) was that request and limit do not apply to all subsystems and that there are in fact two competing concerns: start-time ("need X or don't start") and run-time ("handle out-of-resource issues").

For start-time issues, images could usefully spec their minimum resource on a per-subsystem basis (eg. X memory, Y CPU, Z storage volume, B network bandwidth).

Run-time issues are a completely separate question: only certain subsystems actually have usage-driven, burstable resource utilization (eg. CPU and network); others (such as storage and memory) do not. Some are run-time modifiable (eg. CPU and network), others are not (eg. storage on standard filesystems). Another question that remains implicit but is related to this area is run-time detectability of dangerous utilization levels, and beyond that the ultimate resource failure modes. Consider the difference between:

I would argue this whole area needs a rethink with at least these factors in mind. As #390 stated, the current notion is conceptually broken.

globalcitizen commented 9 years ago

See also #359 on CPU resource limits.

thockin commented 9 years ago

With all due respect, a LOT of experience has gone into the two-level limit system, and it is deployed (albeit in a slightly different form) in billions of containers here. It works and is very powerful, but it does need to be described and implemented carefully.

globalcitizen commented 9 years ago

That's a pretty weak and general answer to a series of specific points. I have no idea who you are or where "here" is (I guess Google?), nor do I really care. I would request that you make your argument based on actual evidence instead of "trust me" / "it's how we've always done it". As it stands, I am forced to wonder why these points are getting shut down instead of responded to even after being re-raised.

globalcitizen commented 9 years ago

As further illustration of the problems with trying to unify all subsystems into a single structure here, I would point out that:

So there are two further, very clear mismatches between popular subsystems and the current approach, which strikes me as fundamentally flawed.

thockin commented 9 years ago

I can quote evidence at you all day long, but you don't have to believe me because it's all internal evidence I can't show you.

Two grades of service for certain classes of resource (particularly CPU and memory) allow all sorts of things that are desirable.

Being able to burst CPU if it is available gives better user latency, utilization, and power usage. It's not a hard guarantee, but it can be sold with reasonable probability and an appropriate SLA.

Being able to burst memory allows jobs under the control of an auto-pilot system to temporarily exceed their request, if possible, rather than OOM.

This is part of a way to describe graduated QoS for jobs - which is a big part of how you can achieve respectable DC utilization numbers.
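
As a sketch of what the two grades of service look like in request/limit terms (reusing the isolator shape above; the outer keys are just labels for this example, not spec fields, and the values are illustrative): a latency-sensitive app pins request equal to limit and gets a firm slice, while a batch app asks for a small guaranteed share but may burst much higher when slack capacity exists.

```json
{
  "latency-sensitive-frontend": {
    "name": "resource/cpu",
    "value": { "request": "500", "limit": "500" }
  },
  "best-effort-batch": {
    "name": "resource/cpu",
    "value": { "request": "100", "limit": "1000" }
  }
}
```

The scheduler packs machines against the requests; the gap between request and limit is what gets handed out opportunistically, at reduced QoS.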

It's not "how we've always done it" it's how we figured out to do it after trying other approaches. Hopefully specifications like this are built on the backs of the people who tried things and failed and tried again. A big part of the reason we (Google) are involved at all is to help the world not waste time on things that don't work so well.

This does not apply equally to all subsystems, of course, but it's fundamental enough that the spec does need to address it, IMO.

globalcitizen commented 9 years ago

OK, so essentially you are saying that for Google workloads inside Google, which probably emphasize density, Google likes this setup.

That's fine, but it's really a question of scheduling vs. workload vs. resources vs. business goals... not something I'd personally try to force into a 'one size fits all' model.

I would say that since the ACE should essentially do the scheduling (it's in the best position to do so), the container-specific metadata (this repo) should probably take the standpoint of providing assisting information rather than assuming the ACE uses a certain scheduling or resource-management model. That means flexible ways to provide subsystem-specific context for container resource use as input to the ACE scheduler - required, probable, and maximum - and potentially, where applicable or workable, specific recovery and failure modes.
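
Purely to make that concrete - none of the field names below exist in the spec today, they are a hypothetical illustration of subsystem-specific hints an ACE scheduler could consume:

```json
{
  "resource-hints": {
    "memory":  { "required": "256M", "probable": "512M", "maximum": "1G" },
    "cpu":     { "required": "100",  "probable": "250",  "maximum": "1000" },
    "storage": { "required": "2G" },
    "network": { "probable": "10M",  "maximum": "100M" }
  }
}
```

Subsystems with no meaningful run-time burst (eg. storage on a standard filesystem) would simply omit the fields that don't apply, and recovery/failure modes could hang off the same per-subsystem entries.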

thockin commented 9 years ago

I appreciate your enthusiasm, and would welcome some concrete proposals, in the form of a pull request, which we could discuss. What we have now is the result of experience, long discussions, and what-ifs.

globalcitizen commented 9 years ago

Well, I do think that if the so-called "result of experience and long discussions" is not accessibly documented, its validity is questionable. So what we have right now is actually 'what-if' and 'excuse me, that what-if has identifiable issues' - which is the purpose of an issues database, no?

Reading between the lines, Google has an existing investment in Google's cloud platform, which probably expects these kinds of inputs at present. So one could expect push-back from Google with respect to altering them, if they work right now. However, I don't expect an unwillingness to share reasoning; if that is what is encountered, then it compromises Google's capacity to claim good-faith participation.

thockin commented 9 years ago

I think we're open to well-reasoned changes.

jonboulle commented 9 years ago

I wrote a long reply a while ago but apparently never successfully submitted it. I'm not sure it would be constructive for me to reproduce it, so let me just summarise by echoing Tim's point that we are very open to substantive proposals for how to improve things here. On the face of it, it sounds like you might be receptive to a three-pronged (required/probable/maximum) approach, which I would argue is conceptually not far at all from where we are today; it might just need to be clearer with some wordsmithing.

jonboulle commented 9 years ago

An interface-oriented, multi-interface supporting networking configuration (desirable) would demand limits for networking related resources be applied differently against multiple interfaces

I don't understand how this particular point relates to this issue; AFAICT it's just a facet of how the network bandwidth isolator is defined today (lack of granularity) - https://github.com/appc/spec/issues/278
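
For illustration only - per-interface limits are not in the spec, and the isolator value shape below is invented - the granularity #278 asks for would look something like this, whereas, as I understand it, today's isolator carries a single limit that is not interface-aware:

```json
{
  "name": "resource/network-bandwidth",
  "value": {
    "interfaces": [
      { "name": "eth0", "limit": "1G" },
      { "name": "eth1", "limit": "100M" }
    ]
  }
}
```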