clearcontainers / proxy

Hypervisor based containers proxy
Apache License 2.0

[PROPOSAL] [RFC] KSM throttler #168

Open sameo opened 6 years ago

sameo commented 6 years ago

Definition

Kernel Same-page Merging (KSM) throttling is the process of regulating the KSM daemon by dynamically modifying the KSM sysfs entries, in order to minimize memory duplication as fast as possible while keeping the KSM daemon load low. KSM throttling currently uses container creation events as its single input.
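As a rough illustration of what "modifying the KSM sysfs entries" means in practice, the sketch below reads and writes entries under /sys/kernel/mm/ksm. The entry names (run, pages_to_scan, sleep_millisecs) are the standard KSM knobs; the helper names and the `base_dir` parameter are assumptions made for this example, so it can be exercised against a scratch directory instead of the real sysfs tree.

```python
import os
import tempfile

KSM_SYSFS_DIR = "/sys/kernel/mm/ksm"  # standard KSM sysfs location on Linux


def read_ksm_value(name, base_dir=KSM_SYSFS_DIR):
    """Return the integer stored in one KSM sysfs entry, e.g. 'pages_to_scan'."""
    with open(os.path.join(base_dir, name)) as f:
        return int(f.read().strip())


def write_ksm_value(name, value, base_dir=KSM_SYSFS_DIR):
    """Write an integer to one KSM sysfs entry (requires root on a real system)."""
    with open(os.path.join(base_dir, name), "w") as f:
        f.write(str(value))


def apply_ksm_settings(settings, base_dir=KSM_SYSFS_DIR):
    """Apply a {entry_name: value} mapping, e.g. a more aggressive scan profile."""
    for name, value in settings.items():
        write_ksm_value(name, value, base_dir)


if __name__ == "__main__":
    # Demonstration against a temporary directory standing in for sysfs.
    # The values below are illustrative, not recommended settings.
    with tempfile.TemporaryDirectory() as fake_sysfs:
        apply_ksm_settings(
            {"run": 1, "pages_to_scan": 1000, "sleep_millisecs": 10}, fake_sysfs
        )
        print(read_ksm_value("pages_to_scan", fake_sysfs))
```

Throttling "up" would then amount to applying a profile with a larger pages_to_scan and a smaller sleep_millisecs, and throttling "down" to the reverse.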

Problem statement

Today's KSM throttling is part of the Clear Containers proxy code and is a passive component (i.e. it needs to be notified by other parts of the code). That is problematic for 2 main reasons:

  1. KSM throttling depends on a system wide proxy daemon to be running. As Clear Containers is moving towards a one proxy per VM architecture, the proxy code will no longer have a system wide view and will thus no longer be able to trigger the KSM throttling routine based on the overall container creation activity.

  2. The current KSM throttling code is very much specific to the Clear Containers proxy. It cannot be extracted from the code base because it is passive, i.e. it relies on being called from other parts of the proxy implementation. Nor can it be made a separate standalone component without callers modifying their code to call into a KSM throttling specific API.

Proposal

This proposal is about creating a generic, Clear Containers agnostic and active KSM throttling service: the KSM throttler.

Clear Containers agnostic

The KSM throttler will not be dependent on any Clear Containers piece of code, API or architecture design. As a matter of fact, we believe that a KSM throttler could not only benefit VM based containers but also generic/legacy VM based workloads where the goal would be to minimize memory duplication as quickly as possible.

Active component

The KSM throttler will by default be an active component, checking different system wide values and settings in order to make informed KSM throttling decisions. In other words, the KSM throttler will by default not have to be explicitly triggered by e.g. a VM based container runtime or proxy, but will instead actively watch for specific information about the life cycles of VMs or VM based containers.

Passive Fallback

When a system cannot provide a reliable source of information about VM life cycles, the KSM throttler will provide a passive UNIX socket for components like container runtimes to notify it about VM or container specific events (creation, destruction, etc.).

Implementation

The KSM throttler implementation can be split into 2 parts: The throttling algorithm and the input sources handling.

Input sources

The KSM throttler will be able to handle several input sources, and adding a new input source implementation to the source code should be fairly easy. In practice, a KSM throttling input source will watch a specific system wide component and notify the KSM throttler about any new life cycle event for a VM or VM based container. The KSM throttler is a server and all input sources are potential clients. We will use the gRPC protocol between the throttler and its clients, defined by the following proto file:

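syntax = "proto3";

// Empty is not defined in the original proposal; an empty message is assumed here.
message Empty {}

service KSMThrottler {
      // An input source registers once, under its own name.
      rpc Register(RegisterRequest) returns (Empty) {}
      // The input source then streams VM life cycle events.
      rpc Events(stream VMEvent) returns (stream VMEvent) {}
}

message RegisterRequest {
      string name = 1;
}

enum EventType {
      CREATING = 0;
      CREATED = 1;
      DESTROYING = 2;
      DESTROYED = 3;
}

message VMEvent {
      string vm_id = 1;
      EventType type = 2;
}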

Each KSM throttler input source would first register with the KSM throttler and then send it a stream of events.
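The register-then-stream flow can be sketched without any gRPC machinery. The following is an in-memory stand-in, not the actual gRPC service: the class `InMemoryThrottler`, the `vm_lifecycle` helper and the explicit `source_name` argument to `events` are assumptions for illustration (in a real gRPC server the source would presumably be identified by its connection instead).

```python
from dataclasses import dataclass
from enum import IntEnum


class EventType(IntEnum):
    """Mirrors the EventType enum from the proto definition."""
    CREATING = 0
    CREATED = 1
    DESTROYING = 2
    DESTROYED = 3


@dataclass(frozen=True)
class VMEvent:
    """Mirrors the VMEvent message from the proto definition."""
    vm_id: str
    type: EventType


class InMemoryThrottler:
    """Hypothetical stand-in for the server: same two calls, no network."""

    def __init__(self):
        self.sources = set()
        self.received = []

    def register(self, name):
        """Record an input source; must be called before events()."""
        self.sources.add(name)

    def events(self, source_name, event_stream):
        """Consume a stream (any iterable) of VMEvent from a registered source."""
        if source_name not in self.sources:
            raise PermissionError(f"input source {source_name!r} is not registered")
        for event in event_stream:
            self.received.append((source_name, event))


def vm_lifecycle(vm_id):
    """Yield the events an input source might report while one VM starts."""
    yield VMEvent(vm_id, EventType.CREATING)
    yield VMEvent(vm_id, EventType.CREATED)


if __name__ == "__main__":
    throttler = InMemoryThrottler()
    throttler.register("virtcontainers")
    throttler.events("virtcontainers", vm_lifecycle("vm-1"))
    print(throttler.received)
```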

KSM Throttling

The initial throttling algorithm will follow the current proxy one: we throttle KSM up on each VM creation and then progressively throttle it down as long as no new VMs are created.
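A minimal sketch of that "throttle up on creation, step down while idle" behaviour follows. It is not the proxy's actual code: the three profiles, their values and the `step_down_after` interval are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without waiting.

```python
class KSMThrottleState:
    """Tracks which KSM profile should currently be active."""

    # Ordered from most to least aggressive. All values are hypothetical.
    PROFILES = [
        {"name": "aggressive", "pages_to_scan": 1000, "sleep_millisecs": 10},
        {"name": "standard", "pages_to_scan": 100, "sleep_millisecs": 20},
        {"name": "slow", "pages_to_scan": 10, "sleep_millisecs": 100},
    ]

    def __init__(self, step_down_after=30.0):
        self.step_down_after = step_down_after  # quiet seconds per downward step
        self.level = len(self.PROFILES) - 1     # start at the slowest profile
        self.last_change = 0.0

    def kick(self, now):
        """A VM was created: jump straight to the most aggressive profile."""
        self.level = 0
        self.last_change = now
        return self.PROFILES[self.level]

    def tick(self, now):
        """Called periodically: step down one profile per elapsed quiet interval."""
        while (self.level < len(self.PROFILES) - 1
               and now - self.last_change >= self.step_down_after):
            self.level += 1
            self.last_change += self.step_down_after
        return self.PROFILES[self.level]


if __name__ == "__main__":
    state = KSMThrottleState(step_down_after=30.0)
    print(state.kick(now=0.0)["name"])
    print(state.tick(now=45.0)["name"])
```

The profile returned by `kick` or `tick` would then be written to the KSM sysfs entries.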

Phases

The implementation will follow an incremental process going through a few phases:

Phase 1: virtcontainers compatibility

The virtcontainers KSM throttler input source will watch the virtcontainers pod filesystem through inotify in order to detect whenever a pod is created or destroyed.
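The proposal calls for inotify here. Python's standard library has no inotify binding, so the sketch below swaps in a simpler technique, comparing two directory snapshots, purely to show how filesystem changes map to pod events. The layout it assumes (one sub-directory per pod) and the function names are hypothetical.

```python
import os
import tempfile


def snapshot_pods(pods_dir):
    """Return the set of pod IDs, assuming one sub-directory per pod."""
    return {entry.name for entry in os.scandir(pods_dir) if entry.is_dir()}


def diff_pods(previous, current):
    """Translate two snapshots into (event_name, pod_id) tuples."""
    events = [("CREATED", pod) for pod in sorted(current - previous)]
    events += [("DESTROYED", pod) for pod in sorted(previous - current)]
    return events


if __name__ == "__main__":
    # A temporary directory stands in for the virtcontainers pod directory.
    with tempfile.TemporaryDirectory() as pods_dir:
        before = snapshot_pods(pods_dir)
        os.mkdir(os.path.join(pods_dir, "pod-a"))
        after = snapshot_pods(pods_dir)
        print(diff_pods(before, after))
```

An inotify-based source would receive the same create and delete notifications from the kernel instead of computing them from snapshots.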

Phase 2: Implement a fallback input source [TBD]

sameo commented 6 years ago

cc @jodh-intel @grahamwhaley @mcastelino @sboeuf @egernst

jodh-intel commented 6 years ago

This sounds great. I wonder if we need extra events to cater for memory hotplug/removal?

mcastelino commented 6 years ago

@sameo KSM acts on the global set of mergeable pages and global parameters. So having multiple input source types may not make sense. Also the VM lifecycle events in that case may not add too much value.

If the ksm daemon were to rely on the source just for kicks (completely agnostic of the reason for the kick) and then use the state of the ksm itself like pages_shared, pages_sharing, pages_unshared and pages_volatile to figure out how to throttle itself; the solution may be better and simpler.

For example if the daemon sees high/unchanging ratio of pages_unshared to pages_sharing when no kicks have been seen and full_scans exceeds the number of scans required to merge, it should throttle down.

Also, the lifecycle internal to the VM itself (i.e. the application within the VM allocating memory) is an event that cannot be triggered by a source. However, it is observable from the ksm status as well as from memory usage across the active VMs.

All of these derivable events can be inferred from /sys/kernel/mm/ksm/ and /proc/meminfo
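As a rough illustration of that idea, the sketch below reads the KSM counters named above and applies one possible throttle-down rule: no kick seen, enough full scans completed, and a pages_unshared to pages_sharing ratio that has stopped moving. The thresholds `min_full_scans` and `tolerance` are hypothetical, as are the function names; only the "unchanging ratio" half of the suggestion is implemented.

```python
import math
import os

KSM_SYSFS_DIR = "/sys/kernel/mm/ksm"
STAT_NAMES = ("pages_shared", "pages_sharing", "pages_unshared",
              "pages_volatile", "full_scans")


def read_ksm_stats(base_dir=KSM_SYSFS_DIR):
    """Read the KSM counters into a {name: int} dictionary."""
    stats = {}
    for name in STAT_NAMES:
        with open(os.path.join(base_dir, name)) as f:
            stats[name] = int(f.read().strip())
    return stats


def unshared_ratio(stats):
    """pages_unshared / pages_sharing; infinity when nothing is being shared."""
    if stats["pages_sharing"] == 0:
        return math.inf
    return stats["pages_unshared"] / stats["pages_sharing"]


def should_throttle_down(previous, current, kicked,
                         min_full_scans=2, tolerance=0.05):
    """Decide whether KSM can be slowed down between two stat samples."""
    if kicked:
        return False  # new mergeable memory is expected, keep scanning
    if current["full_scans"] - previous["full_scans"] < min_full_scans:
        return False  # KSM has not yet had a fair chance to merge
    before, after = unshared_ratio(previous), unshared_ratio(current)
    if before == after:
        return True   # also covers the case where both ratios are infinite
    if math.isinf(before) or math.isinf(after):
        return False  # sharing started or stopped entirely: still changing
    return abs(after - before) <= tolerance * before
```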

sameo commented 6 years ago

@mcastelino

> If the ksm daemon were to rely on the source just for kicks (completely agnostic of the reason for the kick)

So the kick is the CREATED event, and this is what the throttling routine will rely on (as it does today). Are you suggesting that we only provide a Kick() daemon API that the input source would call, instead of sending VM events? That would indeed be simpler.

> and then use the state of the ksm itself like pages_shared, pages_sharing, pages_unshared and pages_volatile to figure out how to throttle itself; the solution may be better and simpler. For example if the daemon sees high/unchanging ratio of pages_unshared to pages_sharing when no kicks have been seen and full_scans exceeds the number of scans required to merge, it should throttle down.

Yes. As you know, right now we throttle down based on kick timeouts, which may eventually converge to a stabilized pages_unshared to pages_sharing ratio. But I think watching this ratio could help us throttle down more quickly and consume less CPU. I'll add something along those lines to the throttling routine, thanks.

grahamwhaley commented 6 years ago

I agree with where @mcastelino is heading. I'm not sure we need a container-specific, VM-specific, or possibly any specific kick interface at all: KSM solely processes MADV_MERGEABLE pages, and there are some (not many, but some) user space apps that also register such zones, so I think we can get the info the throttler needs by watching the files in /sys/kernel/mm/ksm. We should, I think, be able to tell when a new zone or set of pages has been added and, as @mcastelino notes, also how well matched the settings are to the set size and how much is left to scan or potentially merge. I'm not sure having any extra information beyond that (such as VM creation notifications) will help the algorithm, but maybe it can: maybe we can predict that new zones will be arriving and 'ramp up' early. I suspect for phase1 we can just use the info in the sys ksm files. Then as a phase2 we could add an API and hook VM creates etc., and see if it makes a noticeable difference.

I also wonder if this work could live in the kernel itself - but, like many other algorithm-based daemons, it is probably better, at least initially, as a user space app, which will then have much more flexibility and a faster turnaround for algorithm investigation and tweaking. Later, if we settle on a single algo, we could then consider whether that could be a mode added to the kernel ksm code.

sameo commented 6 years ago

@grahamwhaley

> I suspect for phase1 we can just use the info in the sys ksm files

Let me try to understand: For phase 1 you're suggesting we don't even have a kick() interface for the input source to call?

@grahamwhaley @mcastelino I agree having a single Kick() entry point would be simpler. But I want to point out that moving from an event stream to a simple Kick() API call means we assume every input source is KSM aware, i.e. it implicitly knows when it is a good time to kick the KSM daemon. Having input sources send events to the throttler instead keeps the entire logic, including the decision whether or not to kick, inside the throttler. Input sources would just be dumb boxes reporting all sorts of events to the throttler. The throttler could then decide, based on that event stream and the KSM data exposed through sysfs, whether it's worth kicking a KSM scan or not.
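The difference between the two designs can be sketched as follows. With an event stream, the input source only calls `handle_event` and the throttler decides whether that warrants a kick; with a Kick()-only API, the source would call `kick` directly and would need to know when that is appropriate. The class and method names are hypothetical, and the policy shown (kick on CREATED only) simply follows the current behaviour described earlier in this thread.

```python
from enum import Enum


class EventType(Enum):
    """Mirrors the EventType enum from the proposed proto definition."""
    CREATING = 0
    CREATED = 1
    DESTROYING = 2
    DESTROYED = 3


class EventDrivenThrottler:
    """Keeps the kick decision inside the throttler; sources only report."""

    def __init__(self):
        self.kicks = 0
        self.running_vms = set()

    def handle_event(self, vm_id, event_type):
        """Process one reported event; return True if it triggered a kick."""
        if event_type is EventType.CREATED:
            self.running_vms.add(vm_id)
            self.kick()
            return True
        if event_type is EventType.DESTROYED:
            self.running_vms.discard(vm_id)
        return False

    def kick(self):
        """The single entry point a KSM-aware source would call directly."""
        self.kicks += 1


if __name__ == "__main__":
    throttler = EventDrivenThrottler()
    for event in (EventType.CREATING, EventType.CREATED):
        throttler.handle_event("vm-1", event)
    print(throttler.kicks)
```

Changing the policy, for example to also consult the sysfs counters before kicking, would then only touch `handle_event` and leave every input source unchanged.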

@grahamwhaley

> I also wonder if this work could live in the kernel itself

I think that's a pretty good point. Pushing strongly opinionated policies inside the kernel is usually a tough one, but it could be worth trying.

grahamwhaley commented 6 years ago

After some IRC chatter - one of the cruxes of my theory of just using the ksm sysfs files was that we could watch those files for changes - but @sameo notes that we cannot use inotify events on sysfs, which rather scuppers that, as we don't want to poll, and hooking the madvise call sounds rather unpleasant. At that point, yes, a kick interface sounds like the most efficient way forward: one that either tells us to go check the ksm sysfs status (though I can conceive that the kick could arrive before the ksm files have been updated or the madvise call made...) or hands us a stream of information that we then base decisions on (including when to go look at the ksm files).

We could start with a very small set of API events, that effectively tell us it is time to poll the KSM sysfs files.

sameo commented 6 years ago

I'm also researching how we could trace (perf events) madvise...

jodh-intel commented 6 years ago

Implemented by https://github.com/kata-containers/ksm-throttler.