firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

`deflate_on_oom` doesn't seem to work as expected/documented #4324

Open simonis opened 11 months ago

simonis commented 11 months ago

After reading the Ballooning documentation, my understanding of `deflate_on_oom` is that if the parameter is set to true, the balloon device will be deflated automatically whenever a process in the guest requires memory pages which cannot otherwise be provided:

deflate_on_oom: if this is set to true and a guest process wants to allocate some memory which would make the guest enter an out-of-memory state, the kernel will take some pages from the balloon and give them to said process
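For reference, this option is set when the balloon device is configured, e.g. in the balloon section of a Firecracker JSON config file (a sketch; the field names follow the balloon API, and the values mirror the experiment described in this issue):

```
"balloon": {
    "amount_mib": 900,
    "deflate_on_oom": true,
    "stats_polling_interval_s": 1
}
```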

However, if I run Firecracker with e.g. 2 vCPUs, 1 GiB of memory, and a balloon device of 900 MiB:

{
    "target_pages": 230400,
    "actual_pages": 230400,
    "target_mib": 900,
    "actual_mib": 900,
    "swap_in": 0,
    "swap_out": 0,
    "major_faults": 92,
    "minor_faults": 3103,
    "free_memory": 66572288,
    "total_memory": 84398080,
    "available_memory": 0,
    "disk_caches": 151552,
    "hugetlb_allocations": 0,
    "hugetlb_failures": 0
}

…and then try to start a Java process in the guest with `-Xms800m -Xmx800m` (i.e. a heap size of 800 MiB), the Java process in the guest hangs, Firecracker uses ~200% CPU time, but the actual size occupied by the balloon device in the guest does not change and remains at 900 MiB:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2552550 xxxxxxxx  20   0 1063092  82992  82096 R  99,9   0,3   3:44.76 fc_vcpu 1
2552544 xxxxxxxx  20   0 1063092  82992  82096 R  90,9   0,3   3:12.17 firecracker
2552549 xxxxxxxx  20   0 1063092  82992  82096 S  25,0   0,3   0:43.47 fc_vcpu 0

Once I reset the target size of the balloon device to 100 MiB, the Java process becomes unblocked and starts.
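For context, that reset is done through Firecracker's API socket on the host (a PATCH on the /balloon resource); the request body is simply:

```
{ "amount_mib": 100 }
```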

However, based on the documentation of the deflate_on_oom option, I would have expected the guest kernel to deflate the balloon device automatically when deflate_on_oom=true.

If I run the same experiment with deflate_on_oom=false, I instantly get an out-of-memory error when trying to start the Java process:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000ce000000, 279576576, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 279576576 bytes for committing reserved memory.

which is what I would have expected.

Also, if I increase (i.e. inflate) the balloon to 900 MiB again after the Java process has started, I start getting warnings from the balloon driver (as documented):

[  282.580254] virtio_balloon virtio0: Out of puff! Can't get 1 pages

…but the CPU usage again goes up to almost ~200%. Is this expected? The warnings themselves are fine, but I wouldn't expect Firecracker to burn all its CPU shares while trying to inflate the balloon.

So to summarize, is the described behavior with deflate_on_oom=true a bug in the implementation or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?

PS: I've used the following kernel and Firecracker versions for the experiments: Guest kernel: 5.19.8, Host kernel: 6.5.7 (Ubuntu 20.04), Firecracker: 1.5.1 and 1.6.0-dev (from today, 036d9906)

bchalios commented 11 months ago

Hi Volker,

Thanks for reporting this. We will take a look and reproduce it, but in the meantime I'd like to point out that this configuration:

Guest kernel: 5.19.8 Host kernel : 6.5.7 (Ubuntu 20.04)

is not supported. Would you be able to try to reproduce with a supported set of host/guest kernels? What we test with is guest x host = [4.14, 5.10] x [4.14, 5.10, 6.1] (guest 6.1 might work too).

bchalios commented 11 months ago

Also, to answer your question:

So to summarize, is the described behavior with deflate_on_oom=true a bug in the implementation or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?

this should work, and we have tests indicating that it does, i.e. the balloon gets deflated. However, we do not track the CPU time consumed to achieve this.

pb8o commented 10 months ago

Hi @simonis, is the answer that @bchalios provided enough, does it resolve your issue, or is there anything else to investigate?

simonis commented 9 months ago

Sorry for the late answer @bchalios , @pb8o. I finally managed to run my experiments on a "supported" platform, but unfortunately the results are exactly the same.

Host: 6.1.72-96.166.amzn2023.x86_64, Guest: 6.1.74 (with the config from microvm-kernel-ci-x86_64-6.1.config plus CONFIG_IP_PNP=y), Firecracker: v1.6.0 and v1.7.0-dev (from today, 49db07b3)

So to summarize the problem: when I start Firecracker with a large ballooning device and deflate_on_oom: true and then try to start a process in the guest which requires memory reserved by the balloon, the guest seems to hang and the Firecracker threads on the host will run at 100% CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                           
 298704 ec2-user  20   0 1058260  85544  84964 R  57.9   0.0   5:45.06 fc_vcpu 1                                         
 298703 ec2-user  20   0 1058260  85544  84964 R  56.6   0.0   5:42.44 fc_vcpu 0                                         
 298698 ec2-user  20   0 1058260  85544  84964 R  26.2   0.0   2:33.69 firecracker                                       

The guest itself is not really dead-locked, just extremely slow. I can ssh into it and see the following:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                             
  139 root      20   0 3150612  36748     32 S 112.5   3.6  13:00.63 jshell                                              
   44 root      20   0       0      0      0 R 100.0   0.0  10:13.17 kswapd0                                             

The Java process I've started (i.e. jshell) is starving because it doesn't get enough memory, but it doesn't run into a hard OOM like it does when running with deflate_on_oom=false. kswapd0 is running at 100% within the guest.

Querying the balloon metrics from the guest shows that the balloon slightly deflates itself, but this happens extremely slowly. E.g. initially we have something like this:

{
    "target_pages": 230400,
    "actual_pages": 228864,
    "target_mib": 900,
    "actual_mib": 894,
    "swap_in": 0,
    "swap_out": 0,
    "major_faults": 7030729,
    "minor_faults": 14225986,
    "free_memory": 49930240,
    "total_memory": 1033064448,
    "available_memory": 0,
    "disk_caches": 655360,
    "hugetlb_allocations": 0,
    "hugetlb_failures": 0
}

And after about 45 minutes we get to:

{
    "target_pages": 230400,
    "actual_pages": 180992,
    "target_mib": 900,
    "actual_mib": 707,
    "swap_in": 0,
    "swap_out": 0,
    "major_faults": 18425420,
    "minor_faults": 37668071,
    "free_memory": 50409472,
    "total_memory": 1033064448,
    "available_memory": 0,
    "disk_caches": 806912,
    "hugetlb_allocations": 0,
    "hugetlb_failures": 0
}
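As a quick cross-check, the reported page and MiB figures are consistent with the balloon accounting in 4 KiB pages (a small sketch, using the numbers from the snapshots above):

```python
PAGE_SIZE = 4096  # the balloon device accounts in 4 KiB pages

def pages_to_mib(pages: int) -> int:
    """Convert a balloon page count to MiB."""
    return pages * PAGE_SIZE // (1024 * 1024)

print(pages_to_mib(230400))  # target_pages -> 900 (target_mib)
print(pages_to_mib(180992))  # actual_pages -> 707 (actual_mib)
```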

If I wait about 60 minutes, jshell finally starts up and becomes usable.

So ballooning is indeed "kind of" working, but not really practically usable. I would expect the balloon device to deflate much more promptly in this case.

simonis commented 9 months ago

I did one more run to confirm the behavior and collect more numbers:

time         target_mib  actual_mib  free_memory  available_memory
17:05:44     900         900         63778816     0
17:24:26     900         893         50581504     0
17:36:34     900         879         51023872     0
17:44:46     900         870         54579200     0
17:55:46     900         835         67063808     0
18:06:14     900         797         50536448     0
18:13:16     900         637         55500800     0
jshell exit  900         637         321597440    251809792
18:25:12     900         637         338468864    270401536
18:41:25     900         637         338210816    270143488

As you can see, it takes more than an hour until jshell becomes responsive (somewhere between 18:06 and 18:13). It also looks like the deflation starts extremely slowly but speeds up as time goes on.
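To put a number on the deflation speed, here is a rough average over the first and last pre-exit rows of the table (a sketch; a simple average, ignoring that the rate clearly picks up over time):

```python
from datetime import datetime

# First and last samples before jshell became responsive (from the table above).
t0, mib0 = datetime.strptime("17:05:44", "%H:%M:%S"), 900
t1, mib1 = datetime.strptime("18:13:16", "%H:%M:%S"), 637

elapsed_min = (t1 - t0).total_seconds() / 60
rate = (mib0 - mib1) / elapsed_min  # MiB deflated per minute
print(f"deflated {mib0 - mib1} MiB in {elapsed_min:.0f} min "
      f"(~{rate:.1f} MiB/min on average)")
```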

The other interesting observation is that after jshell exits, the balloon doesn't inflate again, although its actual size is way below its target size and there's plenty of free memory. I would have expected the balloon to automatically and continuously inflate as long as its size is below the target size and free memory is available. But nothing happens; there's no CPU usage in the Firecracker threads, and none in the guest either.

PS: these results were collected on a c5.metal instance.

JackThomson2 commented 2 days ago

Hi @simonis, I've been taking a look at this.

Deflate-on-OOM is indeed a slow process; it is managed by the balloon driver in the guest kernel, not by Firecracker itself. The driver seems to release as little memory as possible on each deflate.

The balloon will indeed not re-inflate after it has been deflated on OOM if it has already reached its target size; this appears to be by design in the driver. However, if the balloon has not yet reached its target size, it will continue trying to inflate even after being deflated.

The high CPU usage while trying to reach the target size again also appears to be by design: the driver will aggressively try to allocate memory to reach its target size.

Overall, it appears the driver intends the balloon to be inflated for shorter periods of time, e.g. to free up memory before an operation such as VM migration.
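A back-of-the-envelope model of why deflation crawls: if the driver hands back only a small fixed batch of pages per OOM event (the 256-pages-per-event figure below is purely illustrative, not taken from the kernel source), fully draining a 900 MiB balloon requires a lot of OOM round trips, each preceded by the guest grinding through reclaim:

```python
PAGE_SIZE = 4096
BATCH_PAGES = 256  # hypothetical pages released per OOM event (illustrative)

balloon_pages = 900 * 1024 * 1024 // PAGE_SIZE  # 230400 pages in a 900 MiB balloon
oom_events = balloon_pages // BATCH_PAGES
print(oom_events)  # -> 900 OOM events to drain the balloon completely
```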

Hope this helps