simonis opened 11 months ago
Hi Volker,
Thanks for reporting this. We will take a look and reproduce it, but in the meantime I'd like to point out that this configuration:
Guest kernel: 5.19.8, Host kernel: 6.5.7 (Ubuntu 20.04)
is not supported. Would you be able to try and reproduce with a supported set of host/guest kernels? What we test with is guest x host = [4.14, 5.10] x [4.14, 5.10, 6.1] (guest 6.1 might work too).
Also, to answer your question:
> So to summarize, is the described behavior with `deflate_on_oom=true` a bug in the implementation or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?
this should work, and we have tests that indicate it does work, i.e. the balloon gets deflated; however, we do not track the CPU time consumed to achieve this.
Hi @simonis, is the answer that @bchalios provided enough? Does it resolve your issue, or is there anything else to investigate?
Sorry for the late answer @bchalios , @pb8o. I finally managed to run my experiments on a "supported" platform, but unfortunately the results are exactly the same.
Host: 6.1.72-96.166.amzn2023.x86_64
Guest: 6.1.74 (with the config from microvm-kernel-ci-x86_64-6.1.config plus `CONFIG_IP_PNP=y`)
Firecracker: v1.6.0 and v1.7.0-dev (from today, 49db07b3)
So to summarize the problem: when I start Firecracker with a large ballooning device and `deflate_on_oom: true`, and then try to start a process in the guest which requires memory reserved by the balloon, the guest seems to hang and the Firecracker threads on the host run at 100% CPU:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 298704 ec2-user  20   0 1058260  85544  84964 R  57.9  0.0   5:45.06 fc_vcpu 1
 298703 ec2-user  20   0 1058260  85544  84964 R  56.6  0.0   5:42.44 fc_vcpu 0
 298698 ec2-user  20   0 1058260  85544  84964 R  26.2  0.0   2:33.69 firecracker
```
The guest itself is not really dead-locked, just extremely slow. I can ssh into it and see the following:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    139 root      20   0 3150612  36748     32 S 112.5  3.6  13:00.63 jshell
     44 root      20   0       0      0      0 R 100.0  0.0  10:13.17 kswapd0
```
The Java process I've started (i.e. `jshell`) is starving because it doesn't get enough memory. But it doesn't run into a hard OOM like when I'm running with `deflate_on_oom=false`. `kswapd0` is running at 100% within the guest.
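For context, a balloon like the one described here is configured through Firecracker's API before boot via `PUT /balloon`. A minimal sketch of the request body (the 900 MiB target is taken from this report; the 1 s stats polling interval is an assumption, needed to query the statistics shown below):

```python
import json

# Sketch of a PUT /balloon request body for Firecracker
# (amount_mib from this report; stats_polling_interval_s assumed).
balloon_config = {
    "amount_mib": 900,               # balloon target size in MiB
    "deflate_on_oom": True,          # let the guest driver deflate under memory pressure
    "stats_polling_interval_s": 1,   # enables the balloon statistics endpoint
}

body = json.dumps(balloon_config)
print(body)
```

The body would then be sent to the Firecracker API socket before starting the microVM.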
Querying the balloon metrics from the guest shows that the balloon slightly deflates itself, but this happens extremely slowly. E.g. initially we have something like this:
```json
{
  "target_pages": 230400,
  "actual_pages": 228864,
  "target_mib": 900,
  "actual_mib": 894,
  "swap_in": 0,
  "swap_out": 0,
  "major_faults": 7030729,
  "minor_faults": 14225986,
  "free_memory": 49930240,
  "total_memory": 1033064448,
  "available_memory": 0,
  "disk_caches": 655360,
  "hugetlb_allocations": 0,
  "hugetlb_failures": 0
}
```
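As a sanity check on these numbers: the page counts are 4 KiB pages, so `target_pages` and `actual_pages` convert directly to the reported MiB values (a quick check, not part of the original report):

```python
PAGE_SIZE = 4096  # balloon pages are 4 KiB, i.e. 256 pages per MiB

def pages_to_mib(pages: int) -> int:
    """Convert a balloon page count to MiB."""
    return pages * PAGE_SIZE // (1024 * 1024)

# Values from the statistics snapshot above.
print(pages_to_mib(230400))  # target_pages -> 900, matches target_mib
print(pages_to_mib(228864))  # actual_pages -> 894, matches actual_mib
```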
And after about 45 minutes we get to:
```json
{
  "target_pages": 230400,
  "actual_pages": 180992,
  "target_mib": 900,
  "actual_mib": 707,
  "swap_in": 0,
  "swap_out": 0,
  "major_faults": 18425420,
  "minor_faults": 37668071,
  "free_memory": 50409472,
  "total_memory": 1033064448,
  "available_memory": 0,
  "disk_caches": 806912,
  "hugetlb_allocations": 0,
  "hugetlb_failures": 0
}
```
If I wait about 60 minutes, `jshell` finally starts up and begins to be usable.
So ballooning is indeed "kind of" working, but not really practically usable. I would expect the ballooning device to deflate much more promptly in this case.
I did one more run to confirm the behavior and collect more numbers:
| time | target_mib | actual_mib | free_memory | available_memory |
|---|---|---|---|---|
| 17:05:44 | 900 | 900 | 63778816 | 0 |
| 17:24:26 | 900 | 893 | 50581504 | 0 |
| 17:36:34 | 900 | 879 | 51023872 | 0 |
| 17:44:46 | 900 | 870 | 54579200 | 0 |
| 17:55:46 | 900 | 835 | 67063808 | 0 |
| 18:06:14 | 900 | 797 | 50536448 | 0 |
| 18:13:16 | 900 | 637 | 55500800 | 0 |
| jshell exit | 900 | 637 | 321597440 | 251809792 |
| 18:25:12 | 900 | 637 | 338468864 | 270401536 |
| 18:41:25 | 900 | 637 | 338210816 | 270143488 |
As you can see, it takes more than an hour until `jshell` becomes responsive (somewhere between 18:06 and 18:13). It also looks like the deflation starts extremely slowly but gets faster as time goes on.
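To put a number on how slow this is: between 17:05:44 and 18:13:16 the balloon released 900 − 637 = 263 MiB, i.e. roughly 4 MiB per minute on average (a back-of-the-envelope calculation from the table, not part of the original report):

```python
from datetime import datetime

# Timestamps and actual_mib values taken from the table above.
start = datetime.strptime("17:05:44", "%H:%M:%S")
end = datetime.strptime("18:13:16", "%H:%M:%S")
released_mib = 900 - 637  # actual_mib went from 900 down to 637

elapsed_min = (end - start).total_seconds() / 60
rate = released_mib / elapsed_min  # ~3.9 MiB per minute on average
print(round(rate, 1))
```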
The other interesting observation is that after `jshell` exits, the balloon doesn't inflate again, although its actual size is way below its target size and there's plenty of free memory. I would have expected the balloon to automatically and continuously inflate if its size is below the target size and free memory is available. But nothing happens; there's no CPU usage in the Firecracker threads, neither on the host nor in the guest.
PS: these results were collected on a `c5.metal` instance.
Hi @simonis, I've been taking a look at this.
Deflate-on-OOM is indeed a slow process; it is managed by the balloon driver in the guest kernel, not by Firecracker itself. The driver seems to try to release as little memory as possible on deflate.
The balloon will indeed not re-inflate after deflating on OOM if it has already reached its target size; this appears to be by design in the driver. However, if the balloon has not yet reached its target size, it will continue trying to inflate even after being deflated.
The high CPU usage while trying to reach the target size again also appears to be by design: the driver aggressively tries to allocate memory to reach its target size.
Overall, it appears the driver intends the balloon to be inflated for shorter periods of time, to free up memory before an operation such as a VM migration.
Hope this helps!
After reading the Ballooning documentation, my understanding of the `deflate_on_oom` parameter is that, if it is set to `true`, the ballooning device will be deflated automatically if a process in the guest requires memory pages which cannot otherwise be provided.

However, if I run Firecracker with e.g. 2 vCPUs, 1 GB of memory and a balloon device of 900 MB, and then try to start a Java process in the guest with `-Xms800m -Xmx800m` (i.e. with a heap size of 800 MB), the Java process in the guest will hang, Firecracker will use ~200% CPU time, but the actual size occupied by the ballooning device in the guest will not change and remains at 900 MB. Once I reset the target size of the ballooning device to 100 MB, the Java process becomes unblocked and starts.

However, from the documentation of the `deflate_on_oom` option I would have expected that the guest kernel deflates the ballooning device automatically if `deflate_on_oom=true`?

If I run the same experiment with `deflate_on_oom=false`, I instantly get an out-of-memory error when trying to start the Java process, which is what I would have expected.

Also, if I increase (i.e. inflate) the balloon to 900 MB again after I started the Java process, I start getting warnings from the ballooning driver (as documented), but the CPU usage again goes up to almost ~200%. Is this expected? I mean, the warnings are OK, but I wouldn't expect Firecracker to burn all its CPU shares while trying to inflate the balloon.

So to summarize: is the described behavior with `deflate_on_oom=true` a bug in the implementation, or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?

PS: I've used the following kernel and FC versions for the experiments: Guest kernel: 5.19.8, Host kernel: 6.5.7 (Ubuntu 20.04), Firecracker: 1.5.1 and 1.6.0-dev (from today, 036d9906)