Open beliaev-maksim opened 2 years ago
This thread was migrated from launchpad.net
Sorry for leaving this bug unaddressed!
Could you provide (or point to) some submissions made with Jammy using remote? I would like to check the duration of the jobs to see if this is something to do with some specific jobs, or if it's something related to the network, etc.
Archives of ongoing Checkbox sessions (the stuff in /var/tmp/checkbox-ng/sessions/) should help as well, since they contain the same information about duration of jobs and current job being run.
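If it helps, one quick (illustrative) way to bundle those session directories for attachment to this bug; the archive name is arbitrary:
# Pack up the on-disk Checkbox sessions so they can be attached here
sudo tar czf checkbox-sessions-$(hostname)-$(date +%F).tar.gz /var/tmp/checkbox-ng/sessions/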
"Could you provide (or point to) some submissions made with Jammy using remote?" So far no, because none have completed, tehy get to the 48 hour timeout and end, the remote session disappears, the containers are reaped and that's it.So I never actually have a chance to get any logs. I can try to manually do one and see if I can capture logs that way.
You can at least see on my spreadsheet where each test run hung... it's always a long-term test like a stress test. https://docs.google.com/spreadsheets/d/1I26l91opzc0QuZWGlCbE4ZT63qMXv08k-eOywTsGOBI/
I'll see if I can get something useful otherwise... but really this should be pretty easy to replicate... just use remote to run the cert suite
One thing you can see right now, though, is that this ONLY affects running checkbox-remote against a Jammy host. This did not happen when running against Focal with the Jammy (HWE) kernel: https://certification.canonical.com/hardware/202205-30291/submission/277513/
Another thing you can see is this. A shorter testplan against Jammy:
https://certification.canonical.com/hardware/202101-28672/submission/277443
Looking at the submission, checkbox says the total test time was 4 hours 21 minutes, but the actual execution time (the amount of time from when testflinger launched until the test completed) was 10 hours and 35 minutes:
Cleanup complete 10:34:59
Attached is a log created when I run testflinger (passing the console dump through tee)
Also, I should note this only happens on the full SRU run, where the stress tests run. I have also tested Jammy with the Nvidia EGX provider I use for that effort, and testing works fine.
And as noted, this works OK with Focal HWE (5.15 on Focal), so I don't think it's a kernel thing but rather something that maybe changed in a library on Jammy.
FYI this should be very, very easy to reproduce.
Testing locally, install canonical-certification-server, launch a screen session and run:
/usr/bin/time -f %E test-regression
Then test using checkbox-remote locally over localhost:
/usr/bin/time -f %E checkbox-cli remote 127.0.0.1 /usr/bin/test-regression
When running Focal the times should be pretty similar.
When running Jammy, the "remote over localhost" run should take orders of magnitude longer than the "run directly without remote" one.
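For what it's worth, a minimal sketch of that comparison as a single script (assuming canonical-certification-server puts both test-regression and checkbox-cli on the PATH):
#!/bin/sh
# Time a direct run and a remote-over-localhost run of the same test plan,
# so the two wall-clock figures can be compared side by side.
/usr/bin/time -f "local:  %E" /usr/bin/test-regression
/usr/bin/time -f "remote: %E" checkbox-cli remote 127.0.0.1 /usr/bin/test-regression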
I have several machines STILL running this test using the "remote over localhost" scenario; they have been going for nearly three full days. I am going to restart them and try a much smaller test suite, just to see if I can get accurate times faster...
Did you manage to do a full run using Checkbox remote and push the results to C3? The best would be, using the same device, a run in local mode and a run using Checkbox remote, both uploaded to C3.
If you could also grab their respective journal logs, and any additional logs (like the testflinger you mentioned), that would be great.
So far, I have 2 guesses:
Still working on it. I tried using this:
checkbox-cli remote 127.0.0.1 test-regression
and that ran for a week before I killed it. (The same test ran for between 8 and 12 hours at most when run locally.)
I thought I could shorten it a bit by just running the memory stress test, so I created the attached launcher to run only the memory stress test, and that's where I am now. The launcher does automatic submission at the end, and I run it like this:
time -f %E ./timing-test-launcher #for local runs
time -f %E checkbox-cli remote 127.0.0.1 ./timing-test-launcher #for remote runs
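Not the attached launcher itself, but a minimal illustrative sketch of the general shape such a launcher takes (the test plan id is a placeholder, the automatic-submission settings are omitted, and the actual launcher is attached later in the thread):
# Illustrative only -- writes a minimal Checkbox launcher that forces a single
# (placeholder) test plan and runs unattended.
cat > timing-test-launcher <<'EOF'
#!/usr/bin/env checkbox-cli
[launcher]
launcher_version = 1
app_id = com.canonical.certification:timing-test

[test plan]
unit = com.canonical.certification::PLACEHOLDER-memory-stress-plan
forced = yes

[test selection]
forced = yes

[ui]
type = silent
EOF
chmod +x timing-test-launcher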
I noticed that there seems to be NO difference between running against localhost vs. a network interface, and since this is all being run locally on the machine, that rules out any legitimate network weirdness between the SUT and the testflinger agent that normally runs checkbox remote tests.
Looking at the logs (syslog and journal) I noticed that checkbox-ng.service is being killed and restarted a few times by OOMKiller. That said, I suspect that's because some test cases deliberately trigger OOMKiller, and when it happens checkbox-ng.service just restarts and keeps doing its thing.
That said, I killed the test on one machine and modified the checkbox service using
OOMScoreAdjust=-1000
which is supposed to keep it from being reaped (at least Google says so), and I restarted the test just to see what happens. The other machines are still running, though (and have been running since last week); those I'm going to let run to completion to see what they actually do.
But I suspect the reaping of checkbox-ng.service is a red herring here; I just include it for completeness' sake.
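For reference, a sketch of how that override can be applied as a systemd drop-in (assuming the packaged unit name checkbox-ng.service, as seen in the logs below):
# Create a drop-in that lowers the service's OOM score so the kernel OOM killer
# avoids picking the checkbox service itself as a victim.
sudo mkdir -p /etc/systemd/system/checkbox-ng.service.d
printf '[Service]\nOOMScoreAdjust=-1000\n' | sudo tee /etc/systemd/system/checkbox-ng.service.d/oom.conf
sudo systemctl daemon-reload
sudo systemctl restart checkbox-ng.service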
I have two other machines that are now running the remote test but using Focal+HWE to see if this is a kernel thing, or a Jammy+ userspace thing.
So, in summary:
Focal+HWE Local: runs as expected
Focal+HWE Remote: still to do
Jammy GA Local: runs as expected, completes memory stress in 3-4 hours
Jammy GA Remote: in progress; has already run for several days on a test case that should have finished in a few hours
Using the attached launcher, this is triggerable 100% of the time for me on every system I've tried, and all of this points to your first guess. Note that, of course, GitHub insists that every file have one of its limited number of extensions, even for files that don't need or want an extension, so remove the .txt before using it... timing-test-launcher.txt
Have you ever seen a Jammy session complete at all when using Checkbox remote?
When you say
That said, I suspect that's because some test cases deliberately trigger OOMKiller, and when it happens checkbox-ng.service just restarts and keeps doing its thing.
do you actually have a view on what Checkbox is doing? (I mean, are you monitoring the run using a remote device?) If so, what is it showing?
I'm starting to wonder if the red herring you're mentioning could be the actual culprit.
For months, the QA team had trouble running memory stress tests on Jammy because of how the OOM killer is set up. One of the QA engineers recently proposed a PR (#297) that was merged in the last release. It is the outcome of a long discussion about this problem.
Have you tried running everything but the memory stress tests?
Honestly, I'm not sure if the checkbox service being OOM-killed is the issue (it could well be), only because, very unscientifically, the instances in the journal/logs where I can see the service being killed and restarted seem to correspond with specific test cases that expect to trigger the OOM killer, and so far I would expect to see those messages far more often than just 4 or 5 times. That said, I also tried using OOMScoreAdjust but the service is still being reaped, so that isn't a fix. Maybe on that one machine I'll next try what the QA team suggested (I was loosely following that thread after commenting on it a bit).
FWIW, I'm at 16 restarts now on this one machine where I'm watching the logs:
Mar 7 17:30:21 barbos kernel: [76196.638862] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=checkbox-ng.service,mems_allowed=0,global_oom,task_memcg=/system.slice/checkbox-ng.service,task=stress-ng-mlock,pid=44164,uid=0
Mar 7 17:30:22 barbos kernel: [76196.896971] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=checkbox-ng.service,mems_allowed=0,global_oom,task_memcg=/system.slice/checkbox-ng.service,task=stress-ng-mlock,pid=43937,uid=0
Mar 7 17:30:23 barbos kernel: [76197.849642] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=checkbox-ng.service,mems_allowed=0,global_oom,task_memcg=/system.slice/checkbox-ng.service,task=stress-ng-mlock,pid=44060,uid=0
Mar 7 17:30:23 barbos systemd[1]: checkbox-ng.service: A process of this unit has been killed by the OOM killer.
Mar 7 17:30:44 barbos systemd[1]: checkbox-ng.service: Failed with result 'oom-kill'.
Mar 7 17:30:44 barbos systemd[1]: checkbox-ng.service: Consumed 4d 10h 23min 14.295s CPU time.
Mar 7 17:30:44 barbos systemd[1]: checkbox-ng.service: Scheduled restart job, restart counter is at 16.
Mar 7 17:30:44 barbos systemd[1]: checkbox-ng.service: Consumed 4d 10h 23min 14.295s CPU time.
Mar 7 17:30:46 barbos checkbox-ng.service[44171]: normal_user not supplied via config(s).
Mar 7 17:30:46 barbos checkbox-ng.service[44171]: Using `ubuntu` user
I had also thought to maybe try creating a testplan that runs everything BUT the stress tests to see what happens too.
As for monitoring: on the machine I'm getting logs from, I'm ssh'd in and running the test in a screen session, and in a different terminal I'm also ssh'd in, just doing a `tail -f /var/log/syslog | grep checkbox-ng.service` to see where the restarts are happening.
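If the journal is easier to follow than syslog, an equivalent watch is:
# Follow only the checkbox-ng unit's journal entries (restarts, oom-kill results, etc.)
journalctl -u checkbox-ng.service -f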
OK, so I had Focal HWE (5.15) running on a machine and finally just killed the test run. The total time to that point was 170 hours; the local run of the same test case was only 2 hours. So this is related to the kernel and remote. Unfortunately, the fix for the OOMD config doesn't work here anyway, as the files it looks for don't exist on Focal (not on my test machine, at least). I can try it sometime on a Jammy machine to see what happens, but as we still have to test Focal, it's not a viable solution even if it does help on Jammy systems.
So now, to summarise, this bug prevents me from using checkbox remote on anything with 5.15.
I've run the following test:
- sudo snap install checkbox22 && sudo snap install checkbox --classic
- checkbox-cli remote <DUT IP>, selecting memory_stress_ng
- on the DUT: watch systemctl status snap.checkbox.service.service and sudo journalctl -f
From the remote controller, I can see the usual session running, launching resource jobs, etc.
Then it starts the memory stress-ng job. Each stressor runs, one by one. It takes a while. After around 3+ hours, the bigheap stressor triggers an oomd kill which kills the checkbox service on the DUT. As I was expecting, the session is lost and I am prompted with a new "Select Test Plan" screen. This is why the QA team made a fix.
I'm now going to try on a 20.04 machine.
Used Checkbox remote to connect to a Focal laptop with kernel 5.15.0-67-generic, and ran the com.canonical.certification::20.04-server-full test plan, selecting only memory/memory_stress_ng.
During the test, I can see this with systemctl status snap.checkbox.service.service:
● snap.checkbox.service.service - Service for snap application checkbox.service
Loaded: loaded (/etc/systemd/system/snap.checkbox.service.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2023-03-16 14:58:18 CST; 16min ago
Main PID: 80976 (python3)
Tasks: 19 (limit: 14068)
Memory: 3.6G
CGroup: /system.slice/snap.checkbox.service.service
├─80976 python3 /snap/checkbox20/current/bin/checkbox-cli service
├─84689 systemd-inhibit stress_ng_test.py memory
├─84693 python3 /tmp/nest-fkzeb7t7.e1013b014ee4ce382e5a1291c27c2890a7088a1741917d4aa1a85269eb1399fe/stress_ng_test.py memory
├─85172 stress-ng --aggressive --verify --timeout 300 --context 0
├─85173 stress-ng-context [run]
├─85174 stress-ng-context [run]
├─85175 stress-ng-context [run]
├─85176 stress-ng-context [run]
├─85177 stress-ng-context [run]
├─85178 stress-ng-context [run]
├─85179 stress-ng-context [run]
├─85180 stress-ng-context [run]
└─85181 stress-ng-ignite [periodic]
At some point, the laptop became really unresponsive and I thought it was dead, but when I came back later, the test had finished. So things look OK on focal, using Checkbox remote. Focal does not use systemd-oomd...
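For reference, a quick way to confirm whether systemd-oomd is actually present and running on a given image:
# Prints "active" where systemd-oomd is enabled (default Jammy desktop); on Focal,
# which does not use systemd-oomd, the unit is absent and this reports "inactive".
systemctl is-active systemd-oomd.service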
I've also run the same test as in a previous comment on the same laptop, but this time in local mode. checkbox-cli was killed after 150 minutes of running memory/memory_stress_ng:
Running 22.04 Server Full without memory/memory_stress_ng on the same device as in this comment, using the same version of Checkbox remote (2.1), I can complete the run. The submission is available here.
During the run, the cgroup shown in systemctl status snap.checkbox.service.service was the same:
CGroup: /system.slice/snap.checkbox.service.service
└─226680 python3 /snap/checkbox22/current/bin/checkbox-cli service
So it probably has to do with memory/memory_stress_ng.
Maybe it would be worth trying the solution brought in by the QA team in #297 in the server test plan to see if that improves the situation. @bladernr, if you have time to do that today, please go ahead; otherwise I'll try it on Monday.
Edit: I've launched the 22.04 Desktop Stress Test Plan which includes the fix made in #297 on the same device. We'll see in a few hours what happens...
Argh, of course the session hit #22 because there are some poweroff and reboot stress tests, so I lost the session.
Restarting the same without any poweroff/reboot stress tests.
Unfortunately, I also hit #22 when running memory_stress_ng, when the bigheap stressor is in use:
Just FYI in case it was missed, this is the launcher I'm using:
https://github.com/canonical/checkbox/files/10902924/timing-test-launcher.txt
Also, it's stress-ng in general, I think. I'm just using memory stress as an arbitrary choice (it was the shortest run on systems with a smaller footprint, since the runtime depends on the amount of RAM installed and isn't necessarily dependent on the number of CPU cores). FWIW, here's one of the small systems (a small, older server):
ubuntu@mayapple:~$ ps -p 2520196 -o pid,etime,cmd
    PID     ELAPSED CMD
2520196 13-03:57:01 /usr/bin/time -f %E checkbox-cli remote 127.0.0.1 ./timing-test-launcher
GAH... It is definitely because of the OOM killer... I finally noticed the pattern in the syslog entries...
Mar 19 00:24:16 mayapple systemd[1]: checkbox-ng.service: Consumed 13h 21min 11.802s CPU time.
Mar 19 00:24:16 mayapple systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.
Mar 19 00:24:17 mayapple systemd[1]: checkbox-ng.service: Scheduled restart job, restart counter is at 218.
Mar 19 00:24:17 mayapple systemd[1]: Stopped Checkbox Remote Service.
Mar 19 00:24:17 mayapple systemd[1]: checkbox-ng.service: Consumed 13h 21min 11.802s CPU time.
Mar 19 00:24:18 mayapple systemd[1]: Started Checkbox Remote Service.
Mar 19 00:24:24 mayapple checkbox-ng.service[2622755]: normal_user not supplied via config(s).
Mar 19 00:24:24 mayapple checkbox-ng.service[2622755]: Using `ubuntu` user
Mar 19 00:24:32 mayapple stress-ng: invoked with 'stress-ng --aggressive --verify --timeout 300 --bsearch 0' by user 0 'root'
Mar 19 00:24:32 mayapple stress-ng: system: 'mayapple' Linux 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64
Mar 19 00:24:32 mayapple stress-ng: memory (MB): total 5930.07, free 5301.74, shared 3.06, buffer 10.57, swap 4096.00, free swap 3945.07
Then just an hour or so later:
Mar 19 01:40:00 mayapple systemd[1]: checkbox-ng.service: A process of this unit has been killed by the OOM killer.
Mar 19 01:40:00 mayapple systemd[1]: checkbox-ng.service: Killing process 2622755 (checkbox-cli) with signal SIGKILL.
Mar 19 01:40:00 mayapple systemd[1]: checkbox-ng.service: Failed with result 'oom-kill'.
Mar 19 01:40:00 mayapple systemd[1]: checkbox-ng.service: Consumed 13h 21min 19.980s CPU time.
Mar 19 01:40:00 mayapple systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.
Mar 19 01:40:02 mayapple systemd[1]: checkbox-ng.service: Scheduled restart job, restart counter is at 219.
Mar 19 01:40:02 mayapple systemd[1]: Stopped Checkbox Remote Service.
Mar 19 01:40:02 mayapple systemd[1]: checkbox-ng.service: Consumed 13h 21min 19.980s CPU time.
Mar 19 01:40:03 mayapple systemd[1]: Started Checkbox Remote Service.
Mar 19 01:40:10 mayapple checkbox-ng.service[2623293]: normal_user not supplied via config(s).
Mar 19 01:40:10 mayapple checkbox-ng.service[2623293]: Using `ubuntu` user
Mar 19 01:40:18 mayapple stress-ng: invoked with 'stress-ng --aggressive --verify --timeout 300 --bsearch 0' by user 0 'root'
Mar 19 01:40:18 mayapple stress-ng: system: 'mayapple' Linux 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64
Mar 19 01:40:18 mayapple stress-ng: memory (MB): total 5930.07, free 5287.66, shared 3.03, buffer 10.55, swap 4096.00, free swap 3946.59
and on and on... SO... the OOM killer is killing checkbox, which restarts and starts running the stress test again from scratch, since it's not stateful. THAT is why this is going on and on; it's in an unending loop.
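One way to quantify that loop, for what it's worth, is to count the restarts recorded in the journal for the current boot:
# Count how many times systemd has restarted the Checkbox service since boot
journalctl -u checkbox-ng.service -b | grep -c 'Scheduled restart job'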
So, ideas:
- Figure out how to fix the systemd-oomd crap on server.
- This may also deserve filing bugs against systemd-oomd, perhaps...
The first one is probably the path of least resistance...
Also, I hate that it has taken me this long to finally see that pattern... my eyes have been going blurry every time I look at the logs and output from these runs (that, and they take so very long that I get lost in context switching...).
@bladernr just to confirm, the logs in your comment come from a 20.04 server running checkbox remote?
The fix made by QA team a few weeks ago would not cover 20.04 as you mentioned earlier, but it might be worth trying it with 22.04.
Come to think of it, the stress tests should have a soft dependency on stress/store_and_change_oomd_config (that is, their jobs should use after: stress/store_and_change_oomd_config to execute that job first and then proceed regardless of its outcome). This way the jobs would still work on 20.04.
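Roughly, an illustrative fragment of what that could look like in a job definition (not an actual job from the provider; unlike depends:, the after: constraint only orders the jobs and does not skip the stress job when the oomd-config job fails or is skipped, e.g. on 20.04):
id: memory/memory_stress_ng
after: stress/store_and_change_oomd_config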
So, to summarize:
I'm not sure if the stress/store_and_change_oomd_config jobs would work on a server image; from what I can recall, /usr/lib/systemd/system/user@.service.d/10-oomd-user-service-defaults.conf does not exist in the Jammy server image.
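A quick way to check for it on any given image:
# Check whether the oomd user-service defaults file that the oomd-config jobs rely on exists here
ls -l /usr/lib/systemd/system/user@.service.d/10-oomd-user-service-defaults.conf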
One experiment I think you can try is to run the stress-ng-automated test plan and show us the result. There was a recent change in that test plan that breaks the stress-ng jobs down from class-based jobs to stressor-based jobs; with that approach we've already observed an improvement in preventing Checkbox running on low-profile IoT devices from being killed by the kernel oom-killer (not systemd-oomd as in the desktop image, since we're running on server and core images).
FYI, on server we already run stressor-based jobs, and that has been the case since the beginning. The script that runs all the stress-ng jobs for memory, CPU, and storage runs stressors either individually/sequentially (memory and storage) or in parallel (CPU stress). We don't really call the "class" stressor groups anyway.
I did find time to try the code in those jobs that modify systemd-oomd, and I can confirm that the necessary files are not present on servers, both 20.04 and 22.04. I'm currently running the following on 4 different servers:
Prunus: trying 20.04+HWE to verify this happens there too
mayapple: 22.04 GA, modifying vm.oom_kill_allocating_task to see if changing that to non-zero will help here
makrutlime: 20.04 GA, because I need to see what the config differences are between 5.4 and 5.15 (maybe the vm.oom* options are different?)
gurley: 22.04, going to try a suggestion from the Server team to see if installing systemd-oomd on server and then running those oomd modification jobs has any effect
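For reference, the mayapple experiment boils down to something like this (vm.oom_kill_allocating_task is a stock kernel sysctl; the value shown is simply the "non-zero" setting mentioned above, and the sysctl.d file name is illustrative):
# Tell the kernel OOM killer to kill the task that triggered the out-of-memory
# allocation (i.e. the stress-ng worker) instead of scoring all tasks and
# potentially picking the checkbox service as the victim.
sudo sysctl -w vm.oom_kill_allocating_task=1
# Persist the setting across reboots
echo 'vm.oom_kill_allocating_task = 1' | sudo tee /etc/sysctl.d/99-oom-allocating-task.conf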
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/CHECKBOX-1112.
This message was autogenerated
This issue was migrated from https://bugs.launchpad.net/checkbox-ng/+bug/1980288
Summary
Description
I've spent weeks now doing repeated testing to get the data needed to highlight this.
When running SRU tests against Jammy, it takes upwards of 2 days per run, whereas for any other Ubuntu release the test takes only a few hours.
This is not strictly an OS issue, as running the SRU tests locally on Jammy takes significantly less time than running them via checkbox remote (as done with testflinger). Because of this, I'm practically unable to do SRU testing on Jammy, as each SRU cycle potentially means testing 5 different kernels on each machine.
I have a spreadsheet with collated timing data for every current LTS kernel using the same SRU tests. I can share that if necessary, but the tl;dr of that spreadsheet is:
Bionic: 11:23:49
Bionic HWE: 9:45:47
Focal: 9:37:33
Focal HWE: 10:46:38
Jammy TF: 48:17:45
Jammy Local: 17:23:34
As you can see, Bionic and Focal run to completion in less than 12 hours on average.
Jammy TF is Jammy run via Testflinger using checkbox-remote. In EVERY instance, it hits the global timeout because of something going on that is specific to checkbox-remote and Jammy.
Jammy Local is instances where I did a deployment of Jammy, then logged into the machines directly and ran the same SRU suite by hand in a screen session. That average DOES take longer on Jammy, so a component of this is related to Jammy itself and I'll start tracking that down later... but for now, the fact that checkbox-remote exacerbates this to the point where testing never completes is a huge problem.
I logged times across 27 machines to get this data. Some systems I didn't get times for due to other issues with the machines and MAAS during deployment, but those are all deployment issues, not issues once deployed and running tests.
Attachments
doubletusk-jammy-egx-2022-08-19_1209.log