BTW, a workaround suggested by Marek (increasing the dom0-boost param in /etc/qubes/qmemman.conf) does not work.
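For reference, a sketch of what the relevant entry in /etc/qubes/qmemman.conf looks like; the exact key name may differ between releases, so verify it against the file shipped with your version:

# /etc/qubes/qmemman.conf - keep this much extra memory reserved for dom0 (assumed key name)
dom0-mem-boost = 650MiB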
FWIW, I double checked that my qmemman got restarted after I increased the dom0-boost param to 650 MB, and I'm still getting OOM around once a day... :( I know @marmarek also experienced this, but I'm really surprised that no one else reports this problem. For years I never had any stability problems related to OOM (or anything else), and then suddenly, recently, this happens all the time...
On Tue, Sep 05, 2017 at 09:13:51AM -0700, Joanna Rutkowska wrote:
Qubes OS version: R3.2

In the last weeks I've been experiencing frequent Out Of Memory (OOM) killer actions in dom0, likely due to incorrect memory reporting by the dom0 kernel. This results in X in dom0 getting restarted. In order to recover the system, the user must:
- Log in again (this is easy, since X defaults to the login page after it gets restarted),
- Open a terminal and manually remove the /var/run/shm.id file.
(Those two steps must be performed from a text console, not the GUI.)
- Log out and log in again (i.e. restart X). After this all the GUI daemons connect back to all the open windows, so no work is lost.

One problem is likely some (upstream?) bug in memory reporting, which results in the Qubes memory balancer (qmemman) incorrectly taking too much memory from dom0, leaving the OOM killer no other choice than to kill some memory-hungry process (and X is often a good candidate).
Another problem (Qubes-related) is the need to manually remove the shm.id file and restart X. This should be done automatically, so that the user could just log in and have all the windows automatically restored (guid connected back).
This is already fixed in Qubes 4.0. But the fix (or actually, the update procedure) is too fragile to backport to 3.2, because it involves renaming /var/run/shm.id, which is referenced from various places. Normally the X server removes this file on termination, but it doesn't get a chance to when it's killed with SIGKILL (and this is what the OOM killer does). The reason not to override /var/run/shm.id at X server startup is that it may belong to another (still running!) X server instance. This is the case when clicking the "switch user" option, when using the default lightdm screen locker, and in many more cases. Overriding /var/run/shm.id in that case would mean losing GUI access to all AppVMs.
One of my crashes looks like this: https://gist.github.com/marmarek/1215529d91f44254d38e0f30d8e084de Interesting part:
Sep 03 17:34:20 dom0 kernel: xen-blkback: backend/vbd/117/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 03 17:34:20 dom0 kernel: xen-blkback: backend/vbd/117/51728: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 03 17:34:20 dom0 kernel: xenwatch invoked oom-killer: gfp_mask=0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Sep 03 17:34:20 dom0 kernel: xenwatch cpuset=/ mems_allowed=0
(...)
Sep 03 17:34:20 dom0 kernel: Call Trace:
Sep 03 17:34:20 dom0 kernel: [<ffffffff813e45e3>] dump_stack+0x63/0x90
Sep 03 17:34:20 dom0 kernel: [<ffffffff8123fd7d>] dump_header+0x7b/0x1fd
Sep 03 17:34:20 dom0 kernel: [<ffffffff817daf9a>] ? _raw_spin_unlock_irqrestore+0x1a/0x20
Sep 03 17:34:20 dom0 kernel: [<ffffffff811b6939>] oom_kill_process+0x219/0x3e0
Sep 03 17:34:20 dom0 kernel: [<ffffffff811b6e33>] out_of_memory+0x143/0x4d0
Sep 03 17:34:20 dom0 kernel: [<ffffffff811bc79d>] __alloc_pages_slowpath+0xbbd/0xc10
Sep 03 17:34:20 dom0 kernel: [<ffffffff811bca46>] __alloc_pages_nodemask+0x256/0x2c0
Sep 03 17:34:20 dom0 kernel: [<ffffffff81211105>] alloc_pages_current+0x95/0x140
Sep 03 17:34:20 dom0 kernel: [<ffffffff811dd018>] kmalloc_order+0x18/0x40
Sep 03 17:34:20 dom0 kernel: [<ffffffff811dd064>] kmalloc_order_trace+0x24/0xa0
Sep 03 17:34:20 dom0 kernel: [<ffffffffc000e49f>] ? read_per_ring_refs+0x30f/0x400 [xen_blkback]
Sep 03 17:34:20 dom0 kernel: [<ffffffffc000e2a4>] read_per_ring_refs+0x114/0x400 [xen_blkback]
Sep 03 17:34:20 dom0 kernel: [<ffffffff813f1209>] ? snprintf+0x49/0x60
Sep 03 17:34:20 dom0 kernel: [<ffffffffc000e8e1>] connect_ring.isra.7+0x351/0x4c0 [xen_blkback]
Sep 03 17:34:20 dom0 kernel: [<ffffffffc000f23a>] frontend_changed+0xda/0x140 [xen_blkback]
Sep 03 17:34:20 dom0 kernel: [<ffffffff814ceb83>] xenbus_otherend_changed+0x103/0x110
I think my original diagnosis was wrong: the OOM happens some time after qmemman takes memory from dom0, not during that operation. Looking at the call trace, it happens when xen-blkfront in the VM connects to xen-blkback in dom0. Each of the OOMs in my logs happens at exactly the same place. So maybe it's about xen-blkback in the 4.9.x kernel being more memory hungry, or there is some memory leak. This also explains why the qmemman config modification does not help.
@rootkovska can you check if it's xenwatch triggering the OOM?
Also, what was the system uptime at the time of the OOM, counted in VM startups? In other words, what is the max XID of the running VMs? In my case the minimal number at which the OOM happened was 52.
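For anyone checking this, a quick way to read the maximum XID of the currently running VMs from dom0 (assuming the standard xl toolstack; the awk filter just picks the ID column):

xl list | awk 'NR>1 {print $2}' | sort -n | tail -1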
The most recent OOM:
Sep 07 11:03:12 dom0 root[5474]: /etc/xen/scripts/block-snapshot: add XENBUS_PATH=backend/vbd/38/51712
Sep 07 11:03:12 dom0 root[5491]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/38/51728
Sep 07 11:03:12 dom0 root[5516]: /etc/xen/scripts/block-snapshot: Writing backend/vbd/38/51712/node /dev/mapper/snapshot-fd01:22415761-fd01:22415856 to x
Sep 07 11:03:12 dom0 root[5518]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/38/51744
Sep 07 11:03:12 dom0 root[5529]: /etc/xen/scripts/block-snapshot: Writing backend/vbd/38/51712/physical-device fd:4 to xenstore.
Sep 07 11:03:13 dom0 root[5542]: /etc/xen/scripts/block-snapshot: Writing backend/vbd/38/51712/hotplug-status connected to xenstore.
Sep 07 11:03:13 dom0 root[5555]: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/38/51760
Sep 07 11:03:13 dom0 root[5625]: /etc/xen/scripts/block: Writing backend/vbd/38/51728/node /dev/loop24 to xenstore.
Sep 07 11:03:13 dom0 root[5635]: /etc/xen/scripts/block: Writing backend/vbd/38/51728/physical-device 7:18 to xenstore.
Sep 07 11:03:13 dom0 root[5642]: /etc/xen/scripts/block: Writing backend/vbd/38/51728/hotplug-status connected to xenstore.
Sep 07 11:03:13 dom0 root[5673]: /etc/xen/scripts/block: Writing backend/vbd/38/51744/node /dev/loop25 to xenstore.
Sep 07 11:03:13 dom0 root[5684]: /etc/xen/scripts/block: Writing backend/vbd/38/51744/physical-device 7:19 to xenstore.
Sep 07 11:03:13 dom0 root[5690]: /etc/xen/scripts/block: Writing backend/vbd/38/51744/hotplug-status connected to xenstore.
Sep 07 11:03:14 dom0 root[5859]: /etc/xen/scripts/block: Writing backend/vbd/38/51760/node /dev/loop3 to xenstore.
Sep 07 11:03:14 dom0 root[5863]: /etc/xen/scripts/block: Writing backend/vbd/38/51760/physical-device 7:3 to xenstore.
Sep 07 11:03:14 dom0 root[5865]: /etc/xen/scripts/block: Writing backend/vbd/38/51760/hotplug-status connected to xenstore.
Sep 07 11:03:16 dom0 kernel: xen-blkback: backend/vbd/38/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 07 11:03:16 dom0 kernel: xen-blkback: backend/vbd/38/51728: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 07 11:03:16 dom0 kernel: xen-blkback: backend/vbd/38/51744: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 07 11:03:17 dom0 kernel: xen-blkback: backend/vbd/38/51760: using 4 queues, protocol 1 (x86_64-abi) persistent grants
Sep 07 11:03:27 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Sep 07 11:03:27 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Sep 07 11:03:27 dom0 kernel: CPU: 2 PID: 7916 Comm: Xorg Tainted: G U 4.9.45-21.pvops.qubes.x86_64 #1
Sep 07 11:03:27 dom0 kernel: Hardware name: Dell Inc. XPS 13 9343/0310JH, BIOS A04 05/15/2015
The previous one:
Sep 04 13:05:37 dom0 qmemman.systemstate[1756]: dom '0' still hold more memory than have assigned (2619953152 > 2598224688)
Sep 04 13:05:39 dom0 qmemman.systemstate[1756]: dom '0' didnt react to memory request (holds 2619953152, requested balloon down to 2558453125)
Sep 04 13:06:05 dom0 kernel: audit_printk_skb: 15 callbacks suppressed
Sep 04 13:07:22 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Sep 04 13:07:22 dom0 kernel: CPU: 3 PID: 3112 Comm: Xorg Tainted: G U W 4.9.45-21.pvops.qubes.x86_64 #1
Sep 04 13:07:22 dom0 kernel: Hardware name: Dell Inc. XPS 13 9343/0310JH, BIOS A04 05/15/2015
I've tried to reproduce it on a testing machine, but failed. Starting a large number of VMs (simultaneously or sequentially) does not trigger it there, and neither does inducing large memory usage in some of them. I think having xl list output from the crash time (VM names can be obscured) would be useful.
@rootkovska do you have a call trace somewhere there? But I see it's different on your system - "Xorg invoked oom-killer" vs "xenwatch invoked oom-killer". In any case, it is just after xen-blkfront connects to xen-blkback.
Just to say that I often have the same problem on my T450s, with many VMs running, or sometimes fewer. Happy to read about the workaround to recover my windows without rebooting. I will try this.
Here's some more for your debugging pleasure :)
Sep 07 11:03:27 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Sep 07 11:03:27 dom0 kernel: CPU: 2 PID: 7916 Comm: Xorg Tainted: G U 4.9.45-21.pvops.qubes.x86_64 #1
Sep 07 11:03:27 dom0 kernel: Hardware name: Dell Inc. XPS 13 9343/0310JH, BIOS A04 05/15/2015
Sep 07 11:03:27 dom0 kernel: ffffc90003883500 ffffffff813e62c3 ffffc900038836d0 ffff88009e6fd880
Sep 07 11:03:27 dom0 kernel: ffffc90003883578 ffffffff812414bd 0000000000000003 0000000000000000
Sep 07 11:03:27 dom0 kernel: 0000000000000000 0000000000000000 ffff88011b249b20 0000000000000001
Sep 07 11:03:27 dom0 kernel: Call Trace:
Sep 07 11:03:27 dom0 kernel: [<ffffffff813e62c3>] dump_stack+0x63/0x90
Sep 07 11:03:27 dom0 kernel: [<ffffffff812414bd>] dump_header+0x7b/0x1fd
Sep 07 11:03:27 dom0 kernel: [<ffffffff817de41a>] ? _raw_spin_unlock_irqrestore+0x1a/0x20
Sep 07 11:03:27 dom0 kernel: [<ffffffff811b7ef9>] oom_kill_process+0x219/0x3e0
Sep 07 11:03:27 dom0 kernel: [<ffffffff811b83f3>] out_of_memory+0x143/0x4d0
Sep 07 11:03:27 dom0 kernel: [<ffffffff811bdd5d>] __alloc_pages_slowpath+0xbbd/0xc10
Sep 07 11:03:27 dom0 kernel: [<ffffffff811be006>] __alloc_pages_nodemask+0x256/0x2c0
Sep 07 11:03:27 dom0 kernel: [<ffffffff81212865>] alloc_pages_current+0x95/0x140
Sep 07 11:03:27 dom0 kernel: [<ffffffff811de5e8>] kmalloc_order+0x18/0x40
Sep 07 11:03:27 dom0 kernel: [<ffffffff811de634>] kmalloc_order_trace+0x24/0xa0
Sep 07 11:03:27 dom0 kernel: [<ffffffff8121fe73>] __kmalloc+0x1f3/0x210
Sep 07 11:03:27 dom0 kernel: [<ffffffffc019fb9e>] ? alloc_gen8_temp_bitmaps+0x2e/0x80 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc019fbb7>] alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc019fc90>] gen8_alloc_va_range_3lvl+0xa0/0x9b0 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffff814da513>] ? xen_swiotlb_map_sg_attrs+0x1f3/0x2e0
Sep 07 11:03:27 dom0 kernel: [<ffffffff814da4d3>] ? xen_swiotlb_map_sg_attrs+0x1b3/0x2e0
Sep 07 11:03:27 dom0 kernel: [<ffffffffc01a093e>] gen8_alloc_va_range+0x23e/0x490 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc01a250d>] i915_vma_bind+0x6d/0x120 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc01a6a91>] __i915_vma_do_pin+0x261/0x480 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc0197374>] i915_gem_execbuffer_reserve_vma.isra.42+0x144/0x1b0 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc0197775>] i915_gem_execbuffer_reserve.isra.43+0x395/0x3c0 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc0198ef1>] i915_gem_do_execbuffer.isra.48+0x511/0x16b0 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc019a442>] i915_gem_execbuffer2+0xa2/0x1e0 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc0094b3b>] drm_ioctl+0x21b/0x4d0 [drm]
Sep 07 11:03:27 dom0 kernel: [<ffffffffc019a3a0>] ? i915_gem_execbuffer+0x310/0x310 [i915]
Sep 07 11:03:27 dom0 kernel: [<ffffffff8110bf4d>] ? enqueue_hrtimer+0x3d/0x90
Sep 07 11:03:27 dom0 kernel: [<ffffffff817de41a>] ? _raw_spin_unlock_irqrestore+0x1a/0x20
Sep 07 11:03:27 dom0 kernel: [<ffffffff8110c8eb>] ? hrtimer_start_range_ns+0x1db/0x3f0
Sep 07 11:03:27 dom0 kernel: [<ffffffff8125a5d1>] do_vfs_ioctl+0xa1/0x5e0
Sep 07 11:03:27 dom0 kernel: [<ffffffff8114772b>] ? __audit_syscall_entry+0xab/0xf0
Sep 07 11:03:27 dom0 kernel: [<ffffffff8100346d>] ? syscall_trace_enter+0x1bd/0x2d0
Sep 07 11:03:27 dom0 kernel: [<ffffffff8125ab89>] SyS_ioctl+0x79/0x90
Sep 07 11:03:27 dom0 kernel: [<ffffffff81003c17>] do_syscall_64+0x67/0x180
Sep 07 11:03:27 dom0 kernel: [<ffffffff817de92b>] entry_SYSCALL64_slow_path+0x25/0x25
Sep 07 11:03:27 dom0 kernel: Mem-Info:
Sep 07 11:03:27 dom0 kernel: active_anon:100522 inactive_anon:104313 isolated_anon:0
active_file:82715 inactive_file:95342 isolated_file:0
unevictable:63 dirty:1762 writeback:0 unstable:0
slab_reclaimable:13452 slab_unreclaimable:25091
mapped:40201 shmem:123734 pagetables:7518 bounce:0
free:9739 free_pcp:185 free_cma:0
Sep 07 11:03:27 dom0 kernel: Node 0 active_anon:402088kB inactive_anon:417252kB active_file:330860kB inactive_file:381368kB unevictable:252kB isolated(anon):0kB isolated(file):0kB mapped
Sep 07 11:03:27 dom0 kernel: Node 0 DMA free:15608kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:16kB unevictable:0kB writepending:0kB pre
Sep 07 11:03:27 dom0 kernel: lowmem_reserve[]: 0 3362 3896 3896
Sep 07 11:03:27 dom0 kernel: Node 0 DMA32 free:20704kB min:6876kB low:10316kB high:13756kB active_anon:360864kB inactive_anon:334552kB active_file:242972kB inactive_file:302868kB unevict
Sep 07 11:03:27 dom0 kernel: lowmem_reserve[]: 0 0 534 534
Sep 07 11:03:27 dom0 kernel: Node 0 Normal free:2644kB min:1092kB low:1636kB high:2180kB active_anon:41224kB inactive_anon:82760kB active_file:87932kB inactive_file:78516kB unevictable:8
Sep 07 11:03:27 dom0 kernel: lowmem_reserve[]: 0 0 0 0
Sep 07 11:03:27 dom0 kernel: Node 0 DMA: 2*4kB (M) 2*8kB (UM) 4*16kB (ME) 3*32kB (UE) 1*64kB (M) 4*128kB (UME) 2*256kB (ME) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 2*4096kB (M) = 15608k
Sep 07 11:03:27 dom0 kernel: Node 0 DMA32: 2300*4kB (ME) 917*8kB (UME) 261*16kB (UME) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20712kB
Sep 07 11:03:27 dom0 kernel: Node 0 Normal: 99*4kB (H) 66*8kB (H) 36*16kB (H) 18*32kB (H) 7*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2652kB
Sep 07 11:03:27 dom0 kernel: 305827 total pagecache pages
Sep 07 11:03:27 dom0 kernel: 4004 pages in swap cache
Sep 07 11:03:27 dom0 kernel: Swap cache stats: add 270892, delete 266888, find 271422/295874
Sep 07 11:03:27 dom0 kernel: Free swap = 7619168kB
Sep 07 11:03:27 dom0 kernel: Total swap = 7864316kB
Sep 07 11:03:27 dom0 kernel: 1048575 pages RAM
Sep 07 11:03:27 dom0 kernel: 0 pages HighMem/MovableOnly
Sep 07 11:03:27 dom0 kernel: 537312 pages reserved
Sep 07 11:03:27 dom0 kernel: 0 pages hwpoisoned
Sep 07 11:03:27 dom0 kernel: Out of memory: Kill process 7916 (Xorg) score 15 or sacrifice child
Sep 07 11:03:27 dom0 kernel: Killed process 7916 (Xorg) total-vm:768604kB, anon-rss:33236kB, file-rss:7332kB, shmem-rss:103828kB
Sep 07 11:03:27 dom0 kernel: do_trap: 15 callbacks suppressed
Sep 07 11:03:27 dom0 kernel: in libglib-2.0.so.0.4600.2[70e163e67000+137000]
Sep 07 11:03:27 dom0 org.freedesktop.Notifications[7955]: xfce4-notifyd: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
Sep 07 11:03:27 dom0 org.gtk.vfs.Daemon[7955]: A connection to the bus can't be made
Sep 07 11:03:27 dom0 org.a11y.atspi.Registry[8038]: XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Sep 07 11:03:27 dom0 org.a11y.atspi.Registry[8038]: after 53893 requests (53893 known processed) with 0 events remaining.
Sep 07 11:03:28 dom0 libvirtd[1786]: End of file while reading data: Input/output error
Since I rebooted the system two days ago, I'm unable to reproduce this. @rootkovska does it still happen to you?
This is still happening to me on a T450s on R3.2. Very unpleasant bug.
Happened to me today, twice.
@andrewdavidwong can you post dom0 kernel messages from that?
@marmarek: Sure. How can I access them?
dmesg in dom0, or if you rebooted, journalctl (see the various options for filtering based on time).
Also, I'm trying another workaround: change oom_score_adj for the X server, so that even if an OOM happens, it will not choose the X server to kill.
echo -100 | sudo tee /proc/`pidof X`/oom_score_adj
This needs to be done after each X server startup. I can't yet confirm if that's effective enough.
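A rough sketch of automating that under lightdm on R3.2: lightdm's display-setup-script hook runs as root right after the X server starts, but the script path, the hook availability on this lightdm version, and the drop-in location are assumptions to verify:

# /usr/local/bin/x-oom-protect (mark executable)
#!/bin/sh
# lower the OOM score of every running X server so the kernel prefers other victims
for pid in $(pidof X Xorg); do
    echo -100 > "/proc/$pid/oom_score_adj"
done

# and in /etc/lightdm/lightdm.conf (or a lightdm.conf.d drop-in), under [Seat:*]:
# display-setup-script=/usr/local/bin/x-oom-protect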
Nov 03 21:58:33 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Nov 03 21:58:33 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Nov 03 21:58:33 dom0 kernel: CPU: 0 PID: 3483 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Nov 03 21:58:33 dom0 kernel: Hardware name: LENOVO (...)
Nov 03 21:58:33 dom0 kernel: ffffc90008837500 ffffffff813e8e03 ffffc900088376d0 ffff8801008d3b00
Nov 03 21:58:33 dom0 kernel: ffffc90008837578 ffffffff81241efd 0000000000000003 0000000000000000
Nov 03 21:58:33 dom0 kernel: 0000000000000000 0000000000000000 ffff880135659b20 0000000000000001
Nov 03 21:58:33 dom0 kernel: Call Trace:
Nov 03 21:58:33 dom0 kernel: [<ffffffff813e8e03>] dump_stack+0x63/0x90
Nov 03 21:58:33 dom0 kernel: [<ffffffff81241efd>] dump_header+0x7b/0x1fd
Nov 03 21:58:33 dom0 kernel: [<ffffffff817e0baa>] ? _raw_spin_unlock_irqrestore+0x1a/0x20
Nov 03 21:58:33 dom0 kernel: [<ffffffff811b8939>] oom_kill_process+0x219/0x3e0
Nov 03 21:58:33 dom0 kernel: [<ffffffff811b8e33>] out_of_memory+0x143/0x4d0
Nov 03 21:58:33 dom0 kernel: [<ffffffff811be78a>] __alloc_pages_slowpath+0xbaa/0xbf0
Nov 03 21:58:33 dom0 kernel: [<ffffffff811bea26>] __alloc_pages_nodemask+0x256/0x2c0
Nov 03 21:58:33 dom0 kernel: [<ffffffff81213295>] alloc_pages_current+0x95/0x140
Nov 03 21:58:33 dom0 kernel: [<ffffffff811deff8>] kmalloc_order+0x18/0x40
Nov 03 21:58:33 dom0 kernel: [<ffffffff811df044>] kmalloc_order_trace+0x24/0xa0
Nov 03 21:58:33 dom0 kernel: [<ffffffff812208a3>] __kmalloc+0x1f3/0x210
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0189b9e>] ? alloc_gen8_temp_bitmaps+0x2e/0x80 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0189bb7>] alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0189c90>] gen8_alloc_va_range_3lvl+0xa0/0x9b0 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffff811b3e4b>] ? find_lock_entry+0x5b/0x120
Nov 03 21:58:33 dom0 kernel: [<ffffffff8121ecad>] ? __slab_alloc+0x2d/0x40
Nov 03 21:58:33 dom0 kernel: [<ffffffff814dd073>] ? xen_swiotlb_map_sg_attrs+0x1b3/0x2e0
Nov 03 21:58:33 dom0 kernel: [<ffffffffc018a93e>] gen8_alloc_va_range+0x23e/0x490 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc018c50d>] i915_vma_bind+0x6d/0x120 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0190a91>] __i915_vma_do_pin+0x261/0x480 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0181374>] i915_gem_execbuffer_reserve_vma.isra.42+0x144/0x1b0 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0181775>] i915_gem_execbuffer_reserve.isra.43+0x395/0x3c0 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0182ef1>] i915_gem_do_execbuffer.isra.48+0x511/0x16b0 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffff813ea72a>] ? idr_alloc+0x9a/0x110
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0184442>] i915_gem_execbuffer2+0xa2/0x1e0 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc0051b3b>] drm_ioctl+0x21b/0x4d0 [drm]
Nov 03 21:58:33 dom0 kernel: [<ffffffffc01843a0>] ? i915_gem_execbuffer+0x310/0x310 [i915]
Nov 03 21:58:33 dom0 kernel: [<ffffffff813f23d4>] ? timerqueue_del+0x24/0x70
Nov 03 21:58:33 dom0 kernel: [<ffffffff8110c44c>] ? __remove_hrtimer+0x3c/0x90
Nov 03 21:58:33 dom0 kernel: [<ffffffff8125b051>] do_vfs_ioctl+0xa1/0x5e0
Nov 03 21:58:33 dom0 kernel: [<ffffffff8110dcd2>] ? do_setitimer+0x142/0x230
Nov 03 21:58:33 dom0 kernel: [<ffffffff81147a0b>] ? __audit_syscall_entry+0xab/0xf0
Nov 03 21:58:33 dom0 kernel: [<ffffffff8100346d>] ? syscall_trace_enter+0x1bd/0x2d0
Nov 03 21:58:33 dom0 kernel: [<ffffffff8125b609>] SyS_ioctl+0x79/0x90
Nov 03 21:58:33 dom0 kernel: [<ffffffff81003c17>] do_syscall_64+0x67/0x180
Nov 03 21:58:33 dom0 kernel: [<ffffffff817e10eb>] entry_SYSCALL64_slow_path+0x25/0x25
Nov 03 21:58:33 dom0 kernel: Mem-Info:
Nov 03 21:58:33 dom0 kernel: active_anon:256076 inactive_anon:93796 isolated_anon:0
active_file:335245 inactive_file:126155 isolated_file:0
unevictable:170 dirty:188 writeback:0 unstable:0
slab_reclaimable:20140 slab_unreclaimable:31779
mapped:65136 shmem:145085 pagetables:14847 bounce:0
free:9657 free_pcp:312 free_cma:0
Nov 03 21:58:33 dom0 kernel: Node 0 active_anon:1024304kB inactive_anon:375184kB active_file:1340980kB inactive_file:504620kB unevictable:680kB isolated(anon):0kB isolated(file):0kB mapped:260544kB dirty:752kB writeback:0kB shmem:580340kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? no
Nov 03 21:58:33 dom0 kernel: Node 0 DMA free:15600kB min:28kB low:40kB high:52kB active_anon:68kB inactive_anon:0kB active_file:8kB inactive_file:156kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Nov 03 21:58:33 dom0 kernel: lowmem_reserve[]: 0 2939 3896 3896
Nov 03 21:58:33 dom0 kernel: Node 0 DMA32 free:20104kB min:6012kB low:9020kB high:12028kB active_anon:644904kB inactive_anon:302184kB active_file:1136472kB inactive_file:430896kB unevictable:512kB writepending:536kB present:3131348kB managed:3032808kB mlocked:512kB slab_reclaimable:55488kB slab_unreclaimable:67544kB kernel_stack:7536kB pagetables:49152kB bounce:0kB free_pcp:864kB local_pcp:388kB free_cma:0kB
Nov 03 21:58:33 dom0 kernel: lowmem_reserve[]: 0 0 956 956
Nov 03 21:58:33 dom0 kernel: Node 0 Normal free:2924kB min:1956kB low:2932kB high:3908kB active_anon:379332kB inactive_anon:73000kB active_file:204500kB inactive_file:73572kB unevictable:168kB writepending:40kB present:1046968kB managed:979848kB mlocked:168kB slab_reclaimable:25024kB slab_unreclaimable:59552kB kernel_stack:6064kB pagetables:10236kB bounce:0kB free_pcp:384kB local_pcp:20kB free_cma:0kB
Nov 03 21:58:33 dom0 kernel: lowmem_reserve[]: 0 0 0 0
Nov 03 21:58:33 dom0 kernel: Node 0 DMA: 6*4kB (UM) 1*8kB (U) 3*16kB (UMH) 3*32kB (UH) 3*64kB (UH) 1*128kB (U) 3*256kB (UMH) 2*512kB (MH) 3*1024kB (UMH) 1*2048kB (E) 2*4096kB (M) = 15600kB
Nov 03 21:58:33 dom0 kernel: Node 0 DMA32: 1060*4kB (MEH) 975*8kB (UMEH) 462*16kB (UMEH) 15*32kB (H) 2*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20040kB
Nov 03 21:58:33 dom0 kernel: Node 0 Normal: 523*4kB (UMEH) 38*8kB (UMEH) 16*16kB (UH) 6*32kB (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2844kB
Nov 03 21:58:33 dom0 kernel: 607006 total pagecache pages
Nov 03 21:58:33 dom0 kernel: 512 pages in swap cache
Nov 03 21:58:33 dom0 kernel: Swap cache stats: add 14293, delete 13781, find 151909/152125
Nov 03 21:58:33 dom0 kernel: Free swap = 10429640kB
Nov 03 21:58:33 dom0 kernel: Total swap = 10485756kB
Nov 03 21:58:33 dom0 kernel: 1048575 pages RAM
Nov 03 21:58:33 dom0 kernel: 0 pages HighMem/MovableOnly
Nov 03 21:58:33 dom0 kernel: 41436 pages reserved
Nov 03 21:58:33 dom0 kernel: 0 pages hwpoisoned
Nov 03 21:58:33 dom0 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Nov 03 21:58:33 dom0 kernel: [ 1010] 0 1010 19717 5357 40 3 38 0 systemd-journal
(...)
Nov 03 21:58:33 dom0 kernel: Out of memory: Kill process 3483 (Xorg) score 15 or sacrifice child
Nov 03 21:58:33 dom0 kernel: Killed process 3483 (Xorg) total-vm:594472kB, anon-rss:57668kB, file-rss:8092kB, shmem-rss:167796kB
FWIW, I have had OOM crashes of dom0 about three times or so, but only very seldom (not more than once per month). The system had always been running for a longer time and there was quite some memory pressure overall.
Do you know when you observed this for the first time?
The logs are interesting. This seems not to be a classic OOM situation: there is free swap space and dom0 still has about 4 GiB of total memory. Compared to the total memory, Xorg also doesn't use that much.

And indeed all logs say __GFP_COMP, and the backtrace is somewhere in the kernel (i915 or xen-blkback). So I guess the kernel runs out of contiguous memory (according to this article, __GFP_COMP allocations are always contiguous).
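If the order-3 theory is right, one way to watch for it in dom0 is /proc/buddyinfo, which shows how many free blocks of each order are left per memory zone (order 3 = 8 contiguous pages = 32 KiB):

cat /proc/buddyinfo
# the columns after the zone name count free order-0, order-1, ... blocks;
# zeros from the order-3 column onwards in DMA32/Normal mean no 32 KiB blocks are left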
Regarding further debugging: it's probably a good idea to enable more logging in qmemman so we can see when it triggers ballooning. So, for example, create /etc/systemd/system/qubes-qmemman.service.d/debug.conf with:
[Service]
ExecStart=
ExecStart=/usr/lib/qubes/qmemman_daemon.py -d
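then reload systemd and restart the service so the override takes effect (standard systemd handling, nothing Qubes-specific):

sudo systemctl daemon-reload
sudo systemctl restart qubes-qmemman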
If that's about contiguous memory, and in my case about the xen-blkback module, does it make sense to limit the max_ring_page_order parameter? It would have performance implications though...
If that's about contiguous memory, and in my case about the xen-blkback module, does it make sense to limit the max_ring_page_order parameter? It would have performance implications though...

Not sure if this will help, since it seems kmalloc() uses compound pages for every "bigger" (with our config, > 8192 B) allocation, so I guess there are many places using this.
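For anyone who wants to experiment anyway, a hedged sketch of limiting the blkback ring from dom0 via a modprobe.d drop-in; parameter names and availability depend on the exact kernel version, so check modinfo first:

modinfo xen_blkback | grep parm
echo 'options xen_blkback max_ring_page_order=0' | sudo tee /etc/modprobe.d/xen-blkback.conf
# on kernels with multiqueue blkback, a max_queues parameter may be lowered the same way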
The reason not to override /var/run/shm.id at X server startup is that it may belong to another (still running!) X server instance.
What about checking if another X server is running at startup?
What about checking if another X server is running at startup?
Any idea how to do that in an easy way? From what I know, the X server does not have a PID file, so it would require scraping over /proc or something like that. Also remember that the code is running in the constructor of an LD_PRELOADed library, so there is very limited info about the current X server instance.
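For what it's worth, a sketch of such a check based on the X server's /tmp/.X<display>-lock files, which record the PID of the server owning that display; whether this is reliable enough from an LD_PRELOADed constructor is an open question:

for lock in /tmp/.X*-lock; do
    [ -e "$lock" ] || continue
    pid=$(tr -d ' ' < "$lock")
    # if the PID recorded in the lock file is still alive, another X server owns that display
    kill -0 "$pid" 2>/dev/null && echo "X server for $lock still running (pid $pid)"
done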
Adjusting oom_score_adj for the X server made the kernel kill qfile-daemon-dvm and systemd-journald instead (the problem happened when starting a DispVM). Much less severe consequences :)
Here is full log of the incident: https://gist.github.com/marmarek/1cea76a2649c18b12187b09cc6ef7c7e
Just happened again; this time it killed a bunch of xfce4-related processes, including xfdesktop, the xfce4-panel children (all panel plugins), the panel itself, and xfwm4. This means it required a lot of memory. The backtrace is the same. If another log could bring something new, just say so. Maybe, if really that much dom0 memory is needed at VM startup, we should calculate it and make sure that dom0 has that much free memory before starting a VM? @HW42

Or maybe this is also about kernel config? Something about how memory is allocated (a preference for contiguous areas? how much memory is kept free (swappiness?)).
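As a rough illustration of the "make sure dom0 has enough free memory first" idea, a hypothetical wrapper (not part of Qubes; the 300 MiB threshold is a guess, not a measured requirement):

#!/bin/sh
# start-vm-if-mem-ok <vmname> - refuse to start a VM when dom0 is low on memory
min_kb=$((300 * 1024))
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
avail_kb=${avail_kb:-0}
if [ "$avail_kb" -lt "$min_kb" ]; then
    echo "dom0 has only ${avail_kb} kB available, refusing to start $1" >&2
    exit 1
fi
exec qvm-start "$1"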
Just happened again; this time it killed a bunch of xfce4-related processes, including xfdesktop, the xfce4-panel children (all panel plugins), the panel itself, and xfwm4. This means it required a lot of memory. The backtrace is the same. If another log could bring something new, just say so.

Or maybe the killed processes didn't use enough contiguous memory. But not sure if that's realistic.

Maybe, if really that much dom0 memory is needed at VM startup, we should calculate it and make sure that dom0 has that much free memory before starting a VM? @HW42

Does this only happen when you start a new VM?

Or maybe this is also about kernel config? Something about how memory is allocated (a preference for contiguous areas? how much memory is kept free (swappiness?)).

I guess this is more likely, especially since this problem appeared recent-ish. Maybe the dom0 Linux 4.4 -> 4.9 upgrade? Do we know when this was observed for the first time?
On Wed, Nov 15, 2017 at 06:59:39AM +0000, HW42 wrote:
Maybe, if really that much dom0 memory is needed at VM startup, we should calculate it and make sure that dom0 has that much free memory before starting a VM? @HW42

Does this only happen when you start a new VM?

Yes, at least for me.

Or maybe this is also about kernel config? Something about how memory is allocated (a preference for contiguous areas? how much memory is kept free (swappiness?)).

I guess this is more likely, especially since this problem appeared recent-ish. Maybe the dom0 Linux 4.4 -> 4.9 upgrade? Do we know when this was observed for the first time?

I'm almost sure it started after the upgrade to 4.9 in dom0. But it might also have started happening after a VM kernel upgrade - I did that at a very similar time.
(Possible) qubes-users thread on this bug: https://groups.google.com/d/topic/qubes-users/h070Y5NXJFo/discussion
I'm very likely also affected by this one on 4.0rc2.
I can reproduce it this way: xscreensaver-command -lock (assuming the daemon is running). This happens 1-2 times per day on my T530 with 16 GB RAM, if I use it. I more rarely get the crashes mentioned in the qubes-users thread (not 100% sure whether it's the same issue).
4.9.45-21 dom0 Kernel.
Side note: free -mh in dom0 only shows 4 GB RAM, but I guess that's because Xen allocated the remaining 12 GB to the VMs?
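For reference, free in dom0 only sees the memory currently assigned to dom0 by the balloon driver; the Xen-level view can be checked with, for example:

xl info | grep -E 'total_memory|free_memory'
xl list    # per-domain memory assignments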
P.S.: I didn't find anything interesting in sudo journalctl -b0 so far, apart from maybe the following, which occurs every few hours:
Nov 29 18:32:16 dom0 qmemman.systemstate[2054]: Xen free = 1667727686 too small for satisfy assignments! assigned_but_unused=1680891536, domdict={'13': {'id': '13', 'last_target': 150994944, 'memory_current': 150994944, 'mem_used': None, 'memory_actual': 150994944, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 150994944}, '14': {'id': '14', 'last_target': 928131916, 'memory_current': 945352704, 'mem_used': 332644352, 'memory_actual': 945352704, 'no_progress': True, 'slow_memset_react': False, 'memory_maximum': 4194304000}, '3': {'id': '3', 'last_target': 1108653483, 'memory_current': 1070624768, 'mem_used': 269041664, 'memory_actual': 1070625731, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 4194304000}, '1': {'id': '1', 'last_target': 419430400, 'memory_current': 417189888, 'mem_used': None, 'memory_actual': 419430400, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 419430400}, '0': {'id': '0', 'last_target': 5772307649, 'memory_current': 4172697600, 'mem_used': 1322520576, 'memory_actual': 5848884991, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 17100177408}, '4': {'id': '4', 'last_target': 1102621732, 'memory_current': 1064734720, 'mem_used': 267378688, 'memory_actual': 1064737563, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 4194304000}, '5': {'id': '5', 'last_target': 1048576000, 'memory_current': 1048334336, 'mem_used': 310456320, 'memory_actual': 1048576000, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 1048576000}, '15': {'id': '15', 'last_target': 150994944, 'memory_current': 150994944, 'mem_used': None, 'memory_actual': 150994944, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 150994944}, '2': {'id': '2', 'last_target': 419430400, 'memory_current': 416546816, 'mem_used': 204242944, 'memory_actual': 419430400, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 419430400}, '9': {'id': '9', 'last_target': 150994944, 'memory_current': 150994944, 'mem_used': None, 'memory_actual': 150994944, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 150994944}, '11': {'id': '11', 'last_target': 150994944, 'memory_current': 150994944, 'mem_used': None, 'memory_actual': 150994944, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 150994944}, '6': {'id': '6', 'last_target': 150994944, 'memory_current': 150994944, 'mem_used': None, 'memory_actual': 150994944, 'no_progress': False, 'slow_memset_react': False, 'memory_maximum': 150994944}, '8': {'id': '8', 'last_target': 1511855526, 'memory_current': 1529077760, 'mem_used': 349790208, 'memory_actual': 1529077760, 'no_progress': True, 'slow_memset_react': False, 'memory_maximum': 4194304000}, '12': {'id': '12', 'last_target': 2009606869, 'memory_current': 2026827776, 'mem_used': 504201216, 'memory_actual': 2026827776, 'no_progress': True, 'slow_memset_react': False, 'memory_maximum': 4194304000}, '10': {'id': '10', 'last_target': 1479561344, 'memory_current': 1496784896, 'mem_used': 341782528, 'memory_actual': 1496784896, 'no_progress': True, 'slow_memset_react': False, 'memory_maximum': 4194304000}}
In particular I didn't see any of the oom-killer entries from above.
I reproduced the OOM killer on my first attempt, but failed to reproduce it on my second attempt... I trigger the issue by starting a new ProxyVM (memory balancing enabled) even though dom0 has already reported insufficient memory for that VM. It occurred while system memory was almost exhausted and a Windows 7 HVM (memory balancing disabled, 4 GiB attached) was updating.

Both Xorg and xenwatch invoked the OOM killer in this case, but the first one was xenwatch. Below is a short summary of my first attempt, with the OOM killer activated nine times in a row.
Feb 12 16:45:30 dom0 kernel: xenwatch invoked oom-killer: gfp_mask=0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:30 dom0 kernel: xenwatch cpuset=/ mems_allowed=0
Feb 12 16:45:30 dom0 kernel: CPU: 6 PID: 73 Comm: xenwatch Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:30 dom0 kernel: ffffc9000130b948 ffffffff813e8e03 ffffc9000130bb18 ffff88009dc78000
Feb 12 16:45:30 dom0 kernel: ffffc9000130b9c0 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:30 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000001
Feb 12 16:45:30 dom0 kernel: Call Trace:
Feb 12 16:45:30 dom0 kernel: [
Feb 12 16:45:33 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:33 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:33 dom0 kernel: CPU: 6 PID: 12008 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:33 dom0 kernel: ffffc90008ee7500 ffffffff813e8e03 ffffc90008ee76d0 ffff880002911e80
Feb 12 16:45:33 dom0 kernel: ffffc90008ee7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:33 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000002
Feb 12 16:45:33 dom0 kernel: Call Trace:
Feb 12 16:45:33 dom0 kernel: [
Feb 12 16:45:33 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:33 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:33 dom0 kernel: CPU: 4 PID: 12008 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:33 dom0 kernel: ffffc90008ee7500 ffffffff813e8e03 ffffc90008ee76d0 ffff8801359c5b80
Feb 12 16:45:33 dom0 kernel: ffffc90008ee7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:33 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000003
Feb 12 16:45:33 dom0 kernel: Call Trace:
Feb 12 16:45:33 dom0 kernel: [
Feb 12 16:45:46 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:46 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:46 dom0 kernel: CPU: 2 PID: 12116 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:46 dom0 kernel: ffffc900015d7500 ffffffff813e8e03 ffffc900015d76d0 ffff880080a2db80
Feb 12 16:45:46 dom0 kernel: ffffc900015d7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:46 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000001
Feb 12 16:45:46 dom0 kernel: Call Trace:
Feb 12 16:45:46 dom0 kernel: [
Feb 12 16:45:47 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:47 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:47 dom0 kernel: CPU: 0 PID: 12116 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:47 dom0 kernel: ffffc900015d7500 ffffffff813e8e03 ffffc900015d76d0 ffff880080a2db80
Feb 12 16:45:47 dom0 kernel: ffffc900015d7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:47 dom0 kernel: 0000138f262a0e5e 00000000002354d9 0000000000000000 0000000000000002
Feb 12 16:45:47 dom0 kernel: Call Trace:
Feb 12 16:45:47 dom0 kernel: [
Feb 12 16:45:47 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:47 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:47 dom0 kernel: CPU: 6 PID: 12116 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:47 dom0 kernel: ffffc900015d7500 ffffffff813e8e03 ffffc900015d76d0 ffff8800b6020000
Feb 12 16:45:47 dom0 kernel: ffffc900015d7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:47 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000003
Feb 12 16:45:47 dom0 kernel: Call Trace:
Feb 12 16:45:47 dom0 kernel: [
Feb 12 16:45:48 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:48 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:48 dom0 kernel: CPU: 0 PID: 12116 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:48 dom0 kernel: ffffc900015d7500 ffffffff813e8e03 ffffc900015d76d0 ffff8800af28bd00
Feb 12 16:45:48 dom0 kernel: ffffc900015d7578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:48 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000004
Feb 12 16:45:48 dom0 kernel: Call Trace:
Feb 12 16:45:48 dom0 kernel: [
Feb 12 16:45:55 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:55 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:55 dom0 kernel: CPU: 0 PID: 12363 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:55 dom0 kernel: ffffc90003dbb500 ffffffff813e8e03 ffffc90003dbb6d0 ffff8801343a9e80
Feb 12 16:45:55 dom0 kernel: ffffc90003dbb578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:55 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000001
Feb 12 16:45:55 dom0 kernel: Call Trace:
Feb 12 16:45:55 dom0 kernel: [
Feb 12 16:45:55 dom0 kernel: Xorg invoked oom-killer: gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), nodemask=0, order=3, oom_score_adj=0
Feb 12 16:45:55 dom0 kernel: Xorg cpuset=/ mems_allowed=0
Feb 12 16:45:55 dom0 kernel: CPU: 4 PID: 12363 Comm: Xorg Tainted: G U 4.9.56-21.pvops.qubes.x86_64 #1
Feb 12 16:45:55 dom0 kernel: ffffc90003dbb500 ffffffff813e8e03 ffffc90003dbb6d0 ffff8800af289e80
Feb 12 16:45:55 dom0 kernel: ffffc90003dbb578 ffffffff81241efd 0000000000000003 0000000000000000
Feb 12 16:45:55 dom0 kernel: 0000000000000000 0000000000000000 ffff880134799b20 0000000000000002
Feb 12 16:45:55 dom0 kernel: Call Trace:
Feb 12 16:45:55 dom0 kernel: [
Have you tried the 4.14 kernel in dom0? It's in the testing repository.
@marmarek, I will test the 4.14 kernel after one more attempt to reproduce on the current kernel.
Happened again with the 4.9.56-21 kernel, also while starting a new VM. I'm now testing the 4.14.18-1 kernel from now on.

BTW (off-topic): why doesn't this new kernel show up in Qubes Global Settings and in the settings of each qube? They report that I'm still using 4.9.56-21, but uname -r reports that I'm using the new kernel. Also, the login screen seems to crash while the system is starting, though I can log in...
I noticed that my X server mostly crashes after a while when I use my docking station. So possibly a different issue.
I haven't experienced the OOM killer so far when using the 4.14.18-1 kernel. But I've suffered a sudden stop of my USB mouse (attached to sys-usb) while dom0 was refusing to give back memory. To be specific, I think the GUI stopped at that moment? I can hear sound coming from my VM, though.
Today I experienced that again, but it got worse. The entire system suddenly froze and it seemed impossible to recover without a reboot. It happened at exactly 17:07:18 while starting up a VM with 4096 MB of memory assigned. I could not issue any command or even reach a tty (using a PS/2 keyboard), the mouse had stopped, and the screen had frozen. I manually rebooted after approximately one minute.

After the reboot, I found nothing in journalctl between this event and the reboot. But I don't think the domU died, because there was some network traffic on the modem. BTW, here is some journalctl log showing that everything looked okay right before it happened.
Mar 08 17:07:17 dom0 kernel: audit: type=1100 audit(1520496437.757:35355): pid=15491 uid=1000 auid=1000 ses=1 msg='op=PAM:authentication grantors=pam_localuser acct="root" exe="/usr/sbin/userhelper" hostname=? addr=? terminal=? res=success'
Mar 08 17:07:17 dom0 kernel: audit: type=1101 audit(1520496437.757:35356): pid=15491 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_permit acct="root" exe="/usr/sbin/userhelper" hostname=? addr=? terminal=? res=success'
Mar 08 17:07:17 dom0 userhelper[15491]: running '/usr/sbin/hddtemp -n -q /dev/sda' with root privileges on behalf of 'singingstar'
Mar 08 17:07:18 dom0 audit[15493]: USER_AUTH pid=15493 uid=1000 auid=1000 ses=1 msg='op=PAM:authentication grantors=pam_localuser acct="root" exe="/usr/sbin/userhelper" hostname=? addr=? terminal=? res=success'
Mar 08 17:07:18 dom0 audit[15493]: USER_ACCT pid=15493 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_permit acct="root" exe="/usr/sbin/userhelper" hostname=? addr=? terminal=? res=success'
Mar 08 17:07:18 dom0 userhelper[15493]: running '/usr/sbin/hddtemp -n -q /dev/sda' with root privileges on behalf of 'singingstar'
-- Reboot --
Mar 08 17:10:12 dom0 systemd-journal[248]: Runtime journal is using 8.0M (max allowed 196.3M, trying to leave 294.4M free of 1.9G available → current limit 196.3M).
Mar 08 17:10:12 dom0 systemd-journal[248]: Runtime journal is using 8.0M (max allowed 196.3M, trying to leave 294.4M free of 1.9G available → current limit 196.3M).
Mar 08 17:10:12 dom0 kernel: Linux version 4.14.18-1.pvops.qubes.x86_64 (user@) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC)) #1 SMP Thu Feb 8 19:37:36 UTC 2018
Note that there was some dependency problem while installing the new Qubes kernel.
@HW42 I'm pretty sure it's the block backend that takes whatever type of memory is "missing". Not exactly sure if that is the block backend's bug; it may be that the kernel should keep some bigger reserves or something. This may also be related to using multiple queues (new kernels use 4 per device, I think older ones use 1 or 2), so the newer kernel simply takes more memory.
This is already fixed in Qubes 4.0. But the fix (or actually, the update procedure) is too fragile to backport to 3.2, because it involves renaming /var/run/shm.id, which is referenced from various places.
@marmarek, is this the same case as the OOM killer issue? I'm wondering: if it is a different case, how could it impact 4.0 (though you've already implemented a solution in 4.0)? BTW, is there anything I can help you with?
The oom-killer issue is separate from the one referred to above (restarting the X server and getting applications back). The oom-killer one is still an issue, with no known solution yet.
@marmarek, I think they happened under similar circumstances. They occurred when system RAM was almost exhausted, the OS had been powered on for at least an hour, and a VM with a lot of RAM (>4096 MB) attached was being started. The last factor is not mandatory, since the OOM killer has also happened without opening a new VM, but the first one is. So, could increasing the system's physical RAM mitigate the problem? If so, adding RAM would increase system stability.

Well, some of these incidents resolve after a while, but the one that happened on Mar 8 forced me to manually reboot the system to recover. Apparently no oom-killer so far after changing the kernel.

I'll test 4.0 when a stable release comes out. All system resources would be the same as on 3.2.
This hasn't happened to me for over a month. Looks like some kernel update has fixed it.
I just had another one of these:
Jun 20 21:18:28 dom0 kernel: Out of memory: Kill process 3388 (Xorg) score 9 or sacrifice child
Jun 20 21:18:28 dom0 kernel: Killed process 3388 (Xorg) total-vm:457280kB, anon-rss:37592kB, file-rss:6692kB, shmem-rss:99764kB
Jun 20 21:18:28 dom0 org.freedesktop.Notifications[3449]: xfce4-notifyd: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
Jun 20 21:18:28 dom0 kernel: do_trap: 27 callbacks suppressed
Jun 20 21:18:28 dom0 kernel: traps: xss-lock[3515] trap int3 ip:773dfa28d86b sp:7ffc850b08d0 error:0
Jun 20 21:18:28 dom0 kernel: in libglib-2.0.so.0.4600.2[773dfa23d000+137000]
Jun 20 21:18:28 dom0 kernel: audit: type=1701 audit(1529547508.337:18574): auid=1000 uid=1000 gid=1001 ses=1 pid=3515 comm="xss-lock" exe="/usr/bin/xss-lock" sig=5
Jun 20 21:18:28 dom0 audit[3515]: ANOM_ABEND auid=1000 uid=1000 gid=1001 ses=1 pid=3515 comm="xss-lock" exe="/usr/bin/xss-lock" sig=5
Jun 20 21:18:28 dom0 org.a11y.atspi.Registry[3520]: XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Jun 20 21:18:28 dom0 org.a11y.atspi.Registry[3520]: after 293257 requests (293257 known processed) with 0 events remaining.
Jun 20 21:18:28 dom0 kernel: oom_reaper: reaped process 3388 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:99848kB
Jun 20 21:18:28 dom0 libvirtd[1651]: End of file while reading data: Input/output error
Jun 20 21:18:28 dom0 libvirtd[1651]: End of file while reading data: Input/output error
Jun 20 21:18:28 dom0 polkitd[3396]: Unregistered Authentication Agent for unix-session:1 (system bus name :1.16, object path /org/gnome/PolicyKit1/AuthenticationAgent, locale en_US.utf8) (di
Jun 20 21:18:28 dom0 org.gtk.vfs.Daemon[3449]: A connection to the bus can't be made
Jun 20 21:18:28 dom0 audit[3431]: USER_LOGOUT pid=3431 uid=0 auid=1000 ses=1 msg='op=logout id=1000 exe="/usr/sbin/lightdm" hostname=? addr=? terminal=/dev/tty1 res=success'
Jun 20 21:18:28 dom0 kernel: audit: type=1113 audit(1529547508.414:18575): pid=3431 uid=0 auid=1000 ses=1 msg='op=logout id=1000 exe="/usr/sbin/lightdm" hostname=? addr=? terminal=/dev/tty1
Jun 20 21:18:28 dom0 lightdm[3431]: pam_unix(lightdm:session): session closed for user user
Jun 20 21:18:28 dom0 audit[3431]: USER_END pid=3431 uid=0 auid=1000 ses=1 msg='op=PAM:session_close grantors=pam_selinux,pam_loginuid,pam_selinux,pam_keyinit,pam_namespace,pam_keyinit,pam_li
Jun 20 21:18:28 dom0 kernel: audit: type=1106 audit(1529547508.440:18576): pid=3431 uid=0 auid=1000 ses=1 msg='op=PAM:session_close grantors=pam_selinux,pam_loginuid,pam_selinux,pam_keyinit,
Jun 20 21:18:28 dom0 audit[3431]: CRED_DISP pid=3431 uid=0 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_unix acct="user" exe="/usr/sbin/lightdm" hostname=? addr=? terminal=:0 res=success'
Jun 20 21:18:28 dom0 kernel: audit: type=1104 audit(1529547508.441:18577): pid=3431 uid=0 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_unix acct="user" exe="/usr/sbin/lightdm" hostname=?
Jun 20 21:18:29 dom0 audit[22346]: CRED_ACQ pid=22346 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_env,pam_permit acct="lightdm" exe="/usr/sbin/lightdm" hostname=? a
Jun 20 21:18:29 dom0 kernel: audit: type=1103 audit(1529547509.749:18578): pid=22346 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_env,pam_permit acct="lightdm" exe="
Jun 20 21:18:29 dom0 lightdm[22346]: pam_unix(lightdm-greeter:session): session opened for user lightdm by (uid=0)
Jun 20 21:18:29 dom0 systemd[1]: Started Session c1210 of user lightdm.
Jun 20 21:18:29 dom0 systemd-logind[1629]: New session c1210 of user lightdm.
Jun 20 21:18:29 dom0 systemd[1]: Starting Session c1210 of user lightdm.
Jun 20 21:18:29 dom0 kernel: audit: type=1105 audit(1529547509.777:18579): pid=22346 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_open grantors=pam_unix,pam_systemd acct="lightdm
Jun 20 21:18:29 dom0 audit[22346]: USER_START pid=22346 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_open grantors=pam_unix,pam_systemd acct="lightdm" exe="/usr/sbin/lightdm" hos
Jun 20 21:18:30 dom0 systemd-coredump[22339]: Process 3515 (xss-lock) of user 1000 dumped core.
Stack trace of thread 3515:
#0 0x0000773dfa28d86b g_logv (libglib-2.0.so.0)
#1 0x0000773dfa28d9df g_log (libglib-2.0.so.0)
#2 0x00005642c449788d screensaver_event_cb (xss-lock)
#3 0x00005642c44979d7 xcb_event_dispatch (xss-lock)
#4 0x0000773dfa286e5a g_main_context_dispatch (libglib-2.0.so.0)
#5 0x0000773dfa2871f0 g_main_context_iterate.isra.29 (libglib-2.0.so.0)
#6 0x0000773dfa287512 g_main_loop_run (libglib-2.0.so.0)
#7 0x00005642c4496c5e main (xss-lock)
#8 0x0000773df9871580 __libc_start_main (libc.so.6)
#9 0x00005642c4496d69 _start (xss-lock)
Stack trace of thread 3522:
#0 0x0000773df9947b7d poll (libc.so.6)
#1 0x0000773dfa28718c g_main_context_iterate.isra.29 (libglib-2.0.so.0)
#2 0x0000773dfa28729c g_main_context_iteration (libglib-2.0.so.0)
#3 0x0000773dfa2872d9 glib_worker_main (libglib-2.0.so.0)
#4 0x0000773dfa2ad8e5 g_thread_proxy (libglib-2.0.so.0)
#5 0x0000773df8bda61a start_thread (libpthread.so.0)
#6 0x0000773df99535fd __clone (libc.so.6)
Stack trace of thread 3528:
#0 0x0000773df9947b7d poll (libc.so.6)
#1 0x0000773dfa28718c g_main_context_iterate.isra.29 (libglib-2.0.so.0)
#2 0x0000773dfa287512 g_main_loop_run (libglib-2.0.so.0)
#3 0x0000773dfa8a8516 gdbus_shared_thread_func (libgio-2.0.so.0)
#4 0x0000773dfa2ad8e5 g_thread_proxy (libglib-2.0.so.0)
#5 0x0000773df8bda61a start_thread (libpthread.so.0)
#6 0x0000773df99535fd __clone (libc.so.6)
Jun 20 21:18:31 dom0 systemd-logind[1629]: Removed session c1.
Having this issue on Linux Mint as well, kernel 4.18.0-13. Relevant errors:
gnome-terminal-[6663]: gnome-terminal-server: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
org.a11y.Bus[1458]: XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
dbus-daemon[10573]: [session uid=1000 pid=10571] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
This may be a fix: https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg00273.html
FWIW, I have not observed this on my system for a long time. Not sure if anything has changed, or if I've just been lucky.
I haven't seen it either. The above-linked patch is scheduled for Linux 5.2, so it isn't the change that made it go away. But there are quite a lot of fixes in stable Linux kernels, and probably some of them fixed it. Let's close it for now; if it happens again we can always reopen.