QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

`deferring g.e. 0x... (pfn 0x...)` messages flooding all VMs #7359

Open SaswatPadhi opened 2 years ago

SaswatPadhi commented 2 years ago

Qubes OS release

R4.1

Brief summary

After the recent Qubes updates (to dom0 and all TemplateVMs), I see a lot of deferring g.e. 0x... (pfn 0x...) messages in dmesg. I'm not sure whether these messages are related to the fact that my Qubes system becomes unusable after a few (~4) hours: everything works fine for a while after a restart, but Qubes ultimately slows down to the point where it's no longer usable.

Steps to reproduce

  1. Update qubes-* packages in dom0 and a TemplateVM to latest stable versions
  2. Restart the system
  3. Run dmesg in dom0 and any AppVM (see the quick check below)
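
A quick way to spot the flood from a shell (a rough sketch; sudo may not be needed in dom0):

sudo dmesg | grep -c 'deferring g.e.'   # count the messages logged so far
sudo dmesg -w | grep 'deferring g.e.'   # watch new ones arrive live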

Expected behavior

System runs as before, no Xen warnings flooding dmesg

Actual behavior

Multiple deferring g.e. 0x... (pfn 0x...) messages per second in dom0 and all AppVMs

noskb commented 2 years ago

Same here, except no slowdown occurs, and I also saw the following flood of messages in an AppVM:

[130669.067576] xen:grant_table: g.e. 0x679a still pending

DemiMarie commented 2 years ago

Multiple deferring g.e. 0x... (pfn 0x...) messages per second in dom0 and all AppVMs

This looks like it is related to the recent frontend security fixes. I suspect the backend is not releasing grant references quickly enough.

rustybird commented 2 years ago

Running dmesg -w in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. This also consumes more and more memory, and eventually the VM slows down to a crawl as it starts swapping.

Latest current-testing everywhere, but IIRC it's been this way since I switched to R4.1 a couple of weeks ago. (Back then I hadn't yet tried to really trigger it in the worst way by resizing windows.)
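
If it helps to correlate the flood with the memory growth, a rough way to watch memory inside the affected VM while resizing windows (a sketch, assuming watch is available in the VM):

watch -n 1 'grep -E "MemAvailable|SwapFree" /proc/meminfo'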

DemiMarie commented 2 years ago

Running dmesg -w in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. This also consumes more and more memory, and eventually things slow down to a crawl due to the VM swapping.

Latest current-testing everywhere, but IIRC it's been this way since I switched to R4.1 a couple of weeks ago. (Back then I hadn't yet tried to really trigger it in the worst way by resizing windows.)

This looks like a bug in the GUI daemon and/or the dom0 kernel: it needs to release the grant tables when it is done using them.

kr4t0 commented 2 years ago

Can confirm this here on an AMD Renoir system (ThinkPad P14s Gen2). These messages are found in dom0 as well as in AppVMs. I am happy to provide further debug info if needed. The problem is that log auditing gets really annoying in Qubes because of this.

3hhh commented 2 years ago

I can confirm this as well.

noptys commented 2 years ago

This message flooding is happening in the latest R4.0 also, in dom0 and AppVMs. Currently I'm not seeing a noticeable slowdown on any R4.0 or R4.1 machines, and the message rate seems to be lower in R4.0.

SaswatPadhi commented 2 years ago

A small note: this also results in the journal occupying a lot of space. It shouldn't be a concern for AppVMs, but I just recovered several GB in my dom0 by vacuuming the journal.
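
For example (a sketch; adjust the caps to taste), in dom0:

sudo journalctl --vacuum-size=200M    # shrink the journal to roughly 200 MB
sudo journalctl --vacuum-time=2weeks  # or drop entries older than two weeks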

Tonux599 commented 2 years ago

Just to add my voice that I'm experiencing this issue also, as laid out in #7456.

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc25.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc32.qubes) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.190-1.fc25.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.112-1.fc32.qubes) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-4-19 (including package kernel-419-4.19.239-1.pvops.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.112-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.190-1.fc25.qubes) has been pushed to the r4.0 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-4-19 (including package kernel-419-4.19.245-1.pvops.qubes) has been pushed to the r4.0 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.18.9-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

Tonux599 commented 2 years ago

I fear there may have been a regression somewhere, as today I re-experienced this issue. On kernel 5.15.52 I was unable to qvm-run on any AppVM, even after restarting it, effectively locking up my system until a reboot.

Errors noted included the deferring g.e. 0x... as well as gntshr: error: ioctl failed: No space left on device.

DemiMarie commented 2 years ago

I fear there may have been a regression somewhere, as today I re-experienced this issue. On kernel 5.15.52 I was unable to qvm-run on any AppVM, even after restarting it, effectively locking up my system until a reboot.

Errors noted included the deferring g.e. 0x... as well as gntshr: error: ioctl failed: No space left on device.

Can you upgrade to a newer kernel and see if this fixes the problem?

Tonux599 commented 2 years ago

Hi @DemiMarie,

To confirm reproducibility: on kernel 5.15.52 (for both dom0 and AppVMs), running qvm-run {appvm} echo causes deferring g.e. 0x... messages in dmesg for dom0. Resizing a window in that AppVM also causes deferring g.e. 0x... messages in dmesg for the AppVM itself.

I can confirm that using kernel 5.18.9 (for both dom0 and AppVMs) makes these error messages stop. However, resizing an AppVM window then causes a trace in dmesg for dom0. Ultimately I would prefer to continue using kernel as opposed to kernel-latest.

The trace contains a reference to WARNING: CPU: 0 PID: 5123 at drivers/xen/gntdev.c:399 __unmap_grant_pages_done+0xfe/0x110 [xen_gntdev]

I can provide a full trace; however, I would want to PGP-encrypt it to you as it contains hardware identifiers. I can't find your PGP key in qubes-secpack, though. Alternatively, I know you're on Matrix, so I can send it there.

marmarek commented 2 years ago

Those deferring messages are kind of expected; only the still pending ones are worrying. It seems journald logs printk(KERN_DEBUG ...) messages just like any others, and it isn't obvious for a normal user to distinguish them from actual errors. @DemiMarie, can you change that to pr_debug, so it would be disabled by default? The leaking and still pending messages should still be logged by default.
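
For context: once the message is behind pr_debug(), it stays silent by default but can be turned back on when needed on kernels built with CONFIG_DYNAMIC_DEBUG. A rough sketch, assuming the printk lives in drivers/xen/grant-table.c and debugfs is mounted at the usual location:

echo 'file drivers/xen/grant-table.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control   # re-enable while debugging
echo 'file drivers/xen/grant-table.c -p' | sudo tee /sys/kernel/debug/dynamic_debug/control   # silence it again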

As for WARNING, that's fixed in newer kernel already (there is https://github.com/QubesOS/updates-status/issues/3030 in testing repo).

Tonux599 commented 2 years ago

Thanks @marmarek good to know.

Having a look at the logs from yesterday when the lockup happened, I also noticed the following; is this any concern?

xen:grant_table: xen/grant-table: max_grant_frames reached cur=2048 extra=1 limit=2048 gnttab_free_count=0 req_entries=1

If I recall correctly, a hotfix for this issue used to be to increase /sys/module/xen_gntalloc/parameters/limit. So I'm not sure whether that log message from yesterday is a symptom of something?
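
For anyone else chasing the max_grant_frames message, a rough sketch of how to inspect grant usage and of the old hotfix mentioned above. Assumptions: a standard Xen toolstack in dom0, and a kernel that exports the parameter read-write; the cur=2048 limit=2048 part is a hypervisor-side frame limit, raised separately via gnttab_max_frames= on the Xen command line.

sudo xl debug-keys g                            # dump per-domain grant table usage to the Xen console
sudo xl dmesg | tail -n 60                      # read the dump
cat /sys/module/xen_gntalloc/parameters/limit   # current gntalloc allocation limit
echo <new-limit> | sudo tee /sys/module/xen_gntalloc/parameters/limit   # the old hotfix, if the parameter is writable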

3hhh commented 2 years ago

I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.

unman commented 2 years ago

On Tue, Aug 30, 2022 at 08:02:43AM -0700, 3hhh wrote:

I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.

Don't see it at all with 5.15.63-1

ctr49 commented 1 year ago

Those deferring messages are kind of expected, only still pending are worrying.

@marmarek When you say they are kind of expected, why are they thrown in the first place? This is flooding kmsg to the point where it becomes unusable (e.g. for debugging). Is there anything we can do (e.g. a kernel cmdline option) to suppress them?

DemiMarie commented 1 year ago

@ctr49: they should be suppressed by default (that they are not is a bug).

ctr49 commented 1 year ago

Right, and afaict it's caused by https://github.com/QubesOS/qubes-linux-kernel/blob/master/increase-reclaim-speed.patch#L124 (a bare printk, without pr_debug/pr_warn)

see https://github.com/QubesOS/qubes-linux-kernel/pull/682

marmarek commented 1 year ago

but a lot in domU on 5.15.57-1.

This is a rather old 5.15.x kernel. The bug was fixed in >=5.15.61 already. https://github.com/QubesOS/updates-status/issues/3037 was uploaded to the stable repo over 2 months ago.

I'm closing this now as resolved. Please let me know if you can still reproduce the issue with an up-to-date kernel.

ctr49 commented 1 year ago

@marmarek I'm still seeing this on the very latest 6.0.8-1.fc32.qubes from current-testing

DemiMarie commented 1 year ago

@ctr49 What kernels are you using in guests? Does the problem happen in guests using a dom0-provided kernel?

jamke commented 1 year ago

I also have this issue on a MacBook Pro, e.g. on the 6.0.11-200.fc36 kernel.

ctr49 commented 1 year ago

I cannot observe the problem when using the dom0-provided kernel, but I do see it with domU templates using the OS-supplied kernel booted via HVM, i.e. Debian and Gentoo (e.g. 5.15.80-gentoo-dist).

DemiMarie commented 1 year ago

@ctr49 that is useful information! @jamke is this with a dom0-provided kernel or a guest-provided kernel?

jamke commented 1 year ago

@DemiMarie Also an HVM standalone with its own kernel, because in my case I needed the broadcom-wl module to be built with dkms.

P.S. Is it possible to use something like a dkms kernel module with the dom0-provided kernel?

DemiMarie commented 1 year ago

@jamke: Thanks! That is very useful information too. Together with what @ctr49 stated, I now know the problem is that one of Qubes’s kernel patches should have been upstreamed but wasn’t.

marmarek commented 1 year ago

P.S. Is it possible to use something like dkms kernel module with the dom0-provided kernel?

Yes, it should work out of the box; the dom0-provided kernel has the necessary headers included by default. If you have some issues, open a separate issue or a thread on the forum.
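
Roughly, inside the template or standalone after switching it to the dom0-provided kernel (a sketch; package names vary per distro, and broadcom-wl is just the module from the comment above):

sudo dkms status        # list registered dkms modules
sudo dkms autoinstall   # build and install them for the running kernel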

jamke commented 1 year ago

Yes, it should work out of the box; the dom0-provided kernel has the necessary headers included by default.

OK, I might try it.

If you have some issues, open a separate issue or a thread on the forum.

No, I have exactly this issue; I was just looking for a workaround.

qubesos-bot commented 1 year ago

Automated announcement from builder-github

The component linux-kernel (including package kernel-6.1.26-1.qubes.fc32) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

ctr49 commented 1 year ago

If this is an upstream Linux kernel issue (as https://github.com/QubesOS/qubes-issues/issues/7359#issuecomment-1350598282 suggested) then maybe the reference to qubes kernels should be removed, as their updates are not relevant for this issue.

qubesos-bot commented 1 year ago

Automated announcement from builder-github

The component linux-kernel (including package kernel-6.1.35-1.qubes.fc32) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update