Open dylangerdaly opened 3 years ago
kernel-latest installed?
Indeed, kernel 5.8.8-200 with Xen 4.14-3.
ah, yes. try going back to Xen 4.13.1-4 in dom0. 4.14 was terrible for me too, wrt graphics performance that is.
Aw, I can't, Ryzen 4000 series CPUs won't boot unless I'm using 4.14, I think I'll just need to wait for 4.15?
Can I rebase 4.14 to Xen's Master? I don't think Marek has made many changes to Xen
I dunno, it probably needs troubleshooting to get it working right on 4.14, I just didn't really know where to look for error reports. (To be clear, I didn't have CPU lockups, just display issues once I opened more than a few VM windows concurrently.)
Oooooooo, thanks to @0spinboson for pointing me in the right direction.
I think this has been fixed by adding processor.max_cstate=5 to CMDLINE; I'm no longer getting softlocks and it's :butter: smooth.
I'll do some more testing before I close this issue :tada:
Hm, not quite, it worked for one boot, but not subsequent boots.
It does seem to be related to a known AMD issue, because reports of the bug are everywhere.
Fedora Workstation 32 Live USB works just fine; I'll have a look at its default CMDLINE.
you mean you added it to grub bootline? Did you run grub2-mkconfig after?
Yeah, I've been testing by just hitting 'e' and inserting stuff into CMDLINE. I also added processor.max_cstate=5 to my actual grub2 config and ran mkconfig etc.; same result.
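For reference, the persistent change being described can be sketched like this; the sample GRUB line contents are hypothetical, and the legacy-boot grub.cfg path is assumed:

```shell
# On the real system the sed below would edit /etc/default/grub in place,
# followed by regenerating the config (legacy-boot path assumed):
#   sudo sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 processor.max_cstate=5"/' /etc/default/grub
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
# Harmless demo of the same edit on a sample (hypothetical) line:
echo 'GRUB_CMDLINE_LINUX="rhgb quiet"' \
    | sed 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 processor.max_cstate=5"/'
# prints GRUB_CMDLINE_LINUX="rhgb quiet processor.max_cstate=5"
```

The capture group ends at the closing quote, so the parameter is appended after whatever is already on the line rather than overwriting it.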
It looks like it could be related to IRQs: if I just leave it idling it'll be okay, but if I start doing things it'll soft-lock.
It looks like AMD created a sixth C-State, setting it to 5 simply removes that C-State.
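As a quick sanity check that the cap actually took effect, the value can be parsed back out of the running kernel's command line; max_cstate_of here is a hypothetical helper, not anything from the thread:

```shell
# Extract the processor.max_cstate cap from a kernel command line string.
# On a live system, call it as: max_cstate_of "$(cat /proc/cmdline)"
max_cstate_of() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^processor\.max_cstate=//p'
}
max_cstate_of "root=/dev/sda1 processor.max_cstate=5 quiet"   # prints 5
```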
just to be clear: you added it to /etc/default/grub, then ran grub2-mkconfig -o /boot/grub2/grub.cfg (assuming you use legacy boot, idk the efi equivalent), but it doesn't persist?
Yes. Sometimes my HVMs don't come up, sometimes they do. I think this CPU has had basically no testing with Xen/virtualization.
Yeah Fedora 32 Live ISO runs like butter every time with a 5.6.6 kernel, really weird
This looks very similar to what I experienced on my Ryzen 4000 series laptop. While trying to fix wake-from-suspend (still not resolved), I discovered that the following workaround solves the lockup/stutter issue:
- Add dom0_max_vcpus=1 dom0_vcpus_pin to GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.
- Run sudo grub2-mkconfig -o /boot/efi/EFI/qubes/grub.cfg (assuming this is a Qubes 4.1 UEFI install).
- Reboot.
You absolute legend!
This has totally worked 🎉
I'll do some more testing with this today, but really smooth now.
Other appVMs are working smoothly!
Thank you 🙌
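Concretely, the workaround confirmed above amounts to the following; hedged sketch, assuming a Qubes 4.1 UEFI layout, with a hypothetical sample line standing in for the real /etc/default/grub contents:

```shell
# On the real system (as root in dom0):
#   sed -i 's/^\(GRUB_CMDLINE_XEN_DEFAULT=".*\)"/\1 dom0_max_vcpus=1 dom0_vcpus_pin"/' /etc/default/grub
#   grub2-mkconfig -o /boot/efi/EFI/qubes/grub.cfg
# Harmless demo of the same append on a sample (hypothetical) line:
echo 'GRUB_CMDLINE_XEN_DEFAULT="console=none"' \
    | sed 's/^\(GRUB_CMDLINE_XEN_DEFAULT=".*\)"/\1 dom0_max_vcpus=1 dom0_vcpus_pin"/'
# prints GRUB_CMDLINE_XEN_DEFAULT="console=none dom0_max_vcpus=1 dom0_vcpus_pin"
```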
Can it be related to (lack of) NUMA support in Qubes?
Yeah wow, Electron Applications (Element and Chromium for example) are running buttery smooth, even better than my Intel 10th Gen device running 4.1
Dare I say it's running better than 4.0.3
Can it be related to (lack of) NUMA support in Qubes?
I think so. I'm trying to understand what dom0_max_vcpus=1 is actually doing: am I limiting my cores to 1? Because it doesn't feel like I'm only using a single core.
I've never seen Qubes run this smoothly before :butter: almost feels like I'm cheating
am I limiting my cores to 1
correct. you can probably also set it to 2 though. :)
Can I request a clarification for the thread?
This setting change ("dom0_max_vcpus=1") only sets the virtual CPU limit for the dom0 VM and does not impact virtual CPUs available to domU VMs, correct?
If so, then I’d posit the likely impact of the workaround would primarily be on dom0-attached storage latency/throughput for most users (vs. faster with multiple vCPUs in dom0 with a real fix for the sleep states issue). Perhaps also high-throughput windowing (video playback from a domU VM).
Though with AMD's current CPU lineup, the hit might not even be really noticeable outside of benchmarks. :)
Oof, yeah, good point. I messed up default storage pools, so I'm migrating from varlibqubes to lvm, just via dd.
It is slooooooow. I'm not 100% sure, but I think even simply browsing via FF is doing disk IO; I get these little tiny hiccups sometimes, though this could be related to the fact Xorg is choking.
Also, I wonder if maybe the GPU is suffering as a result of dom0, and by extension the vega driver, being limited to 1 core?
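A hedged sketch of the dd migration being described; the source image path and target LVM volume name are hypothetical, so adjust both before running anything:

```shell
# Real migration (hypothetical names, run in dom0):
#   sudo ionice -c3 dd if=/var/lib/qubes/appvms/work/private.img \
#       of=/dev/mapper/qubes_dom0-vm--work--private \
#       bs=4M status=progress conv=fsync
# bs=4M keeps per-block syscall overhead low, status=progress shows
# throughput, conv=fsync flushes before dd exits, and ionice -c3 keeps
# dom0 interactive while the copy runs.
# Harmless demo of the same flags on temp files:
printf 'hello' > /tmp/dd-demo-src
dd if=/tmp/dd-demo-src of=/tmp/dd-demo-dst bs=4M status=none conv=fsync
cat /tmp/dd-demo-dst   # prints hello
```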
Marek: Can it be related to (lack of) NUMA support in Qubes?
Qubes or Xen? I was sort of hoping 4.15 would have this fixed.
Yeah playing videos isn't working at all, 1 frame per second?
I think I was saying it was smooth before because there wasn't much disk IO going on, with disk IO it appears to be choppy.
it won't really matter to the GPU as such, but do you have xorg-x11-drv-amdgpu installed? That aside, I'd suggest upping max_vcpus to 3 or 4.
Yeah just installed it, isn't really making a difference.
It's weird: if I assign 2 or 4 cores to dom0, I get much, much worse performance, basically unusable. I'm going to put it down to Xen support for AMD-based processors and hope that with 4.15 I can drop the core limit.
Although it can be manually entered in grub's runtime menu, the grub2-mkconfig command does not allow dom0_vcpus_pin and won't update the /boot config; it will say 'command not found'.
I tried with dom0_max_vcpus=2 and it ran like hot garbage btw. Setting it to '1' was much better.
Hm, that should be working, I didn't get any errors when setting it?
Even with it set, are you noticing weird little hiccups? I'm hoping Xen is working on better handling with AMD CPUs.
Not sure if I'm getting performance issues because dom0 only has 1 core, or if it's Xen.
If you give an appVM only 2 vcores it seems to be a little more stable
I resolved it by using the assignment syntax dom0_vcpus_pin=1 instead. I think it's a parsing bug.
Edit: I haven't yet pushed any 4.1 VMs hard on my system, so I don't have a clear idea of how they perform at this point.
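For anyone hitting the same parsing problem, the resulting line in /etc/default/grub would look roughly like this; any other Xen options Qubes puts on that line are elided here, so treat it as a fragment rather than a drop-in replacement:

```shell
# /etc/default/grub (fragment; other default Xen options elided)
# dom0_vcpus_pin written in assignment form to sidestep the
# grub2-mkconfig parsing quirk described above:
GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=1 dom0_vcpus_pin=1"
```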
The 5.8.12-200 kernel update results in a system lockup on boot (after KDE plasma logo appears), but the 5.8.11-200 kernel works. Here is log output from the failed boot:
Oct 02 12:42:28 dom0 kernel: BUG: kernel NULL pointer dereference, address: 00000000000003a8
Oct 02 12:42:28 dom0 kernel: #PF: supervisor read access in kernel mode
Oct 02 12:42:28 dom0 kernel: #PF: error_code(0x0000) - not-present page
Oct 02 12:42:28 dom0 kernel: PGD 0 P4D 0
Oct 02 12:42:28 dom0 kernel: Oops: 0000 [#1] SMP NOPTI
Oct 02 12:42:28 dom0 kernel: CPU: 0 PID: 3468 Comm: Xorg Tainted: G W 5.8.12-200.fc32.x86_64 #1
Oct 02 12:42:28 dom0 kernel: Hardware name: LENOVO 20UDCTO1WW/20UDCTO1WW, BIOS R1BET36W(1.05 ) 06/11/2020
Oct 02 12:42:28 dom0 kernel: RIP: e030:mmu_interval_notifier_remove+0x16/0x140
Oct 02 12:42:28 dom0 kernel: Code: c5 74 e1 48 89 e6 48 89 ef e8 c6 bb e4 ff eb a6 0f 1f 40 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 83 ec 28 4c 8b 67 38 <49> 8b 9c 24 a8 03 00 00 e8 ad 0b 89 00 4c 8d 6b 0c 4c 89 ef e8 c1
Oct 02 12:42:28 dom0 kernel: RSP: e02b:ffffc900010a7d30 EFLAGS: 00010286
Oct 02 12:42:28 dom0 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
Oct 02 12:42:28 dom0 kernel: RDX: 0000000000000001 RSI: ffffffff81b716e0 RDI: ffff88803b8cf000
Oct 02 12:42:28 dom0 kernel: RBP: ffff88803b8cf000 R08: 7fffffffffffffff R09: 0000000000000000
Oct 02 12:42:28 dom0 kernel: R10: ffff88802a1bf2a0 R11: ffff8880652c43b0 R12: 0000000000000000
Oct 02 12:42:28 dom0 kernel: R13: 00000000fffffffc R14: ffff8880467011c0 R15: ffff8880467011d0
Oct 02 12:42:28 dom0 kernel: FS: 00007f8118c92a40(0000) GS:ffff88807d000000(0000) knlGS:0000000000000000
Oct 02 12:42:28 dom0 kernel: CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 02 12:42:28 dom0 kernel: CR2: 00000000000003a8 CR3: 00000000054ea000 CR4: 0000000000040660
Oct 02 12:42:28 dom0 kernel: Call Trace:
Oct 02 12:42:28 dom0 kernel: gntdev_mmap+0x275/0x318 [xen_gntdev]
Oct 02 12:42:28 dom0 kernel: mmap_region+0x43e/0x6e0
Oct 02 12:42:28 dom0 kernel: do_mmap+0x42f/0x540
Oct 02 12:42:28 dom0 kernel: vm_mmap_pgoff+0xb0/0xf0
Oct 02 12:42:28 dom0 kernel: ksys_mmap_pgoff+0x18a/0x250
Oct 02 12:42:28 dom0 kernel: do_syscall_64+0x4d/0x90
Oct 02 12:42:28 dom0 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 02 12:42:28 dom0 kernel: RIP: 0033:0x7f811917b526
Oct 02 12:42:28 dom0 kernel: Code: 01 00 66 90 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 fd 53 89 cb 48 85 ff 74 37 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 62 5b 5d c3 0f 1f 80 00 00 00 00 48 8b 05 39
Oct 02 12:42:28 dom0 kernel: RSP: 002b:00007fff2cc0c8f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
Oct 02 12:42:28 dom0 kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f811917b526
Oct 02 12:42:28 dom0 kernel: RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
Oct 02 12:42:28 dom0 kernel: RBP: 0000000000000000 R08: 0000000000000009 R09: 0000000000000000
Oct 02 12:42:28 dom0 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fff2cc0c910
Oct 02 12:42:28 dom0 kernel: R13: 0000000000000001 R14: 0000000000000009 R15: 0000000000000001
Oct 02 12:42:28 dom0 kernel: Modules linked in: fuse snd_seq_dummy snd_hrtimer loop nf_tables nfnetlink vfat fat snd_acp3x_rn snd_soc_dmic snd_acp3x_pdm_dma snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine tps6598x roles iwlwifi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi rapl snd_hda_intel snd_intel_dspcfg snd_hda_codec cfg80211 joydev wmi_bmof snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm k10temp snd_rn_pci_acp3x sp5100_tco thinkpad_acpi i2c_piix4 snd_pci_acp3x ipmi_devintf r8169 snd_timer ucsi_acpi ipmi_msghandler typec_ucsi ledtrig_audio snd typec soundcore rfkill i2c_scmi i2c_multi_instantiate xenfs ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt mmc_block amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper rtsx_pci_sdmmc mmc_core cec drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme serio_raw rtsx_pci ccp nvme_core wmi video pinctrl_amd hid_logitech_dj xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
Oct 02 12:42:28 dom0 kernel: uinput
Oct 02 12:42:28 dom0 kernel: CR2: 00000000000003a8
Oct 02 12:42:28 dom0 kernel: ---[ end trace 099ca5886879f3a7 ]---
Oct 02 12:42:28 dom0 kernel: RIP: e030:mmu_interval_notifier_remove+0x16/0x140
Oct 02 12:42:28 dom0 kernel: Code: c5 74 e1 48 89 e6 48 89 ef e8 c6 bb e4 ff eb a6 0f 1f 40 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 83 ec 28 4c 8b 67 38 <49> 8b 9c 24 a8 03 00 00 e8 ad 0b 89 00 4c 8d 6b 0c 4c 89 ef e8 c1
Oct 02 12:42:28 dom0 kernel: RSP: e02b:ffffc900010a7d30 EFLAGS: 00010286
Oct 02 12:42:28 dom0 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
Oct 02 12:42:28 dom0 kernel: RDX: 0000000000000001 RSI: ffffffff81b716e0 RDI: ffff88803b8cf000
Oct 02 12:42:28 dom0 kernel: RBP: ffff88803b8cf000 R08: 7fffffffffffffff R09: 0000000000000000
Oct 02 12:42:28 dom0 kernel: R10: ffff88802a1bf2a0 R11: ffff8880652c43b0 R12: 0000000000000000
Oct 02 12:42:28 dom0 kernel: R13: 00000000fffffffc R14: ffff8880467011c0 R15: ffff8880467011d0
Oct 02 12:42:28 dom0 kernel: FS: 00007f8118c92a40(0000) GS:ffff88807d000000(0000) knlGS:0000000000000000
Oct 02 12:42:28 dom0 kernel: CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 02 12:42:28 dom0 kernel: CR2: 00000000000003a8 CR3: 00000000054ea000 CR4: 0000000000040660
Try adding the following kernel flags (in whichever combination makes it work): idle=nomwait amdgpu.noretry=0 amdgpu.gpu_recovery=1 iommu=pt amd_iommu=fullflush rhgb rcu_nocbs=0-15 amdgpu.dc=1. I have (almost) the same laptop and went through a lot of hoops to make it work under Linux.
@marmarek can I just add that to GRUB?
Can it be related to (lack of) NUMA support in Qubes?
Where can I look to start fixing this / supporting NUMA nodes for Qubes?
I'm still experiencing little micro-lockups when trying to watch a YouTube video, for example. This may have something to do with the Ryzen CPU sleeping when it shouldn't be; any UI animation/scrolling suffers from stuttering. My CMDLINE is:
placeholder root=/dev/mapper/qubes_dom0-root ro rd.luks.uuid=<UUID> rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_dom0/swap plymouth.ignore-serial-consoles processor.max_cstate=5 rd.driver.pre=btrfs idle=nomwait amdgpu.noretry=0 amdgpu.gpu_recovery=1 iommu=pt amd_iommu=fullflush rcu_nocbs=0-15 amdgpu.dc=1 rhgb quiet
I wonder if Xen isn't exposing MSRs that dom0 needs to set power management?
Can anyone else running this CPU confirm this little performance quirk?
@marmarek The 4.14.0-5 update was a miss. After updating, there was no way I could avoid rhythmic lurching and lagging in all the domUs. The effect only appears after a few minutes and was not quite as bad as having >1 vcpus assigned to dom0, but it's still enough for me to mark this update as "bad".
After I downgraded back to the -4 version the system runs smoothly again. BTW, I'm using the 5.8.13 kernel now, and have added the ept=exec-sp boot option to Xen.
I should also note I tried @Yethal's suggested params in various combinations, but they had no effect on the lurching with the Xen -5 update.
@dylangerdaly Since the major difference between -4 and -5 versions is the S3 handling patch, then power management does seem to be involved (even though I haven't been using suspend).
@tasket one of the changes between -4 and -5 was reverting this commit, as it breaks S3. While it is related to AMD processors, it isn't exactly obvious how it would make a difference in this case (it is focused on systems with a lot of cores, like 128 or 96). Maybe yet another side effect of this change...
Yeah, I can confirm -5 is terrible, makes it unusable. I'll revert to -4 for now and try to identify what's going on.
@tasket how are you downgrading packages?
Xen sucks. Even on -4 it's usable, but there have always been micro-lockups; I suspect because of immature support for Ryzen 4000 series CPUs.
I'll test tomorrow, but if it is related to this commit, can we not?
As Intel goes down for its long nap, AMD should start receiving first-class support.
Huh, so I assigned 1 core to an appVM: super duper smooth, UI animations etc. are all really, really smooth.
So it's :100: confirmed that Xen can't seem to handle more than 1 core on Ryzen 4000; I assume this is a CPU scheduling issue?
2 cores seems to be the sweet spot between being smooth and being usable
@dylangerdaly Sorry, I didn't notice your question. Going through my history, I downgraded with this:
$ sudo qubes-dom0-update --enablerepo=qubes*testing --action=downgrade xen-2001:4.14.0-4
$ sudo dnf downgrade xen-libs-4.14.0-4.fc32.x86_64.rpm python3-xen-4.14.0-4.fc32.x86_64.rpm xen-4.14.0-4.fc32.x86_64.rpm xen-hypervisor-4.14.0-4.fc32.x86_64.rpm xen-libs-4.14.0-4.fc32.x86_64.rpm xen-runtime-4.14.0-4.fc32.x86_64.rpm
IIRC in between those two commands I had to copy the xen packages from the updatevm into dom0.
@marmarek Does the 4.14.0-6 update contain the scheduling bug? If so, it will be unusable for us.
Yes, -6 is -5 + XSA applied (and one other fix, but unrelated to this issue).
Ugh, we'll now need to maintain a fork without the regression.
More investigation is required; the patch that you're reverting is important to people with AMD CPUs.
@dylangerdaly For now I have installed the dnf versionlock extension and added a couple of packages that keep the xen suite at -4:
$ sudo dnf versionlock add xen xen-libs xen-hypervisor
I've uploaded -6.1 with this revert reverted to the unstable repository, you can install it with:
sudo qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update xen-hypervisor
Thank you very much Marek!
Hey @marmarek, any chance we can get an updated xen-hypervisor without the regression again?
I've updated Xen and am noticing the soft-locks.
This approach doesn't scale... Even if I upload "fixed" version again, the problem will return with every other update. I'll try to debug the base issue, but it's some nasty race condition in scheduler that is hard to track :/
I'll have a look at it today, try to help track it down
There seem to be quite a few problems with Renoir CPUs and Xen at the moment. Besides this issue and the temporary fix working, my 4650U is stuck at 2.1GHz, and xenpm get-cpufreq-para says failed to get cpufreq parameter. Also, enabling SMT can sometimes make Xen unable to start/stop VMs, e.g. sys-firewall will start but other VMs won't after, sometimes sys-firewall won't start, sometimes it's fine, etc. There might be other performance-related issues, but I haven't noticed anything else.
Unfortunately, I do not believe a serial connection for early debugging is possible, and at least on my model, debugging over USB seems not to be possible either, but I could be wrong. Is there any other info we can provide that Xen might be able to tell us? I haven't checked what xl dmesg and other commands output when booted without the command-line fixes.
SMT is inherently insecure in the absence of a core scheduler, which Xen doesn’t have. So I would not worry about it.
Should I ask....
Is there evidence that AMD devs are helping Xen Project support their new products?
Should I ask....
Is there evidence that AMD devs are helping Xen Project support their new products?
yes, quite a bit. Though more so for the server products, for obvious reasons.
Another interesting thing I've noticed: when connected to an external 4K screen, I'm hovering around 85C CPU temp; without the screen I'm around 45C.
I think the iGPU isn't configured correctly, or I'm missing an X11 lib? Anyone else noticing this?
I've updated Xen and am noticing the soft-locks.
I think I misspoke here; it was just stuttering because of my 4K display.
yes, quite a bit. Though more so for the server products, for obvious reasons.
Yeah, I can imagine mobile CPUs aren't really looked at.
@tasket have you noticed that when charging, it idles at ~75C, compared to a cool ~45C on battery? Not sure if it's a power management bug or something physical.
@dylangerdaly My system hovers around 42C both on and off AC, with one or two FHD displays. Keep an eye on your CPU usage with xentop; CPU stress is the only thing that seems to raise the temp.
BTW I'm using 5.8.18 in dom0, as the 5.9 kernels won't boot. My UEFI graphics RAM is at the lowest setting (not auto-sized). I did install xorg-x11-drv-amdgpu along with KDE and sddm. There was some method I used to check that X11 was using the AMD driver, but I can't recall it atm.
On edit: You might try changing graphics memory to the maximum in your UEFI settings, since 4K seems like it would demand more.
I'm glad you confirmed the Kernel issues, I'm having the exact same issue, newer kernels are boot looping.
I'll play around with these settings, cheers @tasket
I just did a quick test on a 4K display and there was no temperature increase.
One other variable I can think of right now is the built-in Wifi, which I'm not using at all (not assigned to any running netvm). It would make the rest of the system go haywire after a while, so I'm relying on a NIC in sys-usb.
Qubes OS version: 4.1
Affected component(s) or functionality: Entire OS/Experience
Brief summary: There appears to be something wrong with the CPU; every 3-5 seconds everything will lock up. Here's a GIF for visuals.
I've confirmed this is specific to AMD 4000 CPUs, because 4.1 running on an i7-1065G7 works fine (still at a much slower rate than 4.0.3, but that's beside the point).
To Reproduce
Steps to reproduce the behavior:
Expected behavior: Smooth-as-butter 8-core experience
Actual behavior: Terrible lockups every 3-5 seconds, with full hangs peppered in randomly
Screenshots: See GIF in Brief summary
Additional context: NIL
Solutions you've tried: Not sure how/where to troubleshoot this; I assume it has something to do with Xen.