Syniurge / i2c-amd-mp2

DKMS-ready driver for AMD PCI-E MP2 I2C controllers

Iommu issue with touchpad driver #18

Closed: RemyL closed this issue 3 years ago

RemyL commented 3 years ago

Hello,

I'm reporting an issue with the in-kernel driver (I don't know if this is the right place to report it, or whether you also maintain the kernel driver).

Here is my laptop: Lenovo Yoga 530-14ARR (81H9) with an AMD Ryzen 2500U and integrated Vega 8 graphics, BIOS up to date. I'm using Fedora 33 with kernel 5.9.12.

You will find all the information that brought me here: https://ask.fedoraproject.org/t/kernel-panic-because-of-amd-iommu-amd-vi/10945

So here is what I found: i2c_amd_mp2_plat causes IOMMU issues on my laptop. If I don't change anything, I get a kernel panic that doesn't show much. By trying multiple configurations and checking the modules linked to the issue, I found that blacklisting i2c_amd_mp2_plat solves my IOMMU issue. However, it deactivates the touchpad and touchscreen. Moreover, as I said in my Fedora report, this module also causes a black screen when I resume from suspend (and I have to reboot with the power button).

Do you have an idea how to solve this?

Thank you for all your work and for working on this driver for Linux.

Edit 1: Today I tried several older kernels; here are the results:

- No issue with the 5.6.6 F32 kernel.
- An issue with the 5.7.17 F32 kernel, but not the same one I have now.
- With the 5.8.6 F32 kernel, the same issue I now have on the 5.9.12 F33 kernel.

Syniurge commented 3 years ago

Oh, I missed the notification for this issue, sorry. I never checked how the driver behaves without the IOMMU, and since you own a Yoga 530 too, the driver uses DMA transfers for I2C messages on it, so the IOMMU may matter quite a bit for DMA.

I run a recent kernel on mine (5.8 or 5.9, I have to check) with Kubuntu and have no such issues; it also resumes fine from suspend.

I don't have much time for 2 or 3 days, but I'll look into it before Friday.

RemyL commented 3 years ago

Ok, thank you so much.

I will try dual-booting Ubuntu with a kernel newer than 5.8 and see what I get (if you can resume from suspend, that would suggest the resume issue and the kernel panic are linked). Moreover, if you want to see my issue, you can just boot a live Fedora 33 and check the dmesg output in a terminal with `sudo dmesg -Hw`.

If you don't have this issue, it means that one of my hardware changes (RAM, SSD or Wi-Fi) causes it. And to clarify: I didn't deactivate the IOMMU on purpose, only because of this issue.

Moreover, it seems we have the same laptop. Which BIOS version do you use? And how do you deal with the 400 MHz cap when the battery is under 20%?

Edit: I tried Kubuntu 20.10 (its stock kernel is 5.8 or newer) and I get exactly the same issue as on Fedora. Do you have the Lenovo Yoga 530-14ARR? And do you use the stock Kubuntu 20.10 kernel or do you build your own?

Edit 2: I tried Kubuntu 20.04.1 (with the stock 5.4 kernel). I still get an IOMMU issue, but a different one: `AMD-Vi: Event logged [IO_PAGE_FAULT device=00:00.0 domain=0x0000 address=[...] flags=0x0050]`. However, the device resumes fine after suspend.

Edit 3: Since we have the same laptop, could you share your impressions of using it on Linux? Does everything work perfectly? Do you have the amdgpu backlight issue? And do you have touchpad issues while the laptop is charging? I know some of my questions are unrelated to your touchpad module, but it would really help me improve my laptop on Linux.
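The `IO_PAGE_FAULT` events above can be tallied per device from saved `dmesg` output. This is a minimal Python sketch, not part of the driver or any kernel tooling; the function names are hypothetical, and the regular expression assumes the `device=`, `domain=`, `flags=` field order of the event line quoted above, skipping over the address field.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical pattern for an AMD-Vi IO_PAGE_FAULT event line. The device may
# carry an optional 4-digit PCI segment prefix; everything between domain= and
# flags= (the address field) is skipped with a non-greedy wildcard.
IO_PAGE_FAULT_RE = re.compile(
    r"IO_PAGE_FAULT device=(?P<device>(?:[0-9a-fA-F]{4}:)?"
    r"[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r" domain=(?P<domain>0x[0-9a-fA-F]+)"
    r".*?flags=(?P<flags>0x[0-9a-fA-F]+)"
)


def parse_page_faults(dmesg_text: str) -> list[dict[str, str]]:
    """Return one {device, domain, flags} dict per IO_PAGE_FAULT line."""
    faults = []
    for line in dmesg_text.splitlines():
        match = IO_PAGE_FAULT_RE.search(line)
        if match:
            faults.append(match.groupdict())
    return faults


def faults_per_device(dmesg_text: str) -> Counter[str]:
    """Count IO_PAGE_FAULT events per PCI device."""
    return Counter(fault["device"] for fault in parse_page_faults(dmesg_text))
```

Feeding it the text of `sudo dmesg` (for example, saved to a file first) shows which PCI device the faults are attributed to and how often.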

RemyL commented 3 years ago

Did you have time to work on this a bit? I tried building kernel 5.10 myself with default settings and I still get the issue.

ccryx commented 3 years ago

Hi, I'm pretty sure I have a similar issue on the same laptop (Lenovo Yoga 530-14ARR), on Arch Linux with kernel 5.10.6-arch1-1, and I've had it for a while now, so it's probably the same as @RemyL's.

The issue is missing IVRS table entries for IOAPIC[4] and IOAPIC[5]. IOAPIC[5] apparently corresponds to PCI device `00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU`, while IOAPIC[4] corresponds to `00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)`, though I have never quite understood how anyone arrives at this conclusion.

By setting `... ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2 ...` on your kernel command line, you can get rid of the dumps (other combinations don't seem to work). However, the option for IOAPIC[4] results in the touchpad and touchscreen not working.

The kernel will output the following messages:

i2c_amd_mp2 AMDI0011:00: initial bus enable failed
i2c_amd_mp2 AMDI0011:01: initial bus enable failed

So currently I'm living with just `ivrs_ioapic[5]=00:00.2` set, plus annoying log spam in the tty.
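To confirm which `ivrs_ioapic` overrides a boot actually used, the kernel command line can be parsed. A minimal Python sketch, assuming the `ivrs_ioapic[ID]=BB:DD.F` form used in this thread; the helper name is hypothetical.

```python
from __future__ import annotations

import re

# Assumed syntax, as used in this thread: ivrs_ioapic[<IOAPIC id>]=<bus>:<dev>.<fn>
IVRS_IOAPIC_RE = re.compile(
    r"^ivrs_ioapic\[(?P<id>\d+)\]=(?P<bdf>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])$"
)


def ivrs_ioapic_overrides(cmdline: str) -> dict[int, str]:
    """Map IOAPIC id -> PCI bus:device.function for every ivrs_ioapic option."""
    overrides: dict[int, str] = {}
    for arg in cmdline.split():
        match = IVRS_IOAPIC_RE.match(arg)
        if match:
            # In this sketch, a later option for the same id replaces an earlier one.
            overrides[int(match.group("id"))] = match.group("bdf").lower()
    return overrides
```

On a running system the input would be the contents of `/proc/cmdline`; an empty result means no override was applied on that boot.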

I hope this helps.

RemyL commented 3 years ago

Can you share what you get from `sudo dmesg -Hw` after booting? For me it's not only log spam, because I also can't resume from suspend with the default settings.

ccryx commented 3 years ago

I saved the dmesg output for the following scenarios:

  1. module `i2c_amd_mp2_plat` not blacklisted, no `ivrs_*` kernel command line option
  2. module `i2c_amd_mp2_plat` not blacklisted, `ivrs_ioapic[5]` kernel command line option
  3. module `i2c_amd_mp2_plat` not blacklisted, both `ivrs_*` kernel command line options
  4. module `i2c_amd_mp2_plat` blacklisted, no `ivrs_*` kernel command line option
  5. module `i2c_amd_mp2_plat` blacklisted, `ivrs_ioapic[5]` kernel command line option
  6. module `i2c_amd_mp2_plat` blacklisted, both `ivrs_*` kernel command line options

Scenarios 3-6 result in no dumps but also no touchpad/touchscreen input. I think this is because the module is what causes the dumps, and the `ivrs_ioapic[4]` option in scenario 3 prevents the module from initializing. Only scenarios 4-6 (i.e. with the module blacklisted) allow me to resume from suspend. Only scenarios 3 and 6 (i.e. with both `ivrs_*` kernel command line options) make the corresponding errors go away, namely:

AMD-Vi: [Firmware Bug]: : IOAPIC[4] not in IVRS table
AMD-Vi: [Firmware Bug]: : IOAPIC[5] not in IVRS table
AMD-Vi: [Firmware Bug]: : No southbridge IOAPIC found
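The IOAPIC IDs named in these warnings are the ones an `ivrs_ioapic[ID]=...` override would have to cover. A small Python sketch (hypothetical helper, matching only the message text quoted above) that extracts them from `dmesg` output:

```python
from __future__ import annotations

import re

# Matches e.g. "AMD-Vi: [Firmware Bug]: : IOAPIC[4] not in IVRS table"
MISSING_IOAPIC_RE = re.compile(r"IOAPIC\[(\d+)\] not in IVRS table")
NO_SOUTHBRIDGE_MSG = "No southbridge IOAPIC found"


def ivrs_firmware_warnings(dmesg_text: str) -> tuple[list[int], bool]:
    """Return (sorted IOAPIC ids missing from IVRS, southbridge-IOAPIC-missing flag)."""
    missing = sorted({int(ioapic_id) for ioapic_id in MISSING_IOAPIC_RE.findall(dmesg_text)})
    no_southbridge = NO_SOUTHBRIDGE_MSG in dmesg_text
    return missing, no_southbridge
```

It only reports which IDs are missing; as noted earlier in the thread, it cannot tell which PCI device each IOAPIC belongs to.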

So I think that while the ivrs_* parameters are necessary for a smooth ride, they are mostly unrelated to this problem. The absence of dumps is only a side effect of ivrs_ioapic[4] breaking the i2c_amd_mp2 module.
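The reasoning above can be checked mechanically by encoding the six scenarios as a table and testing each hypothesis against every row. This Python sketch only restates the reported observations (dumps and touch input for scenarios 1 and 2 are inferred from the statement that scenarios 3-6 have neither); the class and hypothesis names are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    number: int
    blacklisted: bool  # i2c_amd_mp2_plat blacklisted
    ioapic4: bool      # ivrs_ioapic[4]=00:14.0 on the command line
    ioapic5: bool      # ivrs_ioapic[5]=00:00.2 on the command line
    dumps: bool        # kernel dumps seen in dmesg
    touch: bool        # touchpad/touchscreen input works
    resumes: bool      # resume from suspend works
    fw_warnings: bool  # at least one IVRS "[Firmware Bug]" warning remains

    @property
    def module_initializes(self) -> bool:
        # Hypothesis from the thread: ivrs_ioapic[4] stops the module from initializing.
        return not self.blacklisted and not self.ioapic4


S = Scenario
SCENARIOS = [
    S(1, False, False, False, dumps=True, touch=True, resumes=False, fw_warnings=True),
    S(2, False, False, True, dumps=True, touch=True, resumes=False, fw_warnings=True),
    S(3, False, True, True, dumps=False, touch=False, resumes=False, fw_warnings=False),
    S(4, True, False, False, dumps=False, touch=False, resumes=True, fw_warnings=True),
    S(5, True, False, True, dumps=False, touch=False, resumes=True, fw_warnings=True),
    S(6, True, True, True, dumps=False, touch=False, resumes=True, fw_warnings=False),
]


def check_hypotheses(scenarios: list[Scenario]) -> dict[str, bool]:
    """Check each hypothesis from the thread against every scenario."""
    return {
        "dumps only when the module initializes":
            all(s.dumps == s.module_initializes for s in scenarios),
        "touch input only when the module initializes":
            all(s.touch == s.module_initializes for s in scenarios),
        "resume works only when the module is blacklisted":
            all(s.resumes == s.blacklisted for s in scenarios),
        "firmware warnings vanish only with both ivrs_* options":
            all(s.fw_warnings == (not (s.ioapic4 and s.ioapic5)) for s in scenarios),
    }
```

All four checks hold for the reported observations. Scenario 3 still does not resume even though the module fails to initialize there, which is why the resume check keys on blacklisting rather than on initialization.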

Edit: As a side note, I'm currently using the module that ships with Arch's kernel, but blacklisting it and using the one from git via DKMS makes no difference as far as I can tell.

blacklist_no_ivrs_opts.log blacklist_one_ivrs_opt.log blacklist_two_ivrs_opts.log noblacklist_both_ivrs_opts.log noblacklist_no_ivrs_opts.log noblacklist_one_ivrs_opt.log

RemyL commented 3 years ago

OK, thanks for your answer and all your tests. So yes, we get the same issues. I also tried the module from GitHub (blacklisting the kernel one and using DKMS) and, like you, got the same results. I agree with you about the `ivrs_*` parameters; once the issue is solved, we may be able to use them. I tried to investigate by looking at the module source, but I'm not good enough to understand what it does and fix it.

RemyL commented 3 years ago

@ccryx can you try any live USB with a kernel newer than 5.9 (Fedora 33, Debian testing, Manjaro, ...) and tell me whether you get the same issue, and share the dmesg log?

@Syniurge I kept looking into this issue. I don't know whether it comes from this module or from the IOMMU. Here is what I found: it seems to be a PCI error coming from `amd_iommu_int_thread`. I found the relevant code here, lines 819 to 862: https://github.com/torvalds/linux/blob/master/drivers/iommu/amd/iommu.c

RemyL commented 3 years ago

@ccryx @Syniurge After some investigation, I found that the issue appeared after this amd/iommu commit in the kernel: https://gitlab.com/linux-kernel/stable/-/commit/05a0542b456e135f362ba83a17ccff73bac0b92f What I don't know is whether this commit causes the issue and needs to be reverted, or whether it only means that this driver needs a rewrite. I'll try to look into it more if I have time.

RemyL commented 3 years ago

For those who are interested, here is the answer from Syniurge on the Linux kernel Bugzilla, where I posted the issue: https://bugzilla.kernel.org/show_bug.cgi?id=211241 The issue is caused by the IOMMU not handling the page faults triggered by the Wacom device (touchscreen and sensor). Thanks to Syniurge, here is a temporary solution:

Blacklist the wacom module. Here is what I'm going to do: blacklist it on the GRUB kernel command line with `modprobe.blacklist=wacom`. If I need the touchscreen, I'll load the module again with `sudo modprobe wacom`.
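Adding `modprobe.blacklist=wacom` to the GRUB command line can also be scripted. A minimal Python sketch, assuming the option lives in a double-quoted `GRUB_CMDLINE_LINUX` (or `GRUB_CMDLINE_LINUX_DEFAULT`) line in `/etc/default/grub`; the helper name is hypothetical and it only rewrites a single line of text.

```python
from __future__ import annotations

import re

# Matches a GRUB_CMDLINE_LINUX="..." line as found in /etc/default/grub.
CMDLINE_RE = re.compile(r'^(?P<key>GRUB_CMDLINE_LINUX(?:_DEFAULT)?)="(?P<args>[^"]*)"$')
PREFIX = "modprobe.blacklist="


def add_module_blacklist(line: str, module: str) -> str:
    """Return the GRUB line with `module` added to modprobe.blacklist=."""
    match = CMDLINE_RE.match(line.strip())
    if not match:
        raise ValueError(f"not a GRUB_CMDLINE_LINUX line: {line!r}")
    args = match.group("args").split()
    for i, arg in enumerate(args):
        if arg.startswith(PREFIX):
            # Merge into the existing comma-separated blacklist.
            modules = [m for m in arg[len(PREFIX):].split(",") if m]
            if module not in modules:
                modules.append(module)
            args[i] = PREFIX + ",".join(modules)
            break
    else:
        # No existing blacklist option: append a new one.
        args.append(PREFIX + module)
    return f'{match.group("key")}="{" ".join(args)}"'
```

It merges into an existing `modprobe.blacklist=` list instead of adding a second option, and leaves the line unchanged if the module is already listed. The GRUB configuration still has to be regenerated afterwards with the distribution's tool (for example `grub2-mkconfig` on Fedora).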

Edit: I reinstalled Fedora 33 and blacklisted the wacom module. After that, the issue is gone AND the touchscreen still works (I don't know why).

After these changes, it's possible to resume from suspend again. However, not everything is perfect yet, as the amdgpu backlight issue still causes problems after a long suspend.

I guess this issue will be solved soon. One last question for you @Syniurge: do you also have an amdgpu issue that makes resuming impossible after a long suspend? Another issue I have is that the screen rotation sensor sometimes doesn't initialize at boot, but I can't find any log; do you know what I can do about that?

Thank you again for your help and all your work

Syniurge commented 3 years ago

Fixed by the iommu maintainer in the 5.11 kernel, thanks @RemyL.

> One last question for you @Syniurge: Do you also have amdgpu issue that also cause impossible resume after a long suspend ?

Not in recent memory, but I don't use suspend that much on the Yoga 530.

> An other issue that i have is that sometime screen rotation sensor doesn't initiate at boot but i can't find any log, do you know what i can do for that ?

Sorry, I've only checked monitor-sensor a couple of times ever; I don't use screen rotation.