Aquantia / AQtion

Aquantia AQC multigigabit NIC linux driver (atlantic) - development preview
https://www.aquantia.com

Kernel panic after resume from suspend #22

Closed CySlider closed 3 years ago

CySlider commented 3 years ago

I have a strange issue with all kernels past 5.4 on Manjaro with this module. A few seconds after resuming from suspend I get this error:

Nov 23 22:01:55 **** kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Nov 23 22:01:55 **** kernel: #PF: supervisor read access in kernel mode
Nov 23 22:01:55 **** kernel: #PF: error_code(0x0000) - not-present page
Nov 23 22:01:55 **** kernel: PGD 0 P4D 0 
Nov 23 22:01:55 **** kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Nov 23 22:01:55 **** kernel: CPU: 0 PID: 1551 Comm: NetworkManager Tainted: P           OE     5.9.10-1-MANJARO #1
Nov 23 22:01:55 **** kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Professional Gaming, BIOS P3.30 01/15/2018
Nov 23 22:01:55 **** kernel: RIP: 0010:aq_ring_rx_fill+0xd1/0x200 [atlantic]
Nov 23 22:01:55 **** kernel: Code: 45 24 ba 00 00 00 00 83 c0 01 3b 45 28 48 0f 43 c2 89 45 24 41 83 ee 01 0f 84 f3 00 00 00 48 8d 1c 40 48 c1 e3 04 48 03 5d 00 <48> 8b 43 08 48 c7 43 28 00 08 00 00 48 85 c0 75 85 48 8b 45 10 31
Nov 23 22:01:55 **** kernel: RSP: 0018:ffffa4e5955af390 EFLAGS: 00010246
Nov 23 22:01:55 **** kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Nov 23 22:01:55 **** kernel: RDX: 0000000000000000 RSI: 0000000000006100 RDI: ffffa03b126d83b8
Nov 23 22:01:55 **** kernel: RBP: ffffa03b126d83b8 R08: 0000000000000000 R09: 0000000000008000
Nov 23 22:01:55 **** kernel: R10: 00000000ffffffff R11: fffffa337cc292c0 R12: 0000000000001000
Nov 23 22:01:55 **** kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000
Nov 23 22:01:55 **** kernel: FS:  00007f1d5f2e48c0(0000) GS:ffffa03b1ee00000(0000) knlGS:0000000000000000
Nov 23 22:01:55 **** kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 22:01:55 **** kernel: CR2: 0000000000000008 CR3: 0000000fa5e8e000 CR4: 00000000003506f0
Nov 23 22:01:55 **** kernel: Call Trace:
Nov 23 22:01:55 **** kernel:  aq_vec_init+0x8c/0xf0 [atlantic]
Nov 23 22:01:55 **** kernel:  aq_nic_init+0xc3/0x1c0 [atlantic]
Nov 23 22:01:55 **** kernel:  aq_ndev_open+0x19/0x60 [atlantic]
Nov 23 22:01:55 **** kernel:  __dev_open+0xfb/0x1b0
Nov 23 22:01:55 **** kernel:  __dev_change_flags+0x1a5/0x210
Nov 23 22:01:55 **** kernel:  dev_change_flags+0x21/0x60
Nov 23 22:01:55 **** kernel:  do_setlink+0x2bc/0x1160
Nov 23 22:01:55 **** kernel:  ? __nla_validate_parse+0x5f/0x910
Nov 23 22:01:55 **** kernel:  __rtnl_newlink+0x65f/0x9e0
Nov 23 22:01:55 **** kernel:  rtnl_newlink+0x44/0x70
Nov 23 22:01:55 **** kernel:  rtnetlink_rcv_msg+0x13e/0x390
Nov 23 22:01:55 **** kernel:  ? rtnl_calcit.isra.0+0x120/0x120
Nov 23 22:01:55 **** kernel:  netlink_rcv_skb+0x75/0x140
Nov 23 22:01:55 **** kernel:  netlink_unicast+0x242/0x340
Nov 23 22:01:55 **** kernel:  netlink_sendmsg+0x243/0x480
Nov 23 22:01:55 **** kernel:  sock_sendmsg+0x5e/0x60
Nov 23 22:01:55 **** kernel:  ____sys_sendmsg+0x25a/0x2a0
Nov 23 22:01:55 **** kernel:  ? copy_msghdr_from_user+0x6e/0xa0
Nov 23 22:01:55 **** kernel:  ___sys_sendmsg+0x97/0xe0
Nov 23 22:01:55 **** kernel:  __sys_sendmsg+0x81/0xd0
Nov 23 22:01:55 **** kernel:  do_syscall_64+0x33/0x40
Nov 23 22:01:55 **** kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 23 22:01:55 **** kernel: RIP: 0033:0x7f1d5fff0ddd
Nov 23 22:01:55 **** kernel: Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 4a ee ff ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 9e ee ff ff 48
Nov 23 22:01:55 **** kernel: RSP: 002b:00007ffd242d4bc0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
Nov 23 22:01:55 **** kernel: RAX: ffffffffffffffda RBX: 000056439a8ca030 RCX: 00007f1d5fff0ddd
Nov 23 22:01:55 **** kernel: RDX: 0000000000000000 RSI: 00007ffd242d4c00 RDI: 000000000000000c
Nov 23 22:01:55 **** kernel: RBP: 0000000000000051 R08: 0000000000000000 R09: 0000000000000000
Nov 23 22:01:55 **** kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Nov 23 22:01:55 **** kernel: R13: 00007ffd242d4d50 R14: 00007ffd242d4d4c R15: 0000000000000000
Nov 23 22:01:55 **** kernel: Modules linked in: fuse cmac algif_hash algif_skcipher af_alg bnep nct6775 hwmon_vid dm_crypt cbc encrypted_keys trusted tpm squashfs input_leds joydev mousedev hid_plantronics hid_steam btusb btrtl btbcm btintel bluetooth ecdh_generic ecc hid_generic nls_iso8859_1 nls_cp437 vfat fat loop snd_usb_audio snd_usbmidi_lib usbhid snd_rawmidi snd_seq_device hid wmi_bmof mxm_wmi edac_mce_amd kvm_amd amdgpu kvm zfs(POE) irqbypass crct>
Nov 23 22:01:55 **** kernel:  pinctrl_amd gpio_amdpt acpi_cpufreq zcommon(POE) znvpair(POE) spl(OE) uinput vboxnetflt(OE) vboxnetadp(OE) nfsd vboxdrv(OE) auth_rpcgss nfs_acl lockd grace videodev drm sunrpc mc sg crypto_user agpgart nfs_ssc ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 uas usb_storage crc32c_intel xhci_pci sr_mod xhci_hcd cdrom
Nov 23 22:01:55 **** kernel: CR2: 0000000000000008
Nov 23 22:01:55 **** kernel: ---[ end trace 08ae79741d9a6dcf ]---
Nov 23 22:01:55 **** kernel: RIP: 0010:aq_ring_rx_fill+0xd1/0x200 [atlantic]
Nov 23 22:01:55 **** kernel: Code: 45 24 ba 00 00 00 00 83 c0 01 3b 45 28 48 0f 43 c2 89 45 24 41 83 ee 01 0f 84 f3 00 00 00 48 8d 1c 40 48 c1 e3 04 48 03 5d 00 <48> 8b 43 08 48 c7 43 28 00 08 00 00 48 85 c0 75 85 48 8b 45 10 31
Nov 23 22:01:55 **** kernel: RSP: 0018:ffffa4e5955af390 EFLAGS: 00010246
Nov 23 22:01:55 **** kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Nov 23 22:01:55 **** kernel: RDX: 0000000000000000 RSI: 0000000000006100 RDI: ffffa03b126d83b8
Nov 23 22:01:55 **** kernel: RBP: ffffa03b126d83b8 R08: 0000000000000000 R09: 0000000000008000
Nov 23 22:01:55 **** kernel: R10: 00000000ffffffff R11: fffffa337cc292c0 R12: 0000000000001000
Nov 23 22:01:55 **** kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000
Nov 23 22:01:55 **** kernel: FS:  00007f1d5f2e48c0(0000) GS:ffffa03b1ee00000(0000) knlGS:0000000000000000
Nov 23 22:01:55 **** kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 22:01:55 **** kernel: CR2: 0000000000000008 CR3: 0000000fa5e8e000 CR4: 00000000003506f0

I tried every stable kernel after 5.4 and get the same result. 5.4, however, is rock solid.

If I run sudo rmmod atlantic before I suspend, this error does not happen.

I should add that soon after this the whole system freezes up, and shutdown never finishes either.

cail commented 3 years ago

Thanks for the report, we will review it. Have you tried the same with the inbox driver? I mean the atlantic driver that ships with the kernel.

CySlider commented 3 years ago

I'm not sure what you mean. I am working on this with someone else who seems to know his way around, here:

https://forum.manjaro.org/t/all-kernel-after-5-4-crash-on-me-after-suspend-sleep/36431/20

Will continue this after work in a few hours.

CySlider commented 3 years ago

Sorry, now that I understand far more about this topic, I realize that I am using the in-kernel version and not this module. It seems to me that your master most likely already has a fix for my issue.

This is what the resume function looks like in my kernel's code:

    if (deep) {
        ret = aq_nic_init(nic);
        if (ret)
            goto err_exit;
    }

    if (netif_running(nic->ndev)) {
        ret = aq_nic_start(nic);
        if (ret)
            goto err_exit;
    }

versus your master code:

    if (aq_utils_obj_test(&nic->aq_hw->flags, AQ_HW_FLAG_STARTED)) {
        ret = aq_nic_init(nic);
        if (ret)
            goto err_exit;

        ret = aq_nic_start(nic);
        if (ret)
            goto err_exit;
    }

My version seems to initialize things twice if the NIC's flags change after the resume code has run. Most likely this is not an issue you have to fix anymore.

CySlider commented 3 years ago

Installing 2.4.7 via DKMS solved this issue for me. Sorry for bothering you with it.

CySlider commented 3 years ago

I was asked to check with you whether you could upstream the newer version, or a fix for the current upstream version, to solve this.

Here is the summary of the issue:

pobrn: I believe the problem is that

aq_pm_resume_restore()
  -> atl_resume_common(deep=true)
    -> aq_nic_init()

and

aq_ndev_open()
  -> aq_nic_init()

so the device will be initialized twice after resume, which leaves its internal data structures in an invalid state and causes the NULL pointer dereference in the second call to aq_nic_init().

I believe the reason it works the first time is that - as the logs indicate - netif_running() returns true, so the netdev core considers the device to be “running” or in some kind of started state and will not call aq_ndev_open() after the first resume. aq_nic_init() is therefore called only once and everything is fine. After the second resume, however, netif_running() is seemingly false, which I believe means the netdev core considers the device to be in some kind of “stopped” state. It then calls aq_ndev_open() down the line, which triggers the second call to aq_nic_init() and with it the NULL pointer dereference.
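
To make the two call paths above easier to follow, here is a minimal user-space sketch in C. It is only a model under stated assumptions, not driver code: struct mock_nic and every mock_* function are hypothetical stand-ins invented for illustration, the "running" field loosely stands for netif_running() / AQ_HW_FLAG_STARTED, and initialization is reduced to a call counter. It compares the resume logic quoted from the in-kernel version with the one quoted from master.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; NOT the real driver struct. */
struct mock_nic {
    bool running;    /* loosely models netif_running() / AQ_HW_FLAG_STARTED */
    int init_calls;  /* number of init calls since the (modeled) suspend */
};

/* Models aq_nic_init(): only counts calls instead of touching hardware. */
static void mock_nic_init(struct mock_nic *nic)
{
    nic->init_calls++;
}

/* Models aq_ndev_open(): the open path always initializes the NIC. */
static void mock_ndev_open(struct mock_nic *nic)
{
    mock_nic_init(nic);
    nic->running = true;
}

/* Resume as in the quoted in-kernel snippet: a deep resume always inits. */
static void mock_resume_unguarded(struct mock_nic *nic, bool deep)
{
    if (deep)
        mock_nic_init(nic);
    /* aq_nic_start() would follow if the device is running; not modeled. */
}

/* Resume as in the quoted master snippet: init only if it was started. */
static void mock_resume_guarded(struct mock_nic *nic)
{
    if (nic->running)
        mock_nic_init(nic);
}

/*
 * Returns how many times init ran after one deep resume, followed by an
 * open from user space if the interface was not running at resume time.
 */
static int mock_init_calls_after_resume(bool guarded, bool was_running)
{
    struct mock_nic nic = { .running = was_running, .init_calls = 0 };

    if (guarded)
        mock_resume_guarded(&nic);
    else
        mock_resume_unguarded(&nic, true);

    if (!nic.running)
        mock_ndev_open(&nic);  /* e.g. a network manager bringing the link up */

    return nic.init_calls;
}
```

In this model, mock_init_calls_after_resume(false, false) counts two init calls (the unguarded resume plus the later open), while the guarded variant counts one in the same situation. When the interface was already running at resume time, both variants count a single init call, which would match the observation that the first resume works.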

cail commented 3 years ago

Thanks for the confirmation, we'll schedule this fix for the in-kernel version.