enclustra-bsp / xilinx-linux


Kernel panic, potential race condition on cadence net driver or phy driver #4

Closed j4munoz closed 5 years ago

j4munoz commented 5 years ago

I got the following trace:

[ 286.677886] macb ff0b0000.ethernet: gem-ptp-timer ptp clock unregistered.
[ 286.698392] net eth0: PHY already attached
[ 287.203848] Unable to handle kernel NULL pointer dereference at virtual address 00000048
[ 287.211860] Mem abort info:
[ 287.214628] Exception class = DABT (current EL), IL = 32 bits
[ 287.220540] SET = 0, FnV = 0
[ 287.223567] EA = 0, S1PTW = 0
[ 287.226699] Data abort info:
[ 287.229563] ISV = 0, ISS = 0x00000006
[ 287.233382] CM = 0, WnR = 0
[ 287.236426] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc06dab8000
[ 287.242873] [0000000000000048] pgd=000000006ceec003, pud=000000006ceec003, *pmd=0000000000000000
[ 287.251811] Internal error: Oops: 96000006 [#1] SMP
[ 287.256665] Modules linked in:
[ 287.259706] CPU: 1 PID: 15 Comm: kworker/1:0 Tainted: G W 4.14.0-00006-g9dd4a8a0b-dirty #11
[ 287.269164] Hardware name: Enclustra XU1 SOM (DT)
[ 287.273860] Workqueue: events_power_efficient phy_state_machine
[ 287.279754] task: ffffffc06d8d2480 task.stack: ffffff80090b8000
[ 287.285660] PC is at test_and_set_bit+0x18/0x38
[ 287.290173] LR is at netif_carrier_off+0x1c/0x68
[ 287.294771] pc : [ffffff80089e7908] lr : [ffffff800880c61c] pstate: 00000145
[ 287.302148] sp : ffffff80090bbd50
[ 287.305446] x29: ffffff80090bbd50 x28: 0000000000000000
[ 287.310741] x27: 0000000000000000 x26: ffffff8008bcf460
[ 287.316036] x25: ffffff80080b2428 x24: ffffffc06ced23d8
[ 287.321331] x23: ffffffc06ced2440 x22: ffffffc06ced2000
[ 287.326626] x21: ffffffc06ced2000 x20: 0000000000000000
[ 287.331921] x19: 0000000000000000 x18: ffffffc06ff92ee0
[ 287.337215] x17: 0000000000000000 x16: 0000000000000000
[ 287.342510] x15: ffffff8008d88000 x14: 0000000000000000
[ 287.347805] x13: ffffffc06ff92e80 x12: 00000042deaea08e
[ 287.353100] x11: 0000000000000000 x10: 0000000000000880
[ 287.358395] x9 : ffffff80090bbd90 x8 : ffffffc06d8d2d60
[ 287.363690] x7 : ffffffc06d8d2600 x6 : 0000000000000000
[ 287.368985] x5 : 0000000000000001 x4 : 0000000000000004
[ 287.374279] x3 : 0000000000000002 x2 : 0000000000000001
[ 287.379574] x1 : 0000000000000048 x0 : 0000000000000000
[ 287.384870] Process kworker/1:0 (pid: 15, stack limit = 0xffffff80090b8000)
[ 287.391813] Call trace:
[ 287.394245] Exception stack(0xffffff80090bbc10 to 0xffffff80090bbd50)
[ 287.400668] bc00: 0000000000000000 0000000000000048
[ 287.408481] bc20: 0000000000000001 0000000000000002 0000000000000004 0000000000000001
[ 287.416293] bc40: 0000000000000000 ffffffc06d8d2600 ffffffc06d8d2d60 ffffff80090bbd90
[ 287.424105] bc60: 0000000000000880 0000000000000000 00000042deaea08e ffffffc06ff92e80
[ 287.431917] bc80: 0000000000000000 ffffff8008d88000 0000000000000000 0000000000000000
[ 287.439729] bca0: ffffffc06ff92ee0 0000000000000000 0000000000000000 ffffffc06ced2000
[ 287.447542] bcc0: ffffffc06ced2000 ffffffc06ced2440 ffffffc06ced23d8 ffffff80080b2428
[ 287.455354] bce0: ffffff8008bcf460 0000000000000000 0000000000000000 ffffff80090bbd50
[ 287.463166] bd00: ffffff800880c61c ffffff80090bbd50 ffffff80089e7908 0000000000000145
[ 287.470978] bd20: ffffff80090bbdb0 ffffff80089fea1c ffffffffffffffff ffffffc06d8d2480
[ 287.478789] bd40: ffffff80090bbd50 ffffff80089e7908
[ 287.483651] [ffffff80089e7908] test_and_set_bit+0x18/0x38
[ 287.489206] [ffffff8008600fbc] phy_link_change+0x2c/0x68
[ 287.494675] [ffffff80085ff470] phy_state_machine+0x1d8/0x610
[ 287.500493] [ffffff80080b22bc] process_one_work+0x1dc/0x348
[ 287.506219] [ffffff80080b2470] worker_thread+0x48/0x488
[ 287.511601] [ffffff80080b8104] kthread+0x12c/0x130
[ 287.516549] [ffffff8008084a90] ret_from_fork+0x10/0x18
[ 287.521844] Code: d2800022 8b400c21 f9800031 9ac32044 (c85f7c22)
[ 287.527919] ---[ end trace b060ced5458db831 ]---

It is very similar to https://lore.kernel.org/patchwork/patch/1037164/, but for the XU1 we use the cadence driver, not hns3. I've seen that the master branch of Torvalds' kernel tree has some modifications to the cadence macb_main.c and macb_ptp.c files.
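One way to read the trace: the faulting address is 0x48, LR points into netif_carrier_off, and the caller is phy_link_change running from the phy_state_machine worker. That pattern is what you would expect if a NULL struct pointer had a field at offset 0x48 accessed through it, for example if the PHY's attached net device was already gone when the worker ran. The sketch below is not kernel code. The struct names, the 0x48 padding, and the bit number 2 are assumptions chosen to match the register dump (x1 = 0x48, x3 = 2, x4 = 4), and `link_down_checked` is a hypothetical helper that only shows the kind of guard that avoids the dereference.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net_device. Only the offset of
 * `state` matters; 0x48 is assumed so that it matches the faulting
 * address in the trace. */
struct fake_net_device {
    unsigned char leading_fields[0x48];
    unsigned long state;
};

/* Hypothetical stand-in for struct phy_device. */
struct fake_phy_device {
    struct fake_net_device *attached_dev;
};

/* Assumed bit number of the "no carrier" flag (x3 = 2 in the dump). */
enum { FAKE_LINK_STATE_NOCARRIER = 2 };

/* Address that &dev->state would resolve to, computed with integer
 * arithmetic so that nothing is dereferenced. */
static uintptr_t state_address(const struct fake_net_device *dev)
{
    return (uintptr_t)dev + offsetof(struct fake_net_device, state);
}

/* Guarded link-down: returns -1 instead of touching a NULL device,
 * 0 after setting the no-carrier bit. */
static int link_down_checked(struct fake_phy_device *phy)
{
    struct fake_net_device *dev = phy->attached_dev;

    if (dev == NULL)
        return -1;
    dev->state |= 1UL << FAKE_LINK_STATE_NOCARRIER;
    return 0;
}
```

On a typical platform where a null pointer converts to integer 0, `state_address(NULL)` evaluates to 0x48, the same value as the "virtual address 00000048" in the oops. A guard like this only hides the symptom; if the underlying problem is a race between the worker and the detach path, the ordering between the two still needs fixing.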

j4munoz commented 5 years ago

Hello, I took the following files from the Linux git repo at tag "v4.14.106" (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/?h=v4.14.106):

After that, I recompiled the kernel and did not get any more kernel crashes. However, I have not tested it thoroughly. I took the idea from https://lore.kernel.org/patchwork/patch/831639/, which reverts a commit in the phy.c code. Hence I thought that updating the phy-related sources could fix it.

tgorochowik commented 5 years ago

Thank you for the report, @j4munoz!

We merged our master branch to the latest release from Xilinx, and it includes the fix you mentioned (see: https://github.com/enclustra-bsp/xilinx-linux/commit/ebc8254aeae34226d0bc8fda309fd9790d4dccfe).

You can use the master branch of our repo - the fix will be included in the next release of the BSP.