acooks / tn40xx-driver

Linux driver for tn40xx from Tehuti Networks

PHY init failed on Linux 6.8 #73

Open cahz opened 3 months ago

cahz commented 3 months ago

With the latest develop version (which is required for Linux 6.8), I cannot get our TN9710P (with MV88X3310) to initialize.

Loading the module leads to the following output:

[  878.238757] tn40xx: Tehuti Network Driver from https://github.com/acooks/tn40xx-driver, linux-6.7.y-1
[  878.238761] tn40xx: Supported phys : MV88X3120 MV88X3310  QT2025 TLK10232 AQR105 MUSTANG 
[  878.238885] tn40xx 0000:02:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x2 mrrs 0x2
[  878.707481] tn40xx 0000:02:00.0: PHY init failed

I noticed that the check in tn.c:444 fails. Replacing the condition with !phy_id, it continues a bit further, but later fails:

[  878.347776] tn40xx 0000:02:00.0: PHY detected ID=2B09AA - MV88X3310 (A0) 10Gbps 10GBase-T
[  878.707473] MV88X3310 Initialization Error. Expected 0x000A, read 0xFFFF

TerminalAddict commented 3 months ago

Same here: upgrading from Proxmox 7.4 to 8.2 causes the failure.

syslog:2024-08-19T16:03:40.094601+12:00 homeworld kernel: [  686.982348] tn40xx: Tehuti Network Driver from https://github.com/acooks/tn40xx-driver, linux-6.7.y-1
syslog:2024-08-19T16:03:40.094609+12:00 homeworld kernel: [  686.982350] tn40xx: Supported phys :    QT2025 TLK10232 AQR105 MUSTANG
syslog:2024-08-19T16:03:40.094609+12:00 homeworld kernel: [  686.982461] tn40xx 0000:01:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x1 mrrs 0x2
syslog:2024-08-19T16:03:40.095662+12:00 homeworld kernel: [  686.982598] tn40xx 0000:01:00.0: PHY init failed

qume commented 2 months ago

Same here with Proxmox and a TN9310 card.

[    9.106643] tn40xx 0000:01:00.0: enabling device (0000 -> 0002)
[    9.106777] tn40xx 0000:01:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x1 mrrs 0x2
[    9.106918] tn40xx 0000:01:00.0: PHY init failed

demonfoo commented 1 month ago

I tried booting Linux Mint 22 on a machine of mine with a StarTech ST10GSPEXNB NIC, and had the same problem; I've engaged in some troubleshooting, and found that changing line 444 of tn40.c to:

        if (phy_id == 0)

gets further, but it doesn't seem to like something it's doing in the bdx_mdio_set_speed() function, when it tries to call it during bdx_phy_init():

[12712.353749] tn40xx 0000:06:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x0 mrrs 0x2
[12712.353968] bdx_mdio_set_speed(): test 1; mdio_cfg is 00003ec0
[12712.353970] bdx_mdio_set_speed(): test 2; mdio_cfg is 00003ec8
[12712.455736] bdx_phy_init(): test 1, phy_type = 0x00000004
[12712.455743] tn40xx 0000:06:00.0: PHY detected ID=2B09AB - MV88X3310 (A1) 10Gbps 10GBase-T
[12712.455752] bdx_mdio_set_speed(): test 1; mdio_cfg is 00003ec8
[12712.455753] bdx_mdio_set_speed(): test 2; mdio_cfg is 00000a48
[12712.815491] MV88X3310 Initialization Error. Expected 0x000A, read 0xFFFF
[12712.815499] bdx_phy_init(): test 2
[12712.815501] tn40xx 0000:06:00.0: PHY init failed

Unfortunately I'm not sure what values it's expecting, but the NIC doesn't like what it's getting.

acooks commented 1 month ago

[12712.455743] tn40xx 0000:06:00.0: PHY detected ID=2B09AB - MV88X3310 (A1) 10Gbps 10GBase-T

Is there something unclear in the Readme about Marvell PHYs not being supportable? Or perhaps it isn't obvious to people when they have a Marvell PHY?

demonfoo commented 1 month ago

@acooks What? Prior to the change I made, it didn't provide any useful error. I have the firmware image, and based on nm output the firmware image for it is present in the .ko file. Are you saying those just don't work anymore at all? Or what?

acooks commented 1 month ago

I'm saying that the output you posted shows that you have an MV88X3310 phy, and those Marvell PHYs cannot be supported in this driver due to licensing issues, as I have already explained several times. Clearly the problem is in my explanation.

robanderson commented 1 month ago

I'm having the same issue here on Proxmox 8.2 running Linux 6.8.12-2-pve. Unfortunately I don't have a paid support subscription to ask Proxmox to sort as I'm just a home user trying to learn AI.

03:00.0 Ethernet controller [0200]: Tehuti Networks Ltd. TN9310 10GbE SFP+ Ethernet Adapter [1fc9:4022]
    Subsystem: Edimax Computer Co. 10 Gigabit Ethernet SFP+ PCI Express Adapter [1432:8103]
    Flags: fast devsel, IRQ 16, NUMA node 0, IOMMU group 52
    Memory at 3800ffe00000 (64-bit, prefetchable) [size=64K]
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Power Management version 3
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Kernel modules: tn40xx

I have checked and the vendor and device ID do not (as yet) appear on the list of problem cards.

root@T7920-1:~# dmesg | grep tn40
[   29.907554] tn40xx: module verification failed: signature and/or required key missing - tainting kernel
[   29.909497] tn40xx: Tehuti Network Driver from https://github.com/acooks/tn40xx-driver, linux-6.7.y-1
[   29.909500] tn40xx: Supported phys :    QT2025 TLK10232 AQR105 MUSTANG 
[   29.917261] tn40xx 0000:03:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x0 mrrs 0x2
[   29.917458] tn40xx 0000:03:00.0: PHY init failed

Uname

root@T7920-1:~# uname -r
6.8.12-2-pve

ChatGPT seems to think there have been some changes to the NAPI functionality, but AI can hallucinate. So this might or might not be helpful.

Changes in the NAPI Interface in Kernel 6.8

napi_complete Replaced with napi_complete_done: In kernel 6.8, the napi_complete function has been replaced by napi_complete_done. The new function napi_complete_done requires an additional parameter: the number of packets processed (work_done). This change is intended to improve the NAPI polling mechanism by providing more accurate information about the amount of work completed.

Possible Changes in the bdx_poll Function Signature: The napi_poll function signature may have changed, although in most kernels, it remains:

    int (*poll)(struct napi_struct *napi, int budget);

If the signature has changed in your kernel version, you will need to adjust it accordingly.
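
For reference, the general shape of a NAPI poll handler can be sketched in user space. This is a mock, not kernel or driver code: `mock_napi`, `mock_napi_complete_done` and `mock_poll` are invented stand-ins that only mirror the contract of the poll callback (process at most `budget` packets, return the number processed, and signal completion when the budget was not used up). It makes no claim about what did or did not change in 6.8.

```c
#include <stdbool.h>

/* Invented stand-in for struct napi_struct. */
struct mock_napi {
    int  pending;    /* packets waiting in the fake RX ring */
    bool scheduled;  /* true while polling should continue */
};

/* Invented stand-in for napi_complete_done(): records that polling
 * is finished. work_done is accepted only to mirror the real call. */
static bool mock_napi_complete_done(struct mock_napi *napi, int work_done)
{
    (void)work_done;
    napi->scheduled = false;
    return true;
}

/* Same shape as the poll callback: takes the NAPI context and a budget,
 * returns how many packets were processed. */
static int mock_poll(struct mock_napi *napi, int budget)
{
    int work_done = 0;

    while (work_done < budget && napi->pending > 0) {
        napi->pending--;   /* "receive" one packet */
        work_done++;
    }
    /* Budget not exhausted: the ring is empty, so stop polling. */
    if (work_done < budget)
        mock_napi_complete_done(napi, work_done);
    return work_done;
}
```

With five pending packets and a budget of three, the first call uses the whole budget and stays scheduled; the second call drains the rest and completes.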

DatPat commented 1 month ago

I'm saying that the output you posted shows that you have an MV88X3310 phy, and those Marvell PHYs cannot be supported in this driver due to licensing issues, as I have already explained several times. Clearly the problem is in my explanation.

My understanding was that these NICs could be supported if the appropriate firmware was provided prior to the compilation process. Am I wrong in this?

DatPat commented 1 month ago

So I run an STLab N-480 and I have the following issue: bdx_mdio_scan_phy_id finds the PHY ID 2b09ab on port 0, which looks like a valid value to me; however, the driver does not appear to treat port 0 as a valid port.

    phy_id = bdx_mdio_scan_phy_id(priv); /* set phy_mdio_port */

if (!priv->phy_mdio_port){
    dev_err(&priv->pdev->dev, "No PHY detected on MDIO bus.");
    return PHY_TYPE_NA; /* No PHY detected on MDIO bus. */
}

There is an explicit check on the port being 0 so I am confused as to how to proceed as I know nothing about the hardware.

    i = bdx_mdio_look_for_phy(priv,*port_t);
    if (i >= 0)  // PHY  found

The original code uses a signed index and treats port (i) == 0 as valid. I could really use some help here.

[ 3322.357350] tn40xx: Driver unloaded
[ 3329.290170] tn40xx: Tehuti Network Driver from https://github.com/acooks/tn40xx-driver, linux-6.7.y-1
[ 3329.290173] tn40xx: Supported phys : MV88X3310 QT2025 TLK10232 AQR105 MUSTANG
[ 3329.290302] tn40xx 0000:67:00.0: srom 0x0 HWver 16 build 0 lane# 2 max_pl 0x0 mrrs 0x2
[ 3329.398797] tn40xx 0000:67:00.0: phy_id 2b09ab
[ 3329.398802] tn40xx 0000:67:00.0: priv->phy_mdio_port 0
[ 3329.398804] tn40xx 0000:67:00.0: PHY detected ID=2B09AB - MV88X3310 (A1) 10Gbps 10GBase-T
[ 3329.758550] MV88X3310 Initialization port detected 0
[ 3332.974543] MV88X3310 initdata applied
[ 3332.974639] MV88X3310 I/D version is 0.3.4.0
[ 3333.159878] tn40xx 0000:67:00.0 eth0: fw 0xe
[ 3333.159890] tn40xx 0000:67:00.0 eth0: Port A
[ 3333.159920] tn40xx 0000:67:00.0: 1 1fc9:4027:1fc9:3015
[ 3333.159955] tn40xx: detected 1 cards, 1 loaded
[ 3333.161427] tn40xx 0000:67:00.0 enp103s0: renamed from eth0

pat@pat:~/Code/driver$ uname -r
6.8.0-45-generic

Now it works: connectivity, data, everything.

I don't understand how port 0 can be valid on my card when this is clearly a cause for error in the driver as is.
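
One way to look at the confusion above: if the scan result is stored where 0 doubles as "nothing found", a PHY that really sits at port 0 cannot be told apart from an empty bus. The sketch below is not the driver's code; `scan_for_phy`, `fake_bus` and `PHY_PORT_NONE` are hypothetical names used only to illustrate returning a signed index with a negative sentinel, which is the convention the `i >= 0` check quoted above relies on.

```c
#include <stdint.h>

#define MDIO_PORTS     32      /* MDIO port addresses are 5 bits wide: 0..31 */
#define PHY_PORT_NONE  (-1)    /* hypothetical sentinel: no PHY found */

/* Hypothetical stand-in for the MDIO bus: one PHY ID per port,
 * 0 where nothing answers. */
struct fake_bus {
    uint32_t id_at_port[MDIO_PORTS];
};

/* Return the first port with a non-zero PHY ID, or PHY_PORT_NONE.
 * Port 0 is a legitimate result, so the caller must test for "< 0"
 * rather than for "== 0". */
static int scan_for_phy(const struct fake_bus *bus, uint32_t *phy_id)
{
    for (int port = 0; port < MDIO_PORTS; port++) {
        if (bus->id_at_port[port] != 0) {
            *phy_id = bus->id_at_port[port];
            return port;
        }
    }
    *phy_id = 0;
    return PHY_PORT_NONE;
}
```

With this shape, a caller checks `port < 0` to detect an empty bus, and a PHY found at port 0 (as in the log above) is accepted like any other.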

Here is some information about my card:

67:00.0 Ethernet controller [0200]: Tehuti Networks Ltd. TN9710P 10GBase-T/NBASE-T Ethernet Adapter [1fc9:4027]
    Subsystem: Tehuti Networks Ltd. Ethernet Adapter [1fc9:3015]
    Flags: bus master, fast devsel, latency 0, IRQ 204, IOMMU group 17
    Memory at fc65300000 (64-bit, prefetchable) [size=64K]
    Capabilities:
    Kernel driver in use: tn40xx
    Kernel modules: tn40xx

DatPat commented 1 month ago

With the latest develop version (which is required for Linux 6.8), I cannot get our TN9710P (with MV88X3310) to initialize.

Loading the module leads to the following output:

[  878.238757] tn40xx: Tehuti Network Driver from https://github.com/acooks/tn40xx-driver, linux-6.7.y-1
[  878.238761] tn40xx: Supported phys : MV88X3120 MV88X3310  QT2025 TLK10232 AQR105 MUSTANG 
[  878.238885] tn40xx 0000:02:00.0: srom 0x0 HWver 16 build 0 lane# 4 max_pl 0x2 mrrs 0x2
[  878.707481] tn40xx 0000:02:00.0: PHY init failed

I noticed that the check in tn.c:444 fails. Replacing the condition with !phy_id, it continues a bit further, but later fails:

[  878.347776] tn40xx 0000:02:00.0: PHY detected ID=2B09AA - MV88X3310 (A0) 10Gbps 10GBase-T
[  878.707473] MV88X3310 Initialization Error. Expected 0x000A, read 0xFFFF

This is the exact issue I had. To fix it, you need to set port to 0 at the top of 'MV88X3310_mdio_reset'. When I did that, it started working for me.
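
A read of 0xFFFF is what an MDIO bus typically returns when no device answers at the addressed port, because the data line idles high. That would fit the fix described above, though nothing in this thread confirms it. The following is a toy simulation, not driver code: `sim_mdio_read`, `sim_phy_ready`, `SIM_PHY_PORT` and the register number are made up for illustration; only the values 0x000A and 0xFFFF come from the log.

```c
#include <stdint.h>

#define SIM_PHY_PORT   0        /* assumption: the PHY sits at port 0 */
#define SIM_STATUS_REG 0x10     /* hypothetical register number */
#define SIM_READY      0x000A   /* value the log says was expected */
#define MDIO_NO_REPLY  0xFFFF   /* all ones: nobody drove the bus */

/* Toy MDIO read: only SIM_PHY_PORT answers; every other port
 * address floats high and reads back as 0xFFFF. */
static uint16_t sim_mdio_read(int port, int reg)
{
    if (port != SIM_PHY_PORT)
        return MDIO_NO_REPLY;
    return (reg == SIM_STATUS_REG) ? SIM_READY : 0x0000;
}

/* Returns 1 when the PHY reports the expected value, 0 otherwise. */
static int sim_phy_ready(int port)
{
    return sim_mdio_read(port, SIM_STATUS_REG) == SIM_READY;
}
```

In this model, `sim_phy_ready(0)` succeeds while any other port address reads 0xFFFF and fails, which is the same symptom as "Expected 0x000A, read 0xFFFF".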