I purchased a PCIe Gen 4 SFP28 NIC with the Intel E810-XXVAM2 controller on Amazon, and would like to install it in the server to get dual 25 Gbps Ethernet on the NAS.
Some of my other gear is starting to come online at 25G, and it would be nice to have a storage target capable of saturating the network!
Intel has a driver download page here: Intel® Network Adapter Driver for E810 Series Devices under Linux*
Interestingly, the SOL console in the BMC is spitting errors like:
[ 11.232990] Unable to handle kernel paging request at virtual address 0021a817ce8721ad
[ 11.240897] Mem abort info:
[ 11.243680] ESR = 0x96000004
[ 11.246733] EC = 0x25: DABT (current EL), IL = 32 bits
[ 11.252039] SET = 0, FnV = 0
[ 11.255082] EA = 0, S1PTW = 0
[ 11.258212] Data abort info:
[ 11.261080] ISV = 0, ISS = 0x00000004
[ 11.264905] CM = 0, WnR = 0
[ 11.267862] [0021a817ce8721ad] address between user and kernel address ranges
[ 11.274986] Internal error: Oops: 96000004 [#1] SMP
[ 11.279852] Modules linked in: ast(+) drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea ghash_ce sysfillrect sysimgblt sha2_ce fb_sys_fops sha256_arm64 sha1_ce mpt3sas(+) drm nvme(+) ixgbe(+) raid_class igb(+) ice(+) nvme_core xfrm_algo scsi_transport_sas mdio i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 11.310860] CPU: 0 PID: 205 Comm: kworker/0:2 Not tainted 5.4.0-198-generic #218-Ubuntu
[ 11.318850] Hardware name: To Be Filled By O.E.M. ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 1.21 11/15/2023
...
[ 11.427088] Call trace:
[ 11.429522] __kmalloc+0xac/0x2d0
[ 11.432825] rh_call_control+0x210/0x938
[ 11.436735] usb_hcd_submit_urb+0x14c/0x3e8
[ 11.440906] usb_submit_urb+0x198/0x590
[ 11.444730] usb_start_wait_urb+0x70/0x160
[ 11.448814] usb_control_msg+0xc4/0x140
This seems to happen after the ASPEED USB port tries initializing?
Another trace, which dies at the same bogus address (0021a817ce8721ad) from a completely different call path, suggesting memory corruption rather than a bug in one specific driver:
[ 11.979019] ice 0004:01:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.4.0
[ 11.989174] Unable to handle kernel paging request at virtual address 0021a817ce8721ad
[ 11.997080] Mem abort info:
[ 11.999862] ESR = 0x96000004
[ 12.002906] EC = 0x25: DABT (current EL), IL = 32 bits
[ 12.008206] SET = 0, FnV = 0
[ 12.011250] EA = 0, S1PTW = 0
[ 12.014379] Data abort info:
[ 12.017247] ISV = 0, ISS = 0x00000004
[ 12.021072] CM = 0, WnR = 0
[ 12.024028] [0021a817ce8721ad] address between user and kernel address ranges
[ 12.031152] Internal error: Oops: 96000004 [#2] SMP
[ 12.036016] Modules linked in: hid_generic usbhid hid ast(+) drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea ghash_ce sysfillrect sysimgblt sha2_ce fb_sys_fops sha256_arm64 sha1_ce mpt3sas(+) drm nvme(+) ixgbe(+) raid_class igb(+) ice(+) nvme_core xfrm_algo scsi_transport_sas mdio i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 12.069019] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G D 5.4.0-198-generic #218-Ubuntu
[ 12.078311] Hardware name: To Be Filled By O.E.M. ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 1.21 11/15/2023
[ 12.087518] Workqueue: events work_for_cpu_fn
[ 12.091862] pstate: a0c00009 (NzCv daif +PAN +UAO)
[ 12.096641] pc : kmem_cache_alloc_trace+0x94/0x278
[ 12.101418] lr : kmem_cache_alloc_trace+0x6c/0x278
[ 12.106195] sp : ffff800010253b20
[ 12.109497] x29: ffff800010253b20 x28: 0000000000000000
[ 12.114796] x27: ffffaebc2bed51cc x26: 0000000000000068
[ 12.120094] x25: ffff680e28007c00 x24: ffffaebc2bed51cc
[ 12.125393] x23: 0000000000028a97 x22: 0000000000000dc0
[ 12.130691] x21: 0000000000000000 x20: ae21a817ce8721ad
[ 12.135990] x19: ffff680e28007c00 x18: ffffaebc2d108538
[ 12.141288] x17: 0000000088e09f7b x16: ffffaebc2c49baf0
[ 12.146586] x15: ffff680e28688530 x14: ffff800010c9f000
[ 12.151885] x13: ffff680e28a0fe00 x12: ffff800010bb5000
[ 12.157183] x11: ffffaebc2d8f43a0 x10: ffff800010bb0000
[ 12.162482] x9 : 0000000000000041 x8 : 0000000000004000
[ 12.167780] x7 : ffffaebc2ddf2818 x6 : ffff680e2815b428
[ 12.173079] x5 : ffffaebc2c460670 x4 : ffff680e2f9f91e0
[ 12.178377] x3 : 0000000000100070 x2 : ae21a817ce8721ad
[ 12.183676] x1 : 0000000000000000 x0 : 5197a916cf8535d2
[ 12.188974] Call trace:
[ 12.191408] kmem_cache_alloc_trace+0x94/0x278
[ 12.195840] alloc_msi_entry+0x3c/0x98
[ 12.199578] __pci_enable_msix_range.part.0+0x3a4/0x5b0
[ 12.204790] __pci_enable_msix_range+0x64/0x90
[ 12.209221] pci_enable_msix_range+0x48/0x58
[ 12.213487] ice_probe+0x6a4/0xc68 [ice]
[ 12.217398] local_pci_probe+0x48/0xa0
[ 12.221135] work_for_cpu_fn+0x24/0x38
[ 12.224871] process_one_work+0x1d0/0x498
[ 12.228868] worker_thread+0x238/0x528
[ 12.232604] kthread+0xf0/0x118
[ 12.235733] ret_from_fork+0x10/0x18
[ 12.239296] Code: 54000e20 b9402261 f940ba60 8b010282 (f8616a81)
[ 12.245377] ---[ end trace 4029d97195803760 ]---
And then the system won't continue booting.
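Since the second oops fires from ice_probe and wedges the boot, one way to at least get the system up again would be to block the ice driver from loading at all, using the kernel's stock module_blacklist= parameter. A sketch (press 'e' on the boot entry in GRUB and append to the linux line):

module_blacklist=ice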
I think I'm running Ubuntu 20.04 on the HL15... it might be worth attempting an upgrade to 24.04 :O
Otherwise maybe I can manually install later Intel drivers?
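For reference, Intel's out-of-tree drivers (including ice) all build the same way; a minimal sketch assuming a tarball from Intel's download page, with the version as a placeholder rather than a specific tested release:

# Build dependencies for the running kernel
sudo apt install build-essential linux-headers-$(uname -r)
tar xzf ice-<version>.tar.gz
cd ice-<version>/src
make
sudo make install
# Swap the in-tree module for the freshly built one
sudo modprobe -r ice && sudo modprobe ice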
On Ampere's recommendation, I'm going to try a ConnectX-5 Mellanox card, the MCX512A-ACAT, instead.
Now I have a spare E810, ready to go into one of my Windows PCs :)
I have the X-5 installed, and it seems to enumerate correctly:
jgeerling@nas01:~$ dmesg | grep mlx5
[ 10.642917] mlx5_core 0004:01:00.0: Adding to iommu group 29
[ 10.643196] mlx5_core 0004:01:00.0: enabling device (0100 -> 0102)
[ 10.643316] mlx5_core 0004:01:00.0: firmware version: 16.27.2048
[ 10.643346] mlx5_core 0004:01:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 11.071754] mlx5_core 0004:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 11.085231] mlx5_core 0004:01:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 11.097374] mlx5_core 0004:01:00.0: Port module event: module 0, Cable plugged
[ 11.097626] mlx5_core 0004:01:00.0: mlx5_pcie_event:294:(pid 542): PCIe slot advertised sufficient power (75W).
[ 11.108877] mlx5_core 0004:01:00.1: Adding to iommu group 31
[ 11.121246] mlx5_core 0004:01:00.1: enabling device (0100 -> 0102)
[ 11.138453] mlx5_core 0004:01:00.1: firmware version: 16.27.2048
[ 11.144495] mlx5_core 0004:01:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 11.451304] mlx5_core 0004:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 11.460281] mlx5_core 0004:01:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 11.484789] mlx5_core 0004:01:00.1: Port module event: module 1, Cable unplugged
[ 11.492484] mlx5_core 0004:01:00.1: mlx5_pcie_event:294:(pid 545): PCIe slot advertised sufficient power (75W).
[ 11.516633] mlx5_core 0004:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 11.787574] mlx5_core 0004:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 54.125946] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[ 54.145352] mlx5_core 0004:01:00.0 enP4p1s0f0: renamed from eth0
[ 54.234030] mlx5_core 0004:01:00.1 enP4p1s0f1: renamed from eth1
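For a quick at-a-glance view of both ports and their carrier state, iproute2's brief mode works (a generic check, not from the original session):

ip -br link show | grep enP4p1s0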
It's not getting an IP address automatically, though. Not sure why.
Not detecting a link...
jgeerling@nas01:~$ ethtool enP4p1s0f1
Settings for enP4p1s0f1:
Supported ports: [ Backplane ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None BaseR RS
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None
Speed: Unknown!
Duplex: Unknown! (255)
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
Current message level: 0x00000004 (4)
link
Link detected: no
I'm using a 10Gtek 25G SFP28 DAC - 3m, 30AWG, Passive... I wonder if this DAC isn't able to work with the card? Weird.
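One quick sanity check is whether the NIC can read the DAC's EEPROM at all: ethtool's module-info dump prints the cable's identifier and compliance codes if the card sees it (a generic diagnostic, not from the original session):

sudo ethtool -m enP4p1s0f1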
When I plug in the DAC, I see the changes:
Supported FEC modes: None BaseR RS # was 'Not Reported'
Advertised FEC modes: None # was 'Not Reported'
Port: Direct Attach Copper # was 'Other'
But it still says Link detected: no
On the switch (Mikrotik 25G), I'm seeing the link negotiated at 25G.
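One thing worth ruling out on a 25G DAC is a FEC mismatch: the ethtool output above shows 'Advertised FEC modes: None', and longer/thinner 25G copper cables (this one is 3m, 30AWG) generally fall in cable classes that expect BaseR or RS-FEC, so a switch defaulting to FEC could bring the link up on its side while the host never does. If the driver/ethtool combo on 20.04 supports it, the encoding can be inspected and forced (rs here is an assumption to match the switch, not a known-correct value):

sudo ethtool --show-fec enP4p1s0f1
sudo ethtool --set-fec enP4p1s0f1 encoding rs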
Strangely, at some point this morning, it looks like the Intel interfaces were giving a bunch of errors:
[49658.032378] pcieport 0003:00:03.0: AER: Corrected error message received from 0003:03:00.0
[49658.032388] ixgbe 0003:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[49658.042484] ixgbe 0003:03:00.0: AER: device [8086:1563] error status/mask=00001000/00002000
[49658.051003] ixgbe 0003:03:00.0: AER: [12] Timeout
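If those corrected errors keep piling up, the kernel also exposes per-device AER counters in sysfs, which makes it easy to tell a one-off from a continuous stream (path assumes the same ixgbe device address from the log):

cat /sys/bus/pci/devices/0003:03:00.0/aer_dev_correctable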
And the Mellanox driver is detecting cable hotplugs:
[58773.924177] mlx5_core 0004:01:00.0: Port module event: module 0, Cable unplugged
[58783.083281] mlx5_core 0004:01:00.1: Port module event: module 1, Cable plugged
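A convenient way to watch these module events live while reseating the cable (assuming util-linux's dmesg, which Ubuntu ships):

sudo dmesg --follow | grep mlx5_core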
Since this is 20.04, and I don't have NetworkManager present (so no nmcli), I ran:
sudo ip link set enP4p1s0f1 down
sudo ip link set enP4p1s0f1 up
And dmesg shows:
[59410.830394] mlx5_core 0004:01:00.1 enP4p1s0f1: Link up
[59410.833804] IPv6: ADDRCONF(NETDEV_CHANGE): enP4p1s0f1: link becomes ready
While ip a shows:
4: enP4p1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 6c:b3:11:29:4d:43 brd ff:ff:ff:ff:ff:ff
inet6 fe80::6eb3:11ff:fe29:4d43/64 scope link
valid_lft forever preferred_lft forever
So now it has an IPv6 link-local address (the fe80:: address is autoconfigured, not DHCP-assigned), but still no IPv4...
Also grabbing hardware details with sudo lshw -C network:
*-network:0 DISABLED
description: Ethernet interface
product: MT27800 Family [ConnectX-5]
vendor: Mellanox Technologies
physical id: 0
bus info: pci@0004:01:00.0
logical name: enP4p1s0f0
version: 00
serial: 6c:b3:11:29:4d:42
capacity: 25Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core firmware=16.27.2048 (MT_0000000080) latency=0 link=no multicast=yes
resources: iomemory:28000-27fff irq:89 memory:280000000000-280001ffffff memory:280004000000-2800047fffff
*-network:1
description: Ethernet interface
product: MT27800 Family [ConnectX-5]
vendor: Mellanox Technologies
physical id: 0.1
bus info: pci@0004:01:00.1
logical name: enP4p1s0f1
version: 00
serial: 6c:b3:11:29:4d:43
capacity: 25Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core duplex=full firmware=16.27.2048 (MT_0000000080) latency=0 link=yes multicast=yes
resources: iomemory:28000-27fff irq:260 memory:280002000000-280003ffffff memory:280004800000-280004ffffff
Huh. Forcing a release/renew grabbed an IP for the interface:
sudo dhclient -r enP4p1s0f1
sudo dhclient enP4p1s0f1
jgeerling@nas01:~$ ip a
...
4: enP4p1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 6c:b3:11:29:4d:43 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.236/24 brd 10.0.2.255 scope global dynamic enP4p1s0f1
valid_lft 7199sec preferred_lft 7199sec
inet6 fe80::6eb3:11ff:fe29:4d43/64 scope link
valid_lft forever preferred_lft forever
Now the question is, will the configuration persist across a reboot?
Nope. But following this Stack Exchange answer, I did the following to make the new card's Ethernet interfaces persist with IPv4 DHCP across reboots:
$ sudo nano /etc/netplan/00-installer-config.yaml
# Add in the interfaces among the others and save:
    enP4p1s0f0:
      dhcp4: true
    enP4p1s0f1:
      dhcp4: true
$ sudo netplan apply
$ sudo dhclient -r enP4p1s0f1
$ sudo dhclient enP4p1s0f1
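For context, since the snippet above only shows the added lines: those stanzas nest under network: / ethernets: in the netplan file. A minimal complete file would look something like this (the rest of the real file's contents are assumed):

network:
  version: 2
  ethernets:
    enP4p1s0f0:
      dhcp4: true
    enP4p1s0f1:
      dhcp4: true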
And now even after a reboot, I'm getting full 25 Gbps bandwidth, yay!
jgeerling@nas01:~$ sudo ethtool enP4p1s0f1
Settings for enP4p1s0f1:
Supported ports: [ Backplane ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None BaseR RS
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None
Speed: 25000Mb/s
Duplex: Full
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
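One way to verify that 25 Gbps figure end to end is an iperf3 run against another fast host on the same switch (the peer address here is hypothetical):

# on another 25G machine:
iperf3 -s
# on the NAS:
iperf3 -c 10.0.2.10 -P 4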
Full docs on Ubuntu's docs site: Configuring networks
I guess the 00-installer-config.yaml is created at system install time, and since this card wasn't present then, its interfaces don't show up there. Ah well. I could create a 99-mellanox.yaml and tack the config on that way, but as this hardware change is likely permanent(ish), I'm happy just throwing the config in the installer file.