Open atomass opened 3 years ago
Hmm, hard to say. Do you have any data points between 4.16 and 5.9? Can you do a bisect?
-12 indicates ENOMEM which can happen in a couple different places including being unable to map the BAR.
The "failed to assign BAR" message is concerning, but may not actually be a problem as it is an iterative process. Can you post the output of lspci -v -s 0000:01:00.1 to check if the BAR is actually assigned?
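For readers unfamiliar with the -12 in the log: kernel functions return negated errno values, so the number can be looked up by its absolute value. A minimal sketch, assuming it runs on a Linux host so that Python's errno table matches the kernel's numbering; the helper name `describe_errno` is made up for illustration:

```python
import errno
import os


def describe_errno(ret: int) -> str:
    """Map a negative kernel-style return code to 'NAME: message'."""
    code = abs(ret)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The value reported by "failed to register ntb device: -12".
print(describe_errno(-12))
```

This only names the error; it does not say which allocation or mapping failed.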
Hi @lsgunth, thank you for your reply. I tried it even on v5.4 but, because support for Gen4 was introduced in mainline v5.6, I git cloned this repo, then built and installed the modules. The results are the same as on v5.9 and v5.10-rc6:
$ sudo modprobe switchtec dyndbg=+p
$ sudo modprobe ntb_hw_switchtec dyndbg=+p
$ dmesg
... snip ...
[ 1158.396927] switchtec 0000:01:00.1: enabling device (0000 -> 0002)
[ 1158.402726] switchtec switchtec0: Management device registered.
[ 1158.403707] switchtec: loaded.
[ 1158.762996] switchtec switchtec0: failed to register ntb device: -12
and here the output you requested:
$ sudo lspci -v -s 0000:01:00.1
01:00.1 Bridge: PMC-Sierra Inc. Device 4000
Subsystem: PMC-Sierra Inc. Device 4000
Flags: bus master, fast devsel, latency 0
Memory at df400000 (64-bit, prefetchable) [size=4M]
Memory at <ignored> (64-bit, prefetchable)
Capabilities: [40] MSI: Enable- Count=1/4 Maskable- 64bit+
Capabilities: [50] MSI-X: Enable+ Count=4 Masked-
Capabilities: [5c] Power Management version 3
Capabilities: [64] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Multicast
Capabilities: [178] Device Serial Number 50-0e-00-4a-00-00-00-01
Capabilities: [7f8] Vendor Specific Information: ID=ffff Rev=1 Len=808 <?>
Kernel driver in use: switchtec
As I mentioned before, if I just reboot the same machine and select Kernel v4.16, without changing any switch configuration, it works smoothly.
Do you have in mind any other Kernel version I can use to help you with the bisect? I would like to avoid to have to check every version between 4.16 and 5.4 without any specific clue.
One more, hopefully useful, comment on this issue. Tracking down the problem, I found out that the driver fails to register because it fails in ntb_hw_switchtec.c at line 1514, where the function switchtec_ntb_init_shared_mw tries to pci_iomap:
sndev->peer_shared = pci_iomap(sndev->stdev->pdev, self_bar, LUT_SIZE);
Tests done so far:
| Kernel | Driver source | Result |
|---|---|---|
| v4.16 | Github (switchtec-kernel tag: v0.2-rc1) | OK |
| v5.4 | Github (switchtec-kernel branch: master) | NOK (switchtec switchtec0: failed to register ntb device: -12) |
| v5.9 & v5.10 | Shipped with the Kernel | NOK (switchtec switchtec0: failed to register ntb device: -12) |
The problem is almost certainly due to the 2nd BAR being unassigned. (You can see it as ignored in your lspci dump). This could be a bios bug or simply that there isn't enough address space.
I'm a bit surprised that there's a difference here between the kernel versions, but I guess it's not impossible. There is some code to fix up these bios bugs but it's non-trivial and a bit buggy.
It also looks like you only have 32bit PCI addressing which means address space may be limited for a large bar. How large have you set that BAR? I'd probably look at enabling 64bit decoding (usually a bios option) so there's more address space for that BAR. It might have only squeezed in with v4.16.
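Spotting an unassigned BAR in an lspci dump can be automated. The following is a rough sketch, not part of any switchtec tooling: it scans `lspci -v` text for "Memory at" lines and flags those whose address is `<ignored>`. The function name `memory_regions` and the dictionary layout are assumptions made for this example; the sample text is copied from the dump earlier in this thread.

```python
import re

# Matches e.g. "Memory at df400000 (64-bit, prefetchable) [size=4M]"
# and "Memory at <ignored> (64-bit, prefetchable)".
MEM_RE = re.compile(r"^\s*Memory at (\S+) \(([^)]*)\)")
SIZE_RE = re.compile(r"\[size=(\w+)\]")


def memory_regions(lspci_text: str) -> list:
    """Return one dict per 'Memory at' line found in lspci -v output."""
    regions = []
    for line in lspci_text.splitlines():
        m = MEM_RE.match(line)
        if not m:
            continue
        size = SIZE_RE.search(line)
        regions.append({
            "addr": m.group(1),
            "attrs": m.group(2),
            "size": size.group(1) if size else None,
            # lspci prints "<ignored>" when the BAR has no usable address.
            "assigned": m.group(1) != "<ignored>",
        })
    return regions


SAMPLE = """\
01:00.1 Bridge: PMC-Sierra Inc. Device 4000
Flags: bus master, fast devsel, latency 0
Memory at df400000 (64-bit, prefetchable) [size=4M]
Memory at <ignored> (64-bit, prefetchable)
"""

for region in memory_regions(SAMPLE):
    status = "assigned" if region["assigned"] else "NOT assigned"
    print(region["addr"], region["size"], status)
```

On a live system the text could come from `sudo lspci -v -s 0000:01:00.1` instead of the embedded sample.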
You were absolutely right. BAR2 size was 64M; decreasing it to 16M solved the issue on kernel versions >= 5.4.
The extremely strange behaviour is that kernel 4.16 was able to correctly assign the BAR (I ran the same lspci command and the output was good).
I will check if in BIOS there's a setting to enable 64bit decoding. Just a side question: how did you figure out PCI addressing was only using 32bit?
Thank you very much for your support
The address of BAR0 was 0xdf400000 which is under 32bits. When 64bit decoding is turned on, the 64bit PCI bars tend to be in a large region well above the 32bit address space.
It's very odd that such a small 64MB bar was not assignable. Usually there's more space than that, but this is very likely a bios bug. Turning on 64bit decoding is probably the easiest solution.
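The "under 32 bits" observation amounts to comparing the BAR address against the 4 GiB boundary. A small sketch of that check, using the addresses quoted in the lspci dumps in this thread; the helper name `below_4g` is invented here, and a BAR sitting below 4 GiB is only a hint about the firmware's decoding setup, not proof:

```python
FOUR_GIB = 1 << 32  # first address that no longer fits in 32 bits


def below_4g(addr_hex: str) -> bool:
    """True if a hex BAR address (as printed by lspci) is below 4 GiB."""
    return int(addr_hex, 16) < FOUR_GIB


# Addresses taken from the lspci output posted in this thread.
for addr in ("df400000", "f3000000", "f2000000"):
    print(addr, "below 4 GiB:", below_4g(addr))
```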
Unfortunately there's no such option in the current BIOS. Maybe we should consider changing motherboards.
I have another very odd situation now: both PCs are running kernel v5.10-rc6 and the ntb_hw_switchtec drivers load fine on both machines. Then I modprobe ntb_netdev on PC1: everything is fine. But as soon as I modprobe ntb_netdev on PC2, I receive this error message on both:
[ 480.146143] switchtec switchtec0: MW 0: part 0 addr 0x000000041df00000 size 0x0000000000200000
[ 480.146149] switchtec switchtec0: ERROR: Memory window address is not aligned to it's size!
[ 480.146159] switchtec 0000:01:00.1: Unable to set mw0 translation
[ 480.146164] switchtec switchtec0: MW 0: part 0 addr 0x0000000000000000 size 0x0000000000000000
On the other hand, if both machines run v4.16, ntb_netdev loads correctly and iperf3 shows a nice ~10 Gbps.
Any clue?
$ sudo lspci -v -s 0000:01:00.1
01:00.1 Bridge: PMC-Sierra Inc. Device 4000
Subsystem: PMC-Sierra Inc. Device 4000
Flags: bus master, fast devsel, latency 0
Memory at f3000000 (64-bit, prefetchable) [size=4M]
Memory at f2000000 (64-bit, prefetchable) [size=16M]
Capabilities: [40] MSI: Enable- Count=1/4 Maskable- 64bit+
Capabilities: [50] MSI-X: Enable+ Count=4 Masked-
Capabilities: [5c] Power Management version 3
Capabilities: [64] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Multicast
Capabilities: [178] Device Serial Number 50-0e-00-4a-00-00-00-01
Capabilities: [7f8] Vendor Specific Information: ID=ffff Rev=1 Len=808 <?>
Kernel driver in use: switchtec
Kernel modules: switchtec
I really find it hard to believe there is no 64bit decoding option. Sometimes they are named weird things. What motherboard are you using?
Looks like the bios is really buggy as the BAR was assigned to an unaligned address. That really won't work. You might have to shrink it even further to get it aligned.
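The alignment rule behind the "not aligned to it's size" message can be illustrated with a few lines of arithmetic. This is a sketch of the usual power-of-two alignment test, assuming the driver's check is equivalent to it; the function name `is_aligned` is made up, and the values are the ones quoted in the dmesg and lspci output in this thread:

```python
def is_aligned(addr: int, size: int) -> bool:
    """True if addr is a multiple of size, where size is a power of two."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    # For a power-of-two size, the low bits of an aligned address are all zero.
    return addr & (size - 1) == 0


# Memory window address and size from the dmesg line in this thread.
mw_addr, mw_size = 0x000000041DF00000, 0x200000
print(is_aligned(mw_addr, mw_size))   # False
print(hex(mw_addr & (mw_size - 1)))   # 0x100000, the offset into a 2M-aligned block

# BAR addresses and sizes from the second lspci dump (4M and 16M).
print(is_aligned(0xF3000000, 4 << 20))   # True
print(is_aligned(0xF2000000, 16 << 20))  # True
```

The logged address is 1M past a 2M boundary, which is why a window of size 0x200000 is rejected.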
Here you can find my current motherboard's manual: https://download1.gigabyte.com/Files/Manual/mb_manual_ga-z97(h97)-d3h_v1.1_e.pdf
The oddest thing, the one that prevents me from sleeping at night, is that when both PCs run Linux kernel v4.16 everything runs smoothly and iperf shows a transfer rate of ~10 Gbps.
Yup, I don't see an option in that manual. Possibly a combination of it being so old and just a consumer motherboard.
It's very odd to see the supposedly same machine assign such wildly different PCI addresses just based on a different kernel version. I know there have been a few minor changes to the kernel code that fixes up addresses assigned by broken bioses. It could have broken your use case; I do know that code is quite fragile. All I could suggest is to bisect between the kernel versions based on the PCI addresses assigned to the cards.
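When bisecting on assigned addresses, it helps to record the same data after every boot. One possible approach, sketched below, reads the device's `resource` file in sysfs, where each line holds start, end and flags in hex and an unassigned entry is all zeros. The function names and the sample text are illustrative assumptions; the sample start/end values are derived from the 4M BAR at df400000 shown earlier, and its flags column is a placeholder.

```python
import platform
from pathlib import Path


def parse_resource(text: str) -> dict:
    """Map BAR index -> (start, size) for populated entries among the first six lines."""
    bars = {}
    for index, line in enumerate(text.splitlines()[:6]):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(f, 16) for f in fields)
        if start == 0 and end == 0:
            continue  # unassigned, unused, or upper half of a 64-bit BAR
        bars[index] = (start, end - start + 1)
    return bars


def snapshot(bdf: str = "0000:01:00.1"):
    """Return (running kernel release, assigned BARs) for one PCI device."""
    text = Path(f"/sys/bus/pci/devices/{bdf}/resource").read_text()
    return platform.release(), parse_resource(text)


# Illustrative sample: BAR0 assigned, everything else empty.
SAMPLE = "\n".join(
    ["0x00000000df400000 0x00000000df7fffff 0x000000000014220c"]
    + ["0x0000000000000000 0x0000000000000000 0x0000000000000000"] * 6
)

for index, (start, size) in parse_resource(SAMPLE).items():
    print(f"BAR{index}: start={start:#x} size={size:#x}")
```

Calling `snapshot()` after booting each candidate kernel and logging the result gives a simple good/bad signal: a kernel is "bad" for this purpose when the second BAR is missing from the returned dictionary.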
Have you solved the issue yet? I also encountered the same issue with a Dolphin PM40036 PFX 36xG4.
[ 673.144040] switchtec: loading out-of-tree module taints kernel.
[ 673.144084] switchtec: module verification failed: signature and/or required key missing - tainting kernel
[ 673.144836] switchtec 0000:01:00.1: enabling device (0000 -> 0002)
[ 673.149496] switchtec switchtec0: Management device registered.
[ 673.150487] switchtec: loaded.
[ 673.391363] switchtec switchtec0: failed to register ntb device: -12
Kernel: 5.4.0-104-generic
switchtec driver was downloaded from Github.
It failed here because the driver cannot find a BAR for crosslink. Is there anything that needs to be modified before the switchtec driver can be loaded properly?
I git cloned, built and installed Kernel v5.10-rc6, but here the ntb_hw_switchtec driver fails to register:
Plus a bunch of
Full dmesg: v5.10-rc6
I experienced the same behaviour on Kernel v5.9 shipped with Ubuntu 20.04. On the other hand, the same exact setup works flawlessly on Kernel v4.16.
Are there any known issues with Kernel version > 5.9?