Open semperrin opened 2 years ago
Hmmm, it sounds like there may not be an NTB device. Could be a configuration issue with one of the switches or any number of other issues. Do you see anything in dmesg related to switchtec?
After the command:
sudo modprobe switchtec
I get the following in dmesg:
[Feb23 14:51] switchtec: loaded.
And after the command:
sudo modprobe ntb_transport
I get the following in dmesg:
[ +31.978234] Software Queue-Pair Transport over NTB, version 4
Those are the only messages I get in dmesg
My issue seems quite similar to https://github.com/Microsemi/switchtec-kernel/issues/106, as running the switchtec-user command switchtec -list also returns:
free(): invalid pointer
Aborted (core dumped)
Yup. Your switch is not configured to have a management or NTB endpoint, so there is no device for the drivers to attach to.
Also seems there's a bug in switchtec list.... it shouldn't be core dumping.
The core dumping appears to be addressed in this pull request: https://github.com/Microsemi/switchtec-user/pull/263
The PM40036 chip is within a Dolphin MXH930 NTB host adapter. Is the switch configuration you are referring to a HW or SW change? If SW, do you know how this configuration is managed and if it can be changed?
Yes, it's part of the firmware download. It's usually done with the Chiplink software. You might need to contact Microchip or your vendor (Dolphin) to get that set up. It's odd to me that a card designed for NTB doesn't have it configured correctly to begin with.
Does the following output confirm that the switch is configured to be an NTB endpoint? My initial assumption was that it did.
$ lspci | grep Sierra
07:00.0 PCI bridge: PMC-Sierra Inc. Device 4036
07:00.1 Bridge: PMC-Sierra Inc. Device 4036
07:00.2 System peripheral: PMC-Sierra Inc. Device 4036
08:00.0 PCI bridge: PMC-Sierra Inc. Device 4036
If the above does not confirm that the switch is configured to be an NTB endpoint, would the following command listing a device confirm it?
ls /dev/switchtec*
(Currently that command lists no such devices for me.) Is there some other way to confirm whether or not the switch is configured to be an NTB endpoint?
I tried to load the switchtec drivers (ntb.ko, switchtec.ko and ntb_hw_switchtec.ko) for a Dolphin MXP930, but they appear to fail to find the crosslink partition while enumerating the BARs.
[ 673.144040] switchtec: loading out-of-tree module taints kernel.
[ 673.144084] switchtec: module verification failed: signature and/or required key missing - tainting kernel
[ 673.144836] switchtec 0000:01:00.1: enabling device (0000 -> 0002)
[ 673.149496] switchtec switchtec0: Management device registered.
[ 673.150487] switchtec: loaded.
[ 673.391363] switchtec switchtec0: failed to register ntb device: -12
[ 1225.316886] switchtec switchtec0: unregistered.
Does anything need to be configured on the adapter (e.g. a firmware update), or do the switchtec drivers need to be modified, before I can load the drivers successfully? Please advise!
@semperrin The lspci trace doesn't tell us much. If you aren't getting a /dev/switchtec device then it's not configured for NTB and it is not configured with a management endpoint and there's not much you can do about that besides reconfigure it.
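The device-node check described above can also be scripted. Here is a minimal sketch in C using POSIX glob(3); the helper name `count_matches` is made up for illustration and is not part of switchtec-user or the driver.

```c
#include <glob.h>
#include <stddef.h>

/* Count the paths matching a shell-style pattern.
 * Returns 0 when nothing matches and -1 on any other glob() failure.
 * (Hypothetical helper, not part of any switchtec tool.) */
int count_matches(const char *pattern)
{
    glob_t results;
    int rc = glob(pattern, 0, NULL, &results);
    int count;

    if (rc == GLOB_NOMATCH)
        return 0;               /* no such device nodes */
    if (rc != 0) {
        globfree(&results);     /* release any partial results */
        return -1;
    }

    count = (int)results.gl_pathc;
    globfree(&results);
    return count;
}
```

Calling `count_matches("/dev/switchtec*")` and getting 0 corresponds to the case above: no management endpoint is exposed to the driver, so no device node is created.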
@Kendid You got a -12 error which is ENOMEM. This is not a likely error. It can happen if your system has no memory (unlikely) but it can also happen if the kernel is unable to map parts of the PCI BAR. My guess is the switch's BARs are not configured appropriately for the driver and it's trying to map a BAR that doesn't exist.
I've added some debug prints to crosslink_enum_partition().
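For reading codes like the -12 above: kernel functions conventionally return a negated errno value, so the number in dmesg can be looked up in errno.h. A small sketch of that lookup follows; the function `errno_name` is hypothetical, covers only a few codes, and relies on the Linux numbering.

```c
#include <errno.h>

/* Map a kernel-style negative return code to its errno name.
 * Only a handful of codes are covered; anything else is "unknown".
 * (Hypothetical helper for illustration.) */
const char *errno_name(int rc)
{
    switch (rc) {
    case -ENOMEM: return "ENOMEM";  /* 12 on Linux: out of memory or mapping failed */
    case -ENODEV: return "ENODEV";  /* 19 on Linux: no such device */
    case -EINVAL: return "EINVAL";  /* 22 on Linux: invalid argument */
    default:      return "unknown";
    }
}
```

With this, `errno_name(-12)` gives "ENOMEM" on Linux, matching the "failed to register ntb device: -12" line.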
lspci shows:
Region 0: Memory at b5000000 (32-bit, non-prefetchable) [disabled] [size=4M]
Region 2: Memory at a0000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 4: Memory at b0000000 (32-bit, non-prefetchable) [disabled] [size=64M]
Region 5: Memory at b4800000 (32-bit, non-prefetchable) [disabled] [size=8M]
Now dmesg returned:
[ 66.519772] switchtec switchtec0: Crosslink BAR0 addr: 0
[ 66.519800] switchtec switchtec0: Crosslink BAR2 addr: 0
[ 66.519829] switchtec switchtec0: Crosslink BAR4 addr: 0
[ 66.519832] switchtec switchtec0: Error enumerating crosslink partition
[ 66.519840] switchtec switchtec0: failed to register NTB device: -22
I don't understand why the BARs couldn't be read properly.
/dev/switchtec0 appears after the driver is loaded. Is there any way the BAR reads can be corrected or adjusted?
Now you're getting error 22 (EINVAL)? Did you change something in the error path? Maybe confirm where in the code the error is actually happening.
The fact that lspci indicates the BARs are disabled usually just means the driver isn't loaded yet.
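As far as I know, lspci prints [disabled] on a memory region when the Memory Space bit in the device's PCI command register is clear, and the kernel sets that bit when a driver enables the device. That would match the earlier log line "enabling device (0000 -> 0002)", which shows the command register before and after. A sketch of decoding that register, with macro and function names of my own choosing:

```c
#include <stdint.h>

/* Low bits of the 16-bit PCI command register (config offset 0x04). */
#define CMD_IO_SPACE   0x0001u  /* respond to I/O space accesses */
#define CMD_MEM_SPACE  0x0002u  /* respond to memory space accesses */
#define CMD_BUS_MASTER 0x0004u  /* device may initiate DMA */

/* Returns 1 when memory BARs are being decoded, 0 otherwise. */
int mem_decoding_enabled(uint16_t cmd)
{
    return (cmd & CMD_MEM_SPACE) != 0;
}

/* Returns 1 when bus mastering is enabled, 0 otherwise. */
int bus_master_enabled(uint16_t cmd)
{
    return (cmd & CMD_BUS_MASTER) != 0;
}
```

Under that reading, 0x0000 is the state where lspci would show the regions as [disabled], and 0x0002 is the state after the driver has enabled the device.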
Oh, I only added debug messages.
-22 is -EINVAL. It's returned right after "Error enumerating crosslink partition" was printed, in switchtec_ntb_init_crosslink().
You are right. The previous lspci output was captured before the drivers were loaded.
After drivers are loaded:
Region 0: Memory at b5000000 (32-bit, non-prefetchable) [size=4M]
Region 2: Memory at a0000000 (64-bit, prefetchable) [size=256M]
Region 4: Memory at b0000000 (32-bit, non-prefetchable) [size=64M]
Region 5: Memory at b4800000 (32-bit, non-prefetchable) [size=8M]
Hmmm, the enumerating cross link partitions error is likely caused by the middle partition not being configured correctly with the right type of bars.
Cross link is very tricky and needs a specific switch configuration. Microchip used to have an app note for that. You should probably get in touch with your vendor.
My card is a Dolphin MXH930. Does anything typically need to be done to it before the switchtec drivers can load, detect, and register the device with the NTB core successfully? Thanks.
No idea. I have no clue what that card is or how it's set up. You should probably contact Dolphin for support.
I see.
Where can I learn more about how Microsemi/switchtec Crosslink works, its requirements, etc.?
Regarding the following in switchtec_ntb_init_crosslink():
    if (bar_cnt < sndev->nr_direct_mw + 1) {
        dev_err(&sndev->stdev->dev,
                "Error enumerating crosslink partition\n");
        return -EINVAL;
    }
My current environment reports:
bar_cnt = 1
nr_direct_mw = 3
What approach would you recommend to debug this? Why should bar_cnt be >= nr_direct_mw + 1 in order for it to proceed? Thanks!
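To make the arithmetic of that check concrete, here it is pulled out into a standalone user-space function. The name `crosslink_bar_check` is hypothetical. My reading of the "+ 1" (an assumption, not something confirmed in this thread) is that one BAR is needed for register access on top of one BAR per direct memory window.

```c
#include <errno.h>

/* Standalone copy of the condition quoted above.
 * Assumption: the "+ 1" accounts for one BAR used for register access
 * in addition to one BAR per direct memory window.
 * Returns 0 when enough BARs were found, -EINVAL otherwise. */
int crosslink_bar_check(int bar_cnt, int nr_direct_mw)
{
    if (bar_cnt < nr_direct_mw + 1)
        return -EINVAL;
    return 0;
}
```

With the values reported above (bar_cnt = 1, nr_direct_mw = 3), the condition is 1 < 4, so the function returns -EINVAL; at least 4 usable BARs would be needed for it to pass.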
It appears crosslink config is looking for the following BAR addresses from vEP, am I correct?
1. 0x00_0000_0000
2. 0x10_0000_0000
3. 0x20_0000_0000
As I recall, Microsemi had an app note on crosslink, but as far as I know the only way to get it is through their support. If you don't have support to understand how the switch needs to be configured and to reconfigure it, I'm not sure you are going to be able to make it work at all.
Yes, cross link is looking at the configuration of the BARs in the virtual partition. It doesn't have enough bars to map the bars in the real partitions, so it just bails.
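A user-space sketch of how a bar_cnt of 1 could come out of the "Crosslink BARn addr: 0" readings, assuming the enumeration writes a distinct address (a multiple of 0x10_0000_0000) to each 64-bit BAR of the virtual partition and counts only the BARs that read the same value back. That write-then-compare behaviour is my assumption based on the debug output and the addresses listed above; `count_usable_bars` and `BAR_SPACE` are illustrative names, not the driver's.

```c
#include <stddef.h>
#include <stdint.h>

/* Address spacing between BARs, taken from the addresses listed above. */
#define BAR_SPACE 0x1000000000ULL   /* 0x10_0000_0000 */

/* readback[i] is the value 64-bit BAR i returns after BAR_SPACE * i was
 * written to it. A BAR is counted only when the written address sticks.
 * Matching addresses are stored in out[]; the count is returned.
 * (Assumed behaviour, simulated without touching any hardware.) */
int count_usable_bars(const uint64_t *readback, size_t n, uint64_t *out)
{
    int bar_cnt = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t want = BAR_SPACE * i;
        uint64_t got = readback[i] & ~0xfULL;  /* mask the low BAR flag bits */

        if (got != want)
            continue;           /* BAR absent or not writable */
        out[bar_cnt++] = got;
    }
    return bar_cnt;
}
```

If every BAR reads back 0, only index 0 matches (its expected address is also 0), giving a count of 1, which is the bar_cnt reported earlier. If that model is right, the BARs at 0x10_0000_0000 and 0x20_0000_0000 are simply not present in the virtual partition's configuration.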
I see. Thank you Logan for the confirmation and info!
So if server A wants to share an NVMe SSD with server B, can the NTB adapters on server A and server B both have the same configuration/firmware? Or should they be different since the servers' roles are different? Thanks.
I'm not aware of any solutions for NVMe sharing that are not complicated and proprietary. So you're pretty much on your own if you want to implement something like that.
The point of cross link was to make both machines symmetric (so cases where each machine has a switch and the switches connect to each other). In these cases the configuration for each switch should be identical.
Just as it's possible to write a driver to share the NVMe drive between two partitions on a single switch, it should, at least theoretically, be possible to share an NVMe drive over a cross link setup. Though, I don't know if anyone has actually ever tried this.
Okie. Thanks Logan!
I tried to look up more info on Microsemi ChipLink but I couldn't find anything about it. Has it been rebranded as another product or something?
Hi Kendidi,
Microsemi was purchased by Microchip. You can find info on ChipLink on Microchip’s YouTube channel and on their web site. Some resources will be restricted to customers.
From: Kendidi @.> Sent: Tuesday, April 5, 2022 11:56 AM To: Microsemi/switchtec-kernel @.> Cc: Subscribed @.***> Subject: Re: [Microsemi/switchtec-kernel] No such device error (Issue #114)
Thanks @jborz27 ! I will see where it can be downloaded.
Any suggestions on where and how to get Microsemi/Microchip Chiplink?
I am trying to transport data using IPoPCI via ntb_netdev. I am following the general outline given in step 4 of the "Non-Crosslink NTB connection for Linux" section: https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/System%20Programming/sys_components_non_transparent_bridging.html One machine is running Ubuntu 18.04 and has a PM40036 and the other machine is running Ubuntu 16.04 and has a PM8534 switch.
I am trying to load the kernel modules as follows.
When I load
ntb_transport
I get the first line:
However, I do not get the second line:
Then if I try to load
ntb_netdev
I get the following:
Can you provide me with any information on what may be causing this error?