Microsemi / switchtec-kernel

A kernel module for the Microsemi PCIe switch
GNU General Public License v2.0

No such device error #114

Open semperrin opened 2 years ago

semperrin commented 2 years ago

I am trying to transport data using IPoPCI via ntb_netdev. I am following the general outline given in step 4 of the "Non-Crosslink NTB connection for Linux" section: https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/System%20Programming/sys_components_non_transparent_bridging.html One machine is running Ubuntu 18.04 and has a PM40036 and the other machine is running Ubuntu 16.04 and has a PM8534 switch.

I am trying to load the kernel modules as follows:

modprobe ntb
modprobe switchtec
modprobe ntb_hw_switchtec
modprobe ntb_transport
modprobe ntb_netdev

When I load ntb_transport I get the first line:

[  559.033375] Software Queue-Pair Transport over NTB, version 4

However, I do not get the second line:

[  559.034097] switchtec switchtec0: ntb link up

Then if I try to load ntb_netdev I get the following:

modprobe: ERROR: could not insert 'ntb_netdev': No such device

Can you provide me with any information on what may be causing this error?
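The load sequence above can be wrapped in a small helper that stops at the first module that fails, which makes it easier to see which step breaks. This is only a sketch: `load_in_order` is a hypothetical name, and the loader command is passed in as the first argument so the function can be tried without touching the kernel.

```shell
# Run the given loader command once per module, in order, and stop at the
# first failure so the failing module is obvious.
# load_in_order is a hypothetical helper name, not part of any switchtec tool.
load_in_order() {
    loader=$1
    shift
    for module in "$@"; do
        if ! "$loader" "$module"; then
            echo "failed to load: $module"
            return 1
        fi
        echo "loaded: $module"
    done
}
```

Run as root, the call would be `load_in_order modprobe ntb switchtec ntb_hw_switchtec ntb_transport ntb_netdev`.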

lsgunth commented 2 years ago

Hmmm, it sounds like there may not be an NTB device. Could be a configuration issue with one of the switches or a number of other issues. Do you see anything in dmesg related to switchtec?

semperrin commented 2 years ago

After the command:

sudo modprobe switchtec

I get the following in dmesg:

[Feb23 14:51] switchtec: loaded.

And after the command:

sudo modprobe ntb_transport

I get the following in dmesg:

[ +31.978234] Software Queue-Pair Transport over NTB, version 4

Those are the only messages I get in dmesg

semperrin commented 2 years ago

My issue seems quite similar to https://github.com/Microsemi/switchtec-kernel/issues/106, as running the switchtec-user command switchtec -list also returns:

free(): invalid pointer
Aborted (core dumped)

lsgunth commented 2 years ago

Yup. Your switch is not configured to have a management or NTB endpoint so there is no device for the drivers to attach to.

Also seems there's a bug in switchtec list.... it shouldn't be core dumping.

semperrin commented 2 years ago

The core dumping appears to be addressed in this pull request: https://github.com/Microsemi/switchtec-user/pull/263

semperrin commented 2 years ago

The PM40036 chip is within a Dolphin MXH930 NTB host adapter. Is the switch configuration you are referring to a HW or SW change? If SW, do you know how this configuration is managed and if it can be changed?

lsgunth commented 2 years ago

Yes, it's part of the firmware download. It's usually done with the ChipLink software. You might need to contact Microchip or your vendor (Dolphin) to get that set up. It's odd to me that a card designed for NTB doesn't have it configured correctly to begin with.

semperrin commented 2 years ago

Does the following output confirm that the switch is configured to be an NTB endpoint? My initial assumption was that it did.

$ lspci | grep Sierra
07:00.0 PCI bridge: PMC-Sierra Inc. Device 4036
07:00.1 Bridge: PMC-Sierra Inc. Device 4036
07:00.2 System peripheral: PMC-Sierra Inc. Device 4036
08:00.0 PCI bridge: PMC-Sierra Inc. Device 4036

If the above does not confirm that the switch is configured to be an NTB endpoint, would:

ls /dev/switchtec*

listing a device confirm this? (Currently that command lists no such devices for me.) Is there some other way to confirm whether the switch is configured to be an NTB endpoint?

Kendidi commented 2 years ago

I tried to load the switchtec drivers (ntb.ko, switchtec.ko and ntb_hw_switchtec.ko) for a Dolphin MXP930, but it appears to fail to find the crosslink partition while enumerating the BARs.

[ 673.144040] switchtec: loading out-of-tree module taints kernel.
[ 673.144084] switchtec: module verification failed: signature and/or required key missing - tainting kernel
[ 673.144836] switchtec 0000:01:00.1: enabling device (0000 -> 0002)
[ 673.149496] switchtec switchtec0: Management device registered.
[ 673.150487] switchtec: loaded.
[ 673.391363] switchtec switchtec0: failed to register ntb device: -12
[ 1225.316886] switchtec switchtec0: unregistered.

Does anything need to be configured on the adapter (e.g. a firmware update), or do the switchtec drivers need to be modified, before I can load the drivers successfully? Please advise!

lsgunth commented 2 years ago

@semperrin The lspci trace doesn't tell us much. If you aren't getting a /dev/switchtec device then it's not configured for NTB and it is not configured with a management endpoint and there's not much you can do about that besides reconfigure it.
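A quick way to apply that test is to count the switchtec device nodes. The sketch below assumes the nodes are named `switchtec0`, `switchtec1`, and so on under `/dev`, as in this thread; `count_switchtec_nodes` is a hypothetical helper, and the directory argument exists only so it can be tried against a scratch directory.

```shell
# Print how many switchtec device nodes exist under the given directory
# (defaults to /dev). count_switchtec_nodes is a hypothetical helper name.
count_switchtec_nodes() {
    dir=${1:-/dev}
    count=0
    for node in "$dir"/switchtec*; do
        # An unmatched glob stays literal, so confirm the path really exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

If `count_switchtec_nodes` prints 0 after `modprobe switchtec`, the driver found no management endpoint to bind to.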

@Kendidi You got a -12 error which is ENOMEM. This is not a likely error. It can happen if your system has no memory (unlikely) but it can also happen if the kernel is unable to map parts of the PCI bar. My guess is the switch's BARs are not configured appropriately for the driver and it's trying to map a bar that doesn't exist.
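For reading such codes in dmesg, a tiny lookup can help. This is a sketch with a hypothetical `errno_name` helper that covers only three standard Linux errno values: 12 (ENOMEM, as above), 19 (ENODEV, which is what modprobe reports as "No such device"), and 22 (EINVAL). The kernel prints them negated, so a leading minus sign is stripped.

```shell
# Map a few Linux errno numbers to their symbolic names.
# Accepts either the positive value or the negated form seen in dmesg.
# errno_name is a hypothetical helper; the table is deliberately incomplete.
errno_name() {
    code=${1#-}
    case "$code" in
        12) echo ENOMEM ;;
        19) echo ENODEV ;;
        22) echo EINVAL ;;
        *)  echo "unknown ($code)" ;;
    esac
}
```

For example, `errno_name -12` prints `ENOMEM`.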

Kendidi commented 2 years ago

I've added some debug prints to crosslink_enum_partition().

lspci shows:

Region 0: Memory at b5000000 (32-bit, non-prefetchable) [disabled] [size=4M]
Region 2: Memory at a0000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 4: Memory at b0000000 (32-bit, non-prefetchable) [disabled] [size=64M]
Region 5: Memory at b4800000 (32-bit, non-prefetchable) [disabled] [size=8M]

Now dmesg returned:

[ 66.519772] switchtec switchtec0: Crosslink BAR0 addr: 0
[ 66.519800] switchtec switchtec0: Crosslink BAR2 addr: 0
[ 66.519829] switchtec switchtec0: Crosslink BAR4 addr: 0
[ 66.519832] switchtec switchtec0: Error enumerating crosslink partition
[ 66.519840] switchtec switchtec0: failed to register NTB device: -22

I don't understand why the BARs couldn't be read properly.

/dev/switchtec0 appears after the driver is loaded. Is there any way the BAR readings can be corrected or adjusted?

lsgunth commented 2 years ago

Now you're getting error 22 (EINVAL)? Did you change something in the error path? Maybe confirm where in the code the error is actually happening.

The fact that lspci indicates the bars are disabled usually just means the driver isn't loaded yet.
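To compare the before and after states, the `[disabled]` markers in verbose lspci output can be counted. A minimal sketch, assuming the `Region N: ... [disabled]` line format quoted in this thread; `count_disabled_regions` is a hypothetical helper that reads the lspci text on standard input.

```shell
# Count the BAR regions that lspci marks as [disabled].
# Reads `lspci -v` style text on stdin; count_disabled_regions is a
# hypothetical helper name.
count_disabled_regions() {
    # grep -c prints 0 but exits non-zero when nothing matches, so the
    # exit status is ignored and only the printed count is used.
    grep -c 'Region .*\[disabled\]' || true
}
```

Usage would look like `lspci -v -s 01:00.1 | count_disabled_regions`, run once before and once after loading the driver.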

Kendidi commented 2 years ago

Oh, I only added debug messages.

-22 is -EINVAL. It's returned right after "Error enumerating crosslink partition" was printed, in switchtec_ntb_init_crosslink().

You are right. The previous lspci output was captured before the drivers were loaded.

After drivers are loaded:

Region 0: Memory at b5000000 (32-bit, non-prefetchable) [size=4M]
Region 2: Memory at a0000000 (64-bit, prefetchable) [size=256M]
Region 4: Memory at b0000000 (32-bit, non-prefetchable) [size=64M]
Region 5: Memory at b4800000 (32-bit, non-prefetchable) [size=8M]

lsgunth commented 2 years ago

Hmmm, the enumerating cross link partitions error is likely caused by the middle partition not being configured correctly with the right type of bars.

Cross link is very tricky and needs a specific switch configuration. Microchip used to have an app note for that. You should probably get in touch with your vendor.

Kendidi commented 2 years ago

My card is a Dolphin MXH930. Does anything typically need to be done to it before the switchtec drivers can load, detect the device, and register it with the NTB core successfully? Thanks.

lsgunth commented 2 years ago

No idea. I have no clue what that card is or how it's set up. You should probably contact Dolphin for support.

Kendidi commented 2 years ago

I see.

Where can I learn more about how Microsemi/switchtec crosslink works, its requirements, etc.?

Regarding the following in switchtec_ntb_init_crosslink():

    if (bar_cnt < sndev->nr_direct_mw + 1) {
        dev_err(&sndev->stdev->dev,
            "Error enumerating crosslink partition\n");
        return -EINVAL;
    }

My current environment reports:

  bar_cnt = 1
  nr_direct_mw = 3

What approach would you recommend to debug this? Why must bar_cnt be >= nr_direct_mw + 1 in order for it to proceed? Thanks!

Kendidi commented 2 years ago

It appears crosslink config is looking for the following BAR addresses from vEP, am I correct?

1.  0x00_0000_0000
2.  0x10_0000_0000
3.  0x20_0000_0000

lsgunth commented 2 years ago

As I recall, Microsemi had an app note on crosslink, but as far as I know the only way to get it is through their support. If you don't have support to understand how the switch needs to be configured and to reconfigure it, I'm not sure you are going to be able to make it work at all.

Yes, cross link is looking at the configuration of the BARs in the virtual partition. It doesn't have enough bars to map the bars in the real partitions, so it just bails.
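The condition can be replayed with the numbers reported earlier in the thread (bar_cnt = 1, nr_direct_mw = 3). This sketch only mirrors the arithmetic of the quoted driver check; `crosslink_bars_sufficient` is a hypothetical helper, and what the extra "+ 1" BAR is used for is not stated in this thread.

```shell
# Succeed only when the crosslink partition exposes enough BARs, i.e.
# bar_cnt >= nr_direct_mw + 1, mirroring the driver check quoted earlier.
# crosslink_bars_sufficient is a hypothetical helper name.
crosslink_bars_sufficient() {
    bar_cnt=$1
    nr_direct_mw=$2
    [ "$bar_cnt" -ge $((nr_direct_mw + 1)) ]
}
```

With the reported values, `crosslink_bars_sufficient 1 3` fails, since three direct memory windows require at least four usable BARs in the virtual partition.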

Kendidi commented 2 years ago

I see. Thank you Logan for the confirmation and info!

So if server A wants to share an NVMe SSD with server B, can the NTB adapters on server A and server B have the same configuration/firmware? Or should they be different since the servers' roles are different? Thanks.

lsgunth commented 2 years ago

I'm not aware of any solutions for NVMe sharing that are not complicated and proprietary. So you're pretty much on your own if you want to implement something like that.

The point of cross link was to make both machines symmetric (so cases where each machine has a switch and the switches connect to each other). In these cases the configuration for each switch should be identical.

Just as it's possible to write a driver to share the NVMe drive between two partitions on a single switch, it should, at least theoretically, be possible to share an NVMe drive over a cross link setup. Though, I don't know if anyone has actually ever tried this.

Kendidi commented 2 years ago

Okie. Thanks Logan!

I tried to look up more info on Microsemi ChipLink but I couldn't find anything about it. Has it been rebranded as another product or something?

jborz27 commented 2 years ago

Hi Kendidi,

Microsemi was purchased by Microchip. You can find info on ChipLink on Microchip’s YouTube channel and on their web site. Some resources will be restricted to customers.

Kendidi commented 2 years ago

Thanks @jborz27! I will see where it can be downloaded.

Kendidi commented 2 years ago

Any suggestions on where to and how to get Microsemi/Microchip Chiplink?