dentproject / dentOS

dentOS SwitchDev based NOS

Aldrin2 Firmware 3.0.1: NO-CARRIER even when everything else indicates link #152

Open aep opened 2 years ago

aep commented 2 years ago

i'm puzzled if this is a bug in firmware 3.0.1 or maybe an installation error. after moving an AS5114-48X, all links are down. SFP modules are detected:

[   66.044885] sfp sfp-1: module OEM              AXS85-192-M3     rev A    sn CSG101KA5702     dc 201024

ethtool says it's up, but ip link still shows NO-CARRIER. consequently all the routes are dead.

ethtool -S says it's receiving packets just fine, but not sending any (probably since it's marked no-carrier)

root@localhost:~# ethtool -S swp1
NIC statistics:
     good_octets_received: 3864300
     good_octets_sent: 0

unfortunately i never tested a cold boot before deploying. fw3.1 seemed to work fine in my tests where i plugged in the SFPs after boot.

how does carrier detection work, and is there a possibility i need to do something other than "ip link set up" for it to work?

paulmenzel commented 2 years ago

Is it firmware version 3.1 or 3.0.1?

aep commented 2 years ago

sorry, yes 3.0.1.

btw /sys/class/net/swp1/operstate is still down after ip link set up. not sure if that indicates something
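
For comparing what the kernel itself reports for a port, the sysfs attributes can be read directly. The following is a minimal sketch, not part of dentOS: the helper name `link_state` and its `sysfs_root` parameter are made up for illustration, while `operstate` and `carrier` are the standard Linux attributes under `/sys/class/net/<ifname>/`.

```python
from pathlib import Path


def link_state(ifname, sysfs_root="/sys/class/net"):
    """Return (operstate, carrier) for one interface as reported by sysfs.

    carrier is True/False, or None when the attribute cannot be read
    (the kernel rejects the read while the interface is administratively down).
    """
    base = Path(sysfs_root) / ifname
    operstate = (base / "operstate").read_text().strip()
    try:
        carrier = (base / "carrier").read_text().strip() == "1"
    except OSError:
        carrier = None
    return operstate, carrier
```

On the switch, `link_state("swp1")` would be expected to return something like `("down", False)` for a port in the state described here, although that has not been verified on this hardware.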

paulmenzel commented 2 years ago

For correctness, please edit/update the title and original report.

PS: Fingers crossed, that Marvell and plvision.eu folks are going to help you quickly.

sonoble commented 2 years ago

There is a patch for 3.1.0 rc1 on the marvell-switching GitHub if you are willing to try that. Do you know what the OS/SDK version on the a385 (firmware) CPU is?

On Thu, Nov 18, 2021, 2:15 AM Paul Menzel @.***> wrote:

For correctness, please edit/update the title and original report.

PS: Fingers crossed, that Marvell and plvision.eu folks are going to help you quickly.


aep commented 2 years ago

There is a patch for 3.1.0 rc1 on the marvell-switching GitHub

don't think it's the firmware after all. i downgraded to 2.8.0 and still have the same issue.

Do you know what the OS/SDK version on the a385

how do i find out? i can access uboot. or do you mean dentos? currently trying revision 3480acea, before the 3.0 update.

additional discovery: i can even see traffic being trapped to kernel via tcpdump, but there are no outgoing packets, possibly because the route is marked linkdown

10.100.10.0/24 dev swp17 proto kernel scope link src 10.100.10.17 offload linkdown 

a layer2 bridge, which should be entirely within the asic, also doesn't forward or learn anything

unfortunately the driver is confusing to read since it's a large patch file. the only time it changes carrier state might be in mvsw_pr_port_handle_event, which originates in the binary blob.

but i'm currently trying to find the origin of this message "[ 420.757989] Aldrin2 0000:01:00.0 swp20: configuring for inband/10gbase-r link mode"

taraschornyiplv commented 2 years ago

can you pls provide the output of onlpdump? also can you please test the DAC cable?

aep commented 2 years ago

onlpdump

see attached

onldump.txt

also can you please test the DAC cable?

unfortunately i can't. this device is already deployed. however, i had a technician loop one of the fibers from port 47 to port 48 and the modules happily report receiving a signal:

root@localhost:~# ethtool -m swp47 | grep -i rece
    Receiver signal average optical power     : 0.5889 mW / -2.30 dBm
root@localhost:~# ethtool -m swp48 | grep -i rece
    Receiver signal average optical power     : 0.7221 mW / -1.41 dBm
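
ethtool prints each optical power reading twice, in mW and in dBm. The two are related by dBm = 10 * log10(P / 1 mW). A small sketch (the helper name `mw_to_dbm` is an illustration, not an ethtool function) that recomputes the dBm figures from the mW readings above:

```python
import math


def mw_to_dbm(milliwatts):
    """Convert optical power from milliwatts to dBm (decibels relative to 1 mW)."""
    if milliwatts <= 0:
        raise ValueError("power must be positive to express in dBm")
    return 10.0 * math.log10(milliwatts)


# Receiver power readings reported by ethtool -m for swp47 and swp48.
for port, mw in (("swp47", 0.5889), ("swp48", 0.7221)):
    print(f"{port}: {mw:.4f} mW = {mw_to_dbm(mw):.2f} dBm")
```

Both results agree with the dBm values ethtool printed, which supports the reading that the modules really do see light.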

full module output as text here

root@localhost:~# ethtool -m swp47
    Identifier                                : 0x03 (SFP)
    Extended identifier                       : 0x04 (GBIC/SFP defined by 2-wire interface ID)
    Connector                                 : 0x07 (LC)
    Transceiver codes                         : 0x10 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
    Transceiver type                          : 10G Ethernet: 10G Base-SR
    Encoding                                  : 0x06 (64B/66B)
    BR, Nominal                               : 10300MBd
    Rate identifier                           : 0x00 (unspecified)
    Length (SMF,km)                           : 0km
    Length (SMF)                              : 0m
    Length (50um)                             : 80m
    Length (62.5um)                           : 20m
    Length (Copper)                           : 0m
    Length (OM3)                              : 300m
    Laser wavelength                          : 850nm
    Vendor name                               : OEM
    Vendor OUI                                : 00:90:65
    Vendor PN                                 : AXS85-192-M3
    Vendor rev                                : A
    Option values                             : 0x00 0x1a
    Option                                    : RX_LOS implemented
    Option                                    : TX_FAULT implemented
    Option                                    : TX_DISABLE implemented
    BR margin, max                            : 0%
    BR margin, min                            : 0%
    Vendor SN                                 : CSG101KA5706
    Date code                                 : 201024
    Optical diagnostics support               : Yes
    Laser bias current                        : 6.888 mA
    Laser output power                        : 0.5330 mW / -2.73 dBm
    Receiver signal average optical power     : 0.5964 mW / -2.24 dBm
    Module temperature                        : 27.16 degrees C / 80.88 degrees F
    Module voltage                            : 3.4092 V
    Alarm/warning flags implemented           : Yes
    Laser bias current high alarm             : Off
    Laser bias current low alarm              : Off
    Laser bias current high warning           : Off
    Laser bias current low warning            : Off
    Laser output power high alarm             : Off
    Laser output power low alarm              : Off
    Laser output power high warning           : Off
    Laser output power low warning            : Off
    Module temperature high alarm             : Off
    Module temperature low alarm              : Off
    Module temperature high warning           : Off
    Module temperature low warning            : Off
    Module voltage high alarm                 : Off
    Module voltage low alarm                  : Off
    Module voltage high warning               : Off
    Module voltage low warning                : Off
    Laser rx power high alarm                 : Off
    Laser rx power low alarm                  : Off
    Laser rx power high warning               : Off
    Laser rx power low warning                : Off
    Laser bias current high alarm threshold   : 15.000 mA
    Laser bias current low alarm threshold    : 1.000 mA
    Laser bias current high warning threshold : 13.000 mA
    Laser bias current low warning threshold  : 2.000 mA
    Laser output power high alarm threshold   : 2.5118 mW / 4.00 dBm
    Laser output power low alarm threshold    : 0.1258 mW / -9.00 dBm
    Laser output power high warning threshold : 1.9952 mW / 3.00 dBm
    Laser output power low warning threshold  : 0.1584 mW / -8.00 dBm
    Module temperature high alarm threshold   : 90.00 degrees C / 194.00 degrees F
    Module temperature low alarm threshold    : -10.00 degrees C / 14.00 degrees F
    Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F
    Module temperature low warning threshold  : -5.00 degrees C / 23.00 degrees F
    Module voltage high alarm threshold       : 3.6000 V
    Module voltage low alarm threshold        : 2.9000 V
    Module voltage high warning threshold     : 3.5000 V
    Module voltage low warning threshold      : 3.0000 V
    Laser rx power high alarm threshold       : 3.1622 mW / 5.00 dBm
    Laser rx power low alarm threshold        : 0.0199 mW / -17.01 dBm
    Laser rx power high warning threshold     : 1.9952 mW / 3.00 dBm
    Laser rx power low warning threshold      : 0.0316 mW / -15.00 dBm

aep commented 2 years ago

err....

running onlpdump makes the links come up.

[  253.028020] Aldrin2 0000:01:00.0 swp47: Link is Up - 10Gbps/Full - flow control off
[  253.030557] Aldrin2 0000:01:00.0 swp48: Link is Up - 10Gbps/Full - flow control off
[  253.035743] IPv6: ADDRCONF(NETDEV_CHANGE): swp47: link becomes ready
[  253.050293] IPv6: ADDRCONF(NETDEV_CHANGE): swp48: link becomes ready

i tested this 3 times now to make extra sure i'm not imagining it.

  1. cold boot
  2. ip link set swp47 up
  3. observe that it is linkdown
  4. onlpdump
  5. ip link now shows LOWER_UP and the link works fine
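
The check in steps 3 and 5 can be scripted by looking at the flag field of `ip -o link show <port>`. This is only a sketch under assumptions: the helper names are made up, and the two sample lines are illustrative rather than captured from the affected switch.

```python
import re


def link_flags(ip_link_line):
    """Extract the flag set from one line of `ip -o link show` output."""
    match = re.search(r"<([^>]*)>", ip_link_line)
    if match is None:
        raise ValueError("no <FLAGS> field found")
    return {flag for flag in match.group(1).split(",") if flag}


def has_carrier(ip_link_line):
    """True when the kernel reports carrier (LOWER_UP set, NO-CARRIER absent)."""
    flags = link_flags(ip_link_line)
    return "LOWER_UP" in flags and "NO-CARRIER" not in flags


# Illustrative lines, not captured from the affected switch.
before = "52: swp47: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN"
after = "52: swp47: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP"
print(has_carrier(before), has_carrier(after))
```

Running `has_carrier` on the port's line before and after `onlpdump` would make the reproduction repeatable across cold boots.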

sonoble commented 2 years ago

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

paulmenzel commented 2 years ago

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

Thank you for escalating it.

Sorry OT: It’d be great, if you used the public bug trackers, so that the whole community can benefit and participate.

jmpolom commented 2 years ago

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

@sonoble where was this bug filed?

This really needs to be a more transparent process since this is purportedly an open source project. I do not see this issue on the public switchdev-prestera tracker.

This is going to become a major usability issue that prevents further adoption of DENT offerings. A better solution is required here. Users need a way to directly engage with those who develop critical pieces of software and the Prestera driver is certainly one of them. How do we work with Marvell to open things up here?

cc: @storrgie @trishan @lperkov

Mickey201 commented 2 years ago

Hi @jmpolom, I think you are jumping to the wrong conclusion.

Hi @aep, the scenario you specified sounds familiar. Adding Accton team member: @richardlee66 We suspect you might have an old CPLD version. Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

Marvell>> i2c dev 2

Major CPLD number:

Marvell>> i2c md 0x40 01 1
0001: 01

Minor CPLD number:

Marvell>> i2c md 0x40 ff 1
00ff: 03

Per the example above, the CPLD version is 1.03. The most updated CPLD version used in the Marvell lab is 1.09. If you have an older version, please contact @richardlee66 from the Accton team.
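
The two register reads can be combined into a version string mechanically. The sketch below assumes the "<offset>: <byte>" layout of the U-Boot `i2c md` output shown in this thread, and it assumes the minor number is printed as two hex digits (so 0x03 becomes "03"). The helper names are hypothetical.

```python
import re


def parse_md_byte(line):
    """Return the first data byte from a U-Boot `i2c md` output line.

    Expects the "<offset>: <byte> ..." layout shown in this thread, e.g. "0001: 01".
    """
    match = re.match(r"\s*[0-9a-fA-F]+:\s*([0-9a-fA-F]{2})\b", line)
    if match is None:
        raise ValueError(f"unrecognised i2c md line: {line!r}")
    return int(match.group(1), 16)


def cpld_version(major_line, minor_line):
    """Format the version as major.minor with a two-digit hex minor (assumed)."""
    return f"{parse_md_byte(major_line):x}.{parse_md_byte(minor_line):02x}"


print(cpld_version("0001: 01", "00ff: 03"))
```

For the example reads above this prints 1.03, matching the version stated in the procedure.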

paulmenzel commented 2 years ago

Hi @jmpolom, I think you are jumping to the wrong conclusion.

Why? If it’s a known bug, why isn’t the problem and solution documented? What is the problem actually with older CPLD versions?

Hi @aep, the scenario you specified sounds familiar. Adding Accton team member: @richardlee66 We suspect you might have an old CPLD version. Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

If there are known problems, why does the Linux kernel driver not check the CPLD version, and warn about it in the log files?

[…]

Mickey201 commented 2 years ago

@paulmenzel, assuming my concern about the CPLD is right - this is not a bug. It means @aep probably has a platform with ENG CPLD image. We should ask the Accton team how they manage their platform's support - but defining it as a Marvell Switchdev Driver bug is wrong.

If there are known problems, why does the Linux kernel driver not check the CPLD version, and warn about it in the log files?

The CPLD Driver is a platform driver handled by the Accton team - please consult with them. Marvell Switchdev driver has no direct interface with the CPLD.

jmpolom commented 2 years ago

Hi @jmpolom, I think you are jumping to the wrong conclusion.

Why? If it’s a known bug, why isn’t the problem and solution documented? What is the problem actually with older CPLD versions?

My comment was based on earlier comments seeming to suggest a driver issue and also the comment from @sonoble suggesting a bug was filed somewhere that isn’t public. Maybe that is not accurate but I’d like to see some explanation either way.

Generally there hasn’t been a specifically identified support entry point for the Marvell-based DENT platforms. IE: if you have a hardware issue, where does a user ask for help? It has seemed to default to this issue tracker but that really is quite messy and should be better thought out. This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS. It lumps a ton of disparate things into one bin and will become increasingly more difficult/painful/undesirable to interact with.

aep commented 2 years ago

Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

Marvell>> i2c dev 2
Setting bus to 2
Marvell>> i2c md 0x40 01 1
0001: 01    .
Marvell>>  i2c md 0x40 ff 1
00ff: 05    .

It means @aep probably has a platform with ENG CPLD image.

this is a regular production device from Accton. If dentos only wants to support specific revisions of hardware, it would be nice to have that documented, so we can purchase the correct revision in the future.

This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS

they're the same thing. Unless you're specifically suggesting that dentos doesn't accept outside contributions, which would explain trivial bugfix PRs being ignored.

Again, i would really appreciate if the purpose of dent is better documented. The overall tone appears to be that this is actually internal to some corporate agreement rather than for general use. Otherwise we'll have to find workarounds for the silicon that happens to be out there, as we traditionally do in linux.

sonoble commented 2 years ago

Hi Jon,

Yes my response was that we had seen something similar, but we had to confirm it was fixed in the firmware. Mickey and others were able to determine that the issue was in the cpld and we had good results testing yesterday. I apologize for not immediately updating the ticket.

On Tue, Nov 23, 2021, 8:22 AM Jon Polom @.***> wrote:

Hi @jmpolom, I think you are jumping to the wrong conclusion.

Why? If it’s a known bug, why isn’t the problem and solution documented? What is the problem actually with older CPLD versions?

My comment was based on earlier comments seeming to suggest a driver issue and also the comment from @sonoble suggesting a bug was filed somewhere that isn’t public. Maybe that is not accurate but I’d like to see some explanation either way.

Generally there hasn’t been a specifically identified support entry point for the Marvell-based DENT platforms. IE: if you have a hardware issue, where does a user ask for help? It has seemed to default to this issue tracker but that really is quite messy and should be better thought out. This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS. It lumps a ton of disparate things into one bin and will become increasingly more difficult/painful/undesirable to interact with.


sonoble commented 2 years ago

Hi Arvid,

There is no issue with the hardware, it is an issue with the cpld version. The cpld software is supported by the odm. Do you have a contact at Accton/edge-core that you can work with or a reseller?

The suggestion is to update to the latest cpld. Please note this is not common but bugs may not be exposed until new features are implemented. The cpld update procedure is straightforward and is done from onie, it will not modify the dentos installation or configuration.

On Tue, Nov 23, 2021, 9:04 AM Arvid E. Picciani @.***> wrote:

Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

Marvell>> i2c dev 2
Setting bus to 2
Marvell>> i2c md 0x40 01 1
0001: 01    .
Marvell>>  i2c md 0x40 ff 1
00ff: 05    .

It means @aep probably has a platform with ENG CPLD image.

this is a regular production device from Accton. If dentos only wants to support specific revisions of hardware, it would be nice to have that documented, so we can purchase the correct revision in the future.

This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS

they're the same thing. Unless you're specifically suggesting that dentos doesn't accept outside contributions, which would explain trivial bugfix PRs being ignored.

Again, i would really appreciate if the purpose of dent is better documented. The overall tone appears to be that this is actually internal to some corporate agreement rather than for general use. Otherwise we'll have to find workarounds for the silicon that happens to be out there, as we traditionally do in linux.


jmpolom commented 2 years ago

This issue really highlights some major deficiencies with the DENT Project that need to be resolved sooner rather than later if we want users to stick around. I see the following as questions that need to be answered:

Again, i would really appreciate if the purpose of dent is better documented. The overall tone appears to be that this is actually internal to some corporate agreement rather than for general use. Otherwise we'll have to find workarounds for the silicon that happens to be out there, as we traditionally do in linux.

I’d generally agree that the feeling the rest of us users are left with is that we are simply bystanders to someone else’s corporate objectives. We do not have clearly identified avenues for support with many of the major players here and it leads to a pretty lousy experience. This must be improved or we risk alienating existing users and denying future ones.

aep commented 2 years ago

Do you have a contact at Accton/edge-core that you can work with or a reseller?

I have requested support from the reseller, but accton is not a brand that cares about longevity of their products. We only buy from them because it's the only dentos device available right now.

The suggestion is to update to the latest cpld.

Can we collect the CPLD images in a repo, similar to how we do it for firmware? The e-waste problem here is rather tragic if getting dent to work requires vendor support.

taraschornyiplv commented 2 years ago

@aep, until you get an updated cpld image, try building an image w/o this commit.

paulmenzel commented 2 years ago

Commit fee5b08fc6 (Modify RX loss active high for CPLD RX loss definition correction) has no commit message body describing the problem and the fix. It also does not mention the side effect with older CPLD versions.

Mickey201 commented 2 years ago

You're right @paulmenzel, I think Taras only wanted to help @aep get the system working until Edge-Core engages. @richardlee66, @brandonchuang, please join this discussion and provide your insight about Edge-core support plans.

aep commented 2 years ago

Reseller responded that Accton does not release updated CPLD images for the dentos line, so these machines are dead on arrival unless the dentos community can somehow agree on a workaround.

This line in the offending commit unfortunately confirms the sentiment of the rest of this thread.

+ It is currently not required for Amazon 'ethtool -m' support but it is intended for future use.

We could maintain a community fork of dentos that works outside of Amazon, but i'm not sure if there's even interest.

demliu commented 2 years ago

@aep Our Edgecore engineers looked at this issue. They are not sure this is a CPLD issue. The latest CPLD version is here: https://accpartner.accton.com/sites/csp/ONIE/AS5114-48X/CPLD/as5114-bptfr_v01.01.09h_as4224-bptfr_v0c.02.0ah.updater This link needs customer log in. It’s an ONIE updater; you can install it with the ONIE upgrade method from the ONIE system.

aep commented 2 years ago

Our Edgecore engineers looked at this issue

thank you @demliu , i'm sure you realize this is not a particularly useful response, since there's nothing anyone outside of Accton can do about it. The CPLD isn't documented, and the sources aren't available. The commit that broke it is Accton specific. I'll happily help debugging the issue here, in public.

This link needs customer log in.

Please make this publicly available, or fix dentos main to work with all devices you shipped.

This open source project won't work if basic functionality requires an NDA. But if dentos doesn't work, there's no benefit to building datacenters with Accton devices in the first place. This is literally the only reason i'm pushing for giving Accton a chance again.

aep commented 2 years ago

@demliu

I now received the CPLD from the reseller after pressure from Marvell. This might fix my issue but i'm not willing to purchase more devices until edge-core commits to making them publicly available. Dentos cannot be successful if this level of escalation is required for everyone participating. If dentos doesn't work we will stick to Cisco, who have great support.

Please make them publicly available or commit to a stable ABI.

jmpolom commented 2 years ago

I now received the CPLD from the reseller after pressure from Marvell. This might fix my issue but i'm not willing to purchase more devices until edge-core commits to making them publicly available.

Is this an update to the CPLD firmware itself that you received?

demliu commented 2 years ago

@aep Thank you for the feedback! Edgecore is not yet publicly open to offer tech support for DENT. As a customer, you can still get the available services from the tech support team. We appreciate your feedback about DENT and will let our team know.

paulmenzel commented 1 year ago

@aep, thank you again very much for debugging the issue. We also seem to have run into it on one device with Aldrin2 firmware 3.0.1 and CPLD firmware 1.05, same as you. One of our three devices started to show this issue once: no restart yet, and the link just dropped. Before that it worked fine. For you it was a little different, right? None of the ports (besides management) ever worked, did they?

Before we go through updating our devices, did you apply the update, and did it fix the issue?

aep commented 1 year ago

yes, the switch doesn't come up without the binary blobs. you need to have an exact match between dentos and the binaries, which aren't public, and we don't know which ones dentos devs test.

DENTOS never made it out of the lab unfortunately. We're too small to make a single vendor (Accton) give us the required binary blobs. Only through pressure from Marvell did they give us anything. Once. The second potential vendor (Delta) doesn't even want to sell us anything.

As to your question on the ML: I think the CPLD is just for board specific things like pinouts, power, idk. They could probably just open source it if they wanted to.

If there was a large enough community, i'm sure we could convince Marvell to just sell us the chip. the rest of the board is trivial. but i'm not seeing any traction that would make that a compelling argument. And if there was a relevant community, Accton would probably also be convinced to just publish the blobs. TLDR: unless you're Facebook, give up.