Microchip-Ethernet / EVB-KSZ9477

Repository for using Microchip EVB-KSZ9477 board. Product Supported: KSZ9477, KSZ9567, KSZ9897, KSZ9896, KSZ8567, KSZ8565, KSZ9893, KSZ9563, KSZ8563, LAN9646, Phys(KSZ9031/9131, LAN8770
Inconsistent port status - Link issue - Transmit failure #32

Open Letrab opened 4 years ago

Letrab commented 4 years ago

Hello,

We are evaluating the KSZ9477 and ATSAMA5D36 (with EVB-KSZ9477 UNG_8071 rev B). With a clean, stock build from this repository (kernel 4.9), we are seeing link issues when interconnecting development boards.

It is not consistently reproducible, nevertheless there is an easy way to trigger the issue:

PC - DEV1 - DEV2 - DEV3

Connect your interface to DEV1, random port. Link through DEV2, random port. Link through DEV3, random port.

Let the PC generate some traffic (we broadcast some UDP packets). Ping all development boards. Repeatedly disconnect and reconnect random links.

After a while (let's say 20 link toggles), one or more pings (depending on which port is "broken") suddenly go unanswered.

When connecting to the "faulty board" and reading the MIB counters of the broken port (e.g. cat /sys/class/net/eth0/sw3/3_mib), we see that tx_discards is increasing. We haven't found a way to recover from this issue other than rebooting the board.
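A minimal sketch of how the tx_discards check could be automated from two snapshots of the MIB file. The exact line layout of 3_mib is not shown in this thread, so the parser assumes each counter sits on its own line as a name followed by an integer, optionally separated by `=` or `:`; `parse_mib` and `discards_increased` are hypothetical helper names.

```python
import re

# Assumption: one counter per line, "name = value", "name: value" or "name value".
_COUNTER_RE = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*[=:]?\s*(\d+)\s*$")


def parse_mib(text):
    """Return {counter_name: value} for every line matching the assumed layout."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def discards_increased(before, after, name="tx_discards"):
    """True when the named counter grew between two snapshots."""
    return after.get(name, 0) > before.get(name, 0)
```

On a board, the two snapshots would be taken by reading /sys/class/net/eth0/sw3/3_mib a few seconds apart and passing each result through `parse_mib`.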

Things tried:

• Force power off/on toggle on PHY:

echo 0 > /sys/class/net/eth0/sw3/3_power
echo 1 > /sys/class/net/eth0/sw3/3_power

• Force new gigabit negotiation of PHY:

echo 1000 > /sys/class/net/eth0/sw3/3_speed
echo 0 > /sys/class/net/eth0/sw3/3_speed

• Toggle duplex (although this should have the same effect as speed):

echo 1 > /sys/class/net/eth0/sw3/3_duplex
echo 0 > /sys/class/net/eth0/sw3/3_duplex

• Toggle transmit:

echo 0 > /sys/class/net/eth0/sw3/3_tx
echo 1 > /sys/class/net/eth0/sw3/3_tx

(Unloading/loading the driver has not been tested yet)
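For anyone repeating the attempts above in a loop, here is a rough sketch that replays the same four sysfs toggles. The attribute names and values are the ones quoted above; the helper names (`write_attr`, `toggle`, `run_attempts`) and the default base path and port are illustrative assumptions. Note that none of these toggles recovered the port in our tests.

```python
import os

# (attribute, first value, second value), matching the echo commands above.
RECOVERY_ATTEMPTS = [
    ("power", 0, 1),
    ("speed", 1000, 0),
    ("duplex", 1, 0),
    ("tx", 0, 1),
]


def write_attr(path, value):
    """Write a single value to a sysfs-style attribute file."""
    with open(path, "w") as handle:
        handle.write(f"{value}\n")


def toggle(base, port, attr, first, second):
    """Write `first` then `second` to <base>/<port>_<attr>; return the path."""
    path = os.path.join(base, f"{port}_{attr}")
    write_attr(path, first)
    write_attr(path, second)
    return path


def run_attempts(base="/sys/class/net/eth0/sw3", port=3):
    """Replay every toggle in RECOVERY_ATTEMPTS for one port."""
    return [toggle(base, port, attr, a, b) for attr, a, b in RECOVERY_ATTEMPTS]
```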

We also tried the DSA driver and could not reproduce the issue with it. So we expect (and hope) that it is a driver (software) issue somewhere?

Is it advisable to use the DSA driver? Is it complete? Is it possible to run PTP on the DSA lan interfaces?

Many thanks in advance! Kind regards.

danglin44 commented 4 years ago

We see the same issue. We opened a case with Microchip a couple of months ago but there has been no resolution so far.

Letrab commented 4 years ago

Thanks for your confirmation, we're not alone. Did you have any follow-up on the issue? Any workaround so far?

@triha2work , are you involved in this issue as well? Can we help debugging or something?

Thanks!

danglin44 commented 4 years ago

On 2019-11-06 2:39 p.m., Bartel wrote:

> Thanks for your confirmation, we're not alone.

For what it's worth, the case is # 00460748.

> Did you have any follow-up on the issue? Any workaround so far?

There have been three individuals on the case. They have pointed us to errata modules 17 and 20. However, there have been no suggestions as to how to work around a stuck transmitter queue.

We tried resetting the PHY and some other things but they didn't help.  I believe it's a hardware issue in the switch.

It's in some way related to traffic.  I have an AVB setup between two ports and can trigger the problem with three or four disconnects/reconnects.  We are also able to trigger the problem doing a big file transfer from a NAS server to a PC.

Your comment about DSA made me wonder if this is a race condition in removing DSA tags.

We need DSA for our application. I spent some time fixing the memory leaks in the KSZ ptp4l implementation, and we designed a board around the KSZ9477. However, this is all on hold for now.

> @triha2work https://github.com/triha2work , are you involved in this issue as well? Can we help debugging or something?

Regards, Dave Anglin

-- John David Anglin danglin@luxcom.com Luxcom Technologies Inc.

Letrab commented 4 years ago

Thanks a lot for your insights Dave! Much appreciated!

Just one thing to make sure: the DSA I was referring to was the Distributed Switch Architecture driver, so not the "custom device driver". Can you confirm you are using this DSA driver as well?

I was not able to reproduce the issue when using the DSA driver, only when using the custom device driver. But if you are using it as well, I will recheck.

danglin44 commented 4 years ago

On 2019-11-06 4:04 p.m., Bartel wrote:

> Just one thing to make sure: the DSA I was referring to was the Distributed Switch Architecture driver, so not the "custom device driver". Can you confirm you are using this DSA driver as well?

We have the following in Linux .config:

CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_BRIDGE_VLAN_FILTERING is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set

I believe "DSA support" on KSZ9477 requires adding trailing tags on packets to/from the CPU.  This is needed for STP, PTP, etc.  This is tail tagging mode.

I don't think tail tag mode supports interconnected devices. It's only one byte in the to-CPU direction. One would need a Marvell switch to support full Distributed Switch Architecture. The Marvell DSA tag has the switch ID in the tag and the initial input port/trunk is passed through to the CPU. There is also a port mapping table to set up.
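To make the one-byte tail tag concrete, here is a small sketch of splitting it off a frame received by the CPU. It assumes the frame check sequence has already been stripped, that PTP timestamp bytes are not appended, and that the low three bits of the tag byte carry a zero-based source port; that bit layout is an assumption here and should be checked against the datasheet. `split_tail_tag` is a hypothetical name.

```python
def split_tail_tag(frame: bytes):
    """Split a to-CPU frame into (payload, source_port).

    Assumptions: FCS already removed, exactly one trailing tag byte,
    source port in the low three bits (zero-based).
    """
    if len(frame) < 1:
        raise ValueError("frame too short to carry a tail tag")
    payload, tag = frame[:-1], frame[-1]
    return payload, tag & 0x07
```

With only a port number in the tag there is no room for a switch ID, which is why the tag cannot identify which of several cascaded switches a frame entered through.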

Dave

Letrab commented 4 years ago

I think DSA is not being used in your configuration. CONFIG_HAVE_NET_DSA does not compile any extra files. It only enables some dependencies (which are probably not used, as CONFIG_NET_DSA is not enabled).
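A quick way to check this on any kernel tree is to look at CONFIG_NET_DSA itself rather than CONFIG_HAVE_NET_DSA. The sketch below parses .config text using the standard kernel convention that a disabled option appears as "# CONFIG_X is not set"; the function names are hypothetical.

```python
def config_state(config_text, option):
    """Return the option's value ('y', 'm', ...) or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
    return None


def dsa_is_built(config_text):
    """True only when the DSA core is built in or built as a module."""
    return config_state(config_text, "CONFIG_NET_DSA") in ("y", "m")
```

For the configuration quoted above, `dsa_is_built` returns False even though CONFIG_HAVE_NET_DSA is y.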

Thank you for the lead on errata 20.

We noticed that by disabling PTP altogether (setting all bits of the Global PTP Message Config 1 Register, reg 0x0515, to 0), we could not reproduce the issue.

This got us wondering.

Looking at this register (Global PTP Message Config 1 Register, 0x0515), we noticed that bit 0, the PTP 1/2-step bit, is set to 0 in the default configuration, i.e. 2-step, which contradicts the errata. By setting this to 1-step (setting the bit to 1), our problem (the use case scenario above) seems to be resolved. However, the errata mentions that some PTP packets are dropped, whereas in our case all packets seem to be dropped...
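For clarity, the bit manipulation we mean is sketched below. It only computes the new register value; reading and writing the register on the switch is not shown and depends on your register access tool. Treating the value at 0x0515 as a single byte is an assumption, and the constant and function names are illustrative.

```python
PTP_MSG_CONF1_ADDR = 0x0515  # register address as quoted in this thread
PTP_ONE_STEP_BIT = 0x01      # bit 0: 0 = 2-step, 1 = 1-step (per this thread)


def select_one_step(reg_value):
    """Return the register value with the 1-step bit set, other bits kept."""
    return (reg_value | PTP_ONE_STEP_BIT) & 0xFF


def select_two_step(reg_value):
    """Return the register value with the 1-step bit cleared, other bits kept."""
    return reg_value & ~PTP_ONE_STEP_BIT & 0xFF


def is_one_step(reg_value):
    """True when bit 0 is set."""
    return bool(reg_value & PTP_ONE_STEP_BIT)
```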

We noticed that the problem propagates towards the PTP master clock device. E.g. if the problem happens on DEV3 above and you reset it, the problem then seems to occur all the time on DEV2. After resetting DEV2, the problem occurs on DEV1. After resetting DEV1, the whole system functions properly again.

Doesn't seem to be an 802.1AS specific problem (as we disabled 802.1AS and it still occurred).

Can you try this as well maybe @danglin44 ?

Thanks!

danglin44 commented 4 years ago

Hi Bartel,

On 2019-11-07 8:17 a.m., Bartel wrote:

> Thank you on your lead on errata 20. This got us wondering. By looking at register 0x0515, we noticed bit 0, being PTP 1/2-step is set to 0 in default configuration, being 2-step which contradicts the errata. By setting this to 1-step (setting the bit to 1), our problem (the above use case scenario) seems to be resolved. But the errata mentions that /some PTP packets/ are being dropped, but it seems in a case that /all packets/ are being dropped...

Bit 0 in register 0x515 is 0 in my setup.  Presumably, the bit is set to 0 because twoStepFlag is 1 (default) in ptp4l config. I have assume_two_step set to 1 in espressobin config (see below).

I set the 1/2-step bit to 1 with regs_bin. The two Biamp AVB units were still happy and I didn't see any lost timestamps. I tried a number of cable disconnects and reconnects and the AVB connection came back every time. I then hooked up an espressobin board with Linux 5.3.8 (Marvell DSA driver is hacked) and trunk ptp4l. It didn't sync with the EVB-KSZ9477. I reset the PTP 1/2-step bit to 0 and it synced. I then set the bit again and it lost PTP sync (it went to master). I then power cycled the Biamp unit receiving AVB packets. The port connected to it locked up and started dropping all packets.

So, it seems one-step mode does not fully resolve the transmit failure.  It also causes compatibility issues with PTP devices that operate in two step mode.

It could be the PTP timestamp handling somehow causes the output queues to stall and drop all packets.

-- John David Anglin dave.anglin@bell.net

danglin44 commented 4 years ago

It was suggested that I try the following and see if it fixes the issue: https://microchip.my.salesforce.com/sfc/p/o0000000KAkK/a/3l000000gSdH/LG.sszSVrpC.IaIcq9_OfZL1ySTFgTIqWaeyPskOkGE

wmoors commented 4 years ago

@danglin44 gPTP (802.1AS) is spec'd as 2-step, so any kind of frame validation in another stack can potentially dismiss these frames, or simply get the sync wrong by ignoring the correctionField in the header, since 2-step specifies that the follow-up messages carry the residence times in their correction field.

It's hard to say what exactly is causing the BMCA to fail with the espressobins, but gut feeling says asCapable is going to be false (so the peer delay measurement is going wrong).

1588, however, clearly states that 1-step and 2-step should be able to be mixed. So any clock (ordinary/boundary) should take the correctionFields of both frames (sync and follow-up) to calculate offsets.
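As a rough illustration of that point, the sketch below computes the offset from master with the correction fields of both messages applied. It assumes timestamps in nanoseconds and correctionField values in their on-wire unit of nanoseconds scaled by 2^16; the function names are hypothetical and this is not taken from any particular stack.

```python
CORRECTION_SCALE = 1 << 16  # correctionField carries nanoseconds * 2**16


def correction_ns(sync_corr, follow_up_corr=0):
    """Total correction in nanoseconds from Sync plus Follow_Up fields."""
    return (sync_corr + follow_up_corr) / CORRECTION_SCALE


def offset_from_master_ns(t1_ns, t2_ns, mean_path_delay_ns,
                          sync_corr, follow_up_corr=0):
    """Offset = t2 - t1 - meanPathDelay - (Sync + Follow_Up corrections).

    t1 is the origin timestamp (from Follow_Up in 2-step operation),
    t2 is the local ingress timestamp of the Sync.
    """
    return (t2_ns - t1_ns - mean_path_delay_ns
            - correction_ns(sync_corr, follow_up_corr))
```

A receiver that drops the Follow_Up term ends up with an offset that is wrong by exactly the residence time accumulated in that field.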

So I fear this 1-step hack is no good solution in an AVB setup. For 1588 I think it is...

danglin44 commented 4 years ago

Port 2 on espressobin is connected to EVB with 1-step bit set in register 0x0515:

Nov 8 11:12:44 localhost ptp4l: [75346.678] port 2: delay timeout
Nov 8 11:12:44 localhost ptp4l: [75346.682] negative delay -140923476
Nov 8 11:12:44 localhost ptp4l: [75346.682] delay = (t2 - t3) * rr + (t4 - t1)
Nov 8 11:12:44 localhost ptp4l: [75346.682] t2 - t3 = +121
Nov 8 11:12:44 localhost ptp4l: [75346.682] t4 - t1 = -281847074
Nov 8 11:12:44 localhost ptp4l: [75346.682] rr = 0.000305848
Nov 8 11:12:44 localhost ptp4l: [75346.682] delay filtered -890379929 raw -140923476
Nov 8 11:12:44 localhost ptp4l: [75346.709] port 1: master sync timeout
Nov 8 11:12:44 localhost ptp4l: [75346.791] port 2: master sync timeout

So, it looks like the peer delay measurement is going wrong.
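As a sanity check on the numbers in that log, halving the plain sum of the two logged differences reproduces the logged raw delay. Whether ptp4l computes the raw value exactly this way internally is an assumption; the sketch only shows the arithmetic, with truncation toward zero as C integer division would do, and `raw_peer_delay` is a hypothetical name.

```python
def raw_peer_delay(t2_minus_t3, t4_minus_t1):
    """Half of the summed differences, truncated toward zero (nanoseconds)."""
    total = t2_minus_t3 + t4_minus_t1
    if total < 0:
        return -((-total) // 2)
    return total // 2


# Values taken from the log above.
logged = raw_peer_delay(121, -281847074)
```

The t4 - t1 term is about -0.28 s, so the result is dominated by inconsistent timestamps rather than by any real cable delay.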

Forcing asCapable didn't help. Same for assume_two_step and follow_up_info.

1-step isn't supported by espressobin hardware (88E6341) as far as I know.

So, I agree 1-step hack is not a good solution for AVB.

danglin44 commented 4 years ago

Microchip has forwarded the case to its internal team and increased the severity to serious.

I tried the sama5_ksz_dsa_defconfig available with linux-4.9.143. However, it didn't work. Didn't try to figure out what was wrong.

wmoors commented 4 years ago

I haven't looked into the dsa defconfig, but I think it doesn't even create the /dev/ptp0 device, so I don't think that will work for any PTP related setup. :) As far as my colleague (@Letrab) tested, DSA is just good enough for a managed 802.1Q bridge. Nothing PTP or AVB related is supported.

Nevertheless, Microchip taking a better look at the issue is good news!

danglin44 commented 4 years ago

Microchip's internal team said the transmit failure is caused by using 2-step mode.

There is a use_one_step config option in gPTP.cfg. When it is enabled, my EVB syncs with espressobin but not with Biamp TESIRA units.

danglin44 commented 4 years ago

Tried three more AVB devices with EVB. None sync when use_one_step is 1.

Latest PDelay patch didn't help.

It looks to me like best master clock selection is broken when use_one_step is 1.

danglin44 commented 4 years ago

There is only one place in port.c that uses clock_one_step(). Disabling this hunk breaks espressobin, but now the Biamp TESIRA units sync.

    if (0 && clock_one_step(p->clock)) {
            if (get_hw_version(p->clock) < 2) {
                    rsp->header.reserved2 = ((m->hwts.ts.tv_sec & 3) << 30)
                            | m->hwts.ts.tv_nsec;
                    rsp->header.reserved2 = htonl(rsp->header.reserved2);
            }
            if (get_hw_version(p->clock) >= 2) {
                    p->p2p_sec = m->hwts.ts.tv_sec;
                    p->p2p_nsec = m->hwts.ts.tv_nsec;
            }
    }
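
For readers who don't want to decode the shifts, here is a small sketch of what the first branch of that hunk stores in reserved2: the low two bits of the seconds in the top two bits, the nanoseconds in the remaining 30 bits, in network byte order. The Python names are illustrative; only the bit layout comes from the C code above.

```python
import struct


def pack_reserved2(tv_sec, tv_nsec):
    """Pack (sec & 3) << 30 | nsec as a big-endian 32-bit value, like htonl()."""
    if not 0 <= tv_nsec < 1_000_000_000:
        raise ValueError("tv_nsec must be within one second")
    value = ((tv_sec & 3) << 30) | tv_nsec
    return struct.pack("!I", value)


def unpack_reserved2(data):
    """Return (seconds modulo 4, nanoseconds) from the packed field."""
    (value,) = struct.unpack("!I", data)
    return value >> 30, value & 0x3FFFFFFF
```

Nanoseconds always fit because 10^9 is below 2^30.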

danglin44 commented 4 years ago

The conclusion of this saga is that the device is not gPTP compliant and is not recommended for AVB.

Microchip recommends using VSC7513 or VSC7514 for AVB applications.