fmadio / public

fmadio issue tracking

f80 qsfp port fails to linkup #254

Closed fmadio closed 6 years ago

fmadio commented 6 years ago

Seems that if there's a discrete clock jump via fnic_clock, the 322mhz clock completely blows up. Guess it's not synchronizing the data as it crosses from the 250mhz to the 322mhz domain, causing the time to get way off, which blows up the iterative adjustment.

fmadio commented 6 years ago

It's happening because NTP is slewing the clock aggressively, causing the 250mhz clock to be ~550usec behind, while the 322mhz clock is slower to adapt and is ~27msec behind.

What's worse, the 322mhz clock is slewing in the opposite direction to the 250mhz clock, as the time jump update is based on the 250mhz clock.

fmadio commented 6 years ago

fnic_clock was not checking the diff on the 322mhz domain. What it also needs to do, when there is a forced time set, is reset the default clock phase and reset the EMA averages.
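A minimal sketch of that behaviour, not the actual fnic_clock code: every domain's offset is checked, and a forced time set resets both the phase and the EMA state for all domains. The names `DomainTracker`, `step`, `alpha` and `limit_ns` are hypothetical, as is the 1 msec limit in the usage lines.

```python
from dataclasses import dataclass


@dataclass
class DomainTracker:
    """Smoothed offset of one clock domain against world time (illustrative)."""
    default_phase: float          # hypothetical nominal phase setting
    alpha: float = 0.1            # assumed EMA smoothing factor
    phase: float = 0.0
    ema_offset_ns: float = 0.0

    def __post_init__(self):
        self.phase = self.default_phase

    def update(self, offset_ns: float) -> None:
        # Standard exponential moving average update.
        self.ema_offset_ns += self.alpha * (offset_ns - self.ema_offset_ns)

    def reset(self) -> None:
        # On a forced time set, drop all history and return to the default phase.
        self.phase = self.default_phase
        self.ema_offset_ns = 0.0


def step(trackers, offsets_ns, limit_ns):
    """Feed one offset sample per domain; return True if a forced set occurred."""
    # Check every domain, not only the 250mhz one.
    if any(abs(o) > limit_ns for o in offsets_ns.values()):
        for tracker in trackers.values():
            tracker.reset()
        return True
    for name, offset in offsets_ns.items():
        trackers[name].update(offset)
    return False


trackers = {"250mhz": DomainTracker(1.0), "322mhz": DomainTracker(1.0)}
# Offsets from the earlier comment: ~550usec and ~27msec behind.
forced = step(trackers, {"250mhz": 550_000, "322mhz": 27_000_000}, limit_ns=1_000_000)
```

With the assumed 1 msec limit, the 27 msec offset on the 322mhz domain alone is enough to trigger the forced set, which is the case the original check missed.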

fmadio commented 6 years ago

Have reproduced the issue locally. The top logfile shows the correct phase of the 322mhz clock, whereas the problems show up when the clock phase is way, way off... 250ps off.

image

fmadio commented 6 years ago

In addition, phantom packet sizes show up. They should all be 8192-9216, with nothing else...

image

fmadio commented 6 years ago

what actually gets written is kind of interesting too

not exactly sure where the 01000000 payload is being sourced from.

image

fmadio commented 6 years ago

Seems this is clearly a 40G PCS clock generation/stability issue, most likely a reset sequence issue.

fmadio commented 6 years ago

soft reset does not clear the error

fmadio commented 6 years ago

clock check after weirdness. bitstream has been re-programmed

fmadio commented 6 years ago

edit: tried on a different run when it fails to link up. the result is the same, ref clocks are good

image

fmadio commented 6 years ago

qsfp rx clocks are pretty strange; they should be roughly in sync with the Tx clock

image

fmadio commented 6 years ago

qsfp0 rx clock is clearly slower than it should be. seems like this is a problem with the QSFP transceiver reset sequence.

image
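One way to put a number on "slower than it should be" is the ppm offset between the recovered Rx clock and the local Tx clock. This is only a sketch: the function names and the 200 ppm limit are assumptions (the limit comes from assuming the usual +/-100 ppm Ethernet tolerance at each end), and the frequencies in the usage lines are made up, not readings from this system.

```python
def ppm_offset(measured_hz: float, reference_hz: float) -> float:
    """Signed offset of measured_hz from reference_hz in parts per million."""
    return (measured_hz - reference_hz) / reference_hz * 1e6


def rx_clock_ok(rx_hz: float, tx_hz: float, limit_ppm: float = 200.0) -> bool:
    """True if the recovered Rx clock is within limit_ppm of the Tx clock."""
    return abs(ppm_offset(rx_hz, tx_hz)) <= limit_ppm


# Hypothetical readings, for illustration only.
tx_hz = 312.5e6
healthy = rx_clock_ok(tx_hz * (1 + 50e-6), tx_hz)   # 50 ppm fast
broken = rx_clock_ok(300.0e6, tx_hz)                # 4% slow
```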

fmadio commented 6 years ago

Appears it just fails to link up on port0

image

fmadio commented 6 years ago

using GTH Loopback mode, clocks are stable and it links up.

highly suspect this is a QSFP module bug: even though the QSFP reset pin goes high during link initialization, the QSFP module still refuses to link up.

fmadio commented 6 years ago

Reference good counters

image

fmadio commented 6 years ago

Transceiver information: Port 1 has the old fb transceiver. It's not clear if the problem is on the Tx or Rx side.

fmadio@fmadio80v1-095:/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0$ cat qsfp0_module
Present : yes
LinkState : up
Temperature : 44.000 C
Voltage : 3.207 V
Vendor : fmadio
PartNo : QSFP-SR4-40G
Mode : 40G-SR4 VCSEL Wavelength : 850nm
PCSStatus : 3000000003fffc43
fmadio@fmadio80v1-095:/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0$ cat qsfp1_module
Present : yes
LinkState : up
Temperature : 34.936 C
Voltage : 3.276 V
Vendor :
PartNo :
Mode : 40G CR Wavelength : Active
PCSStatus : 3000000003fffc43
fmadio@fmadio80v1-095:/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0$

fmadio commented 6 years ago

added a 312.5mhz cycle counter world time check to fnic_clock. if that is out of whack it will force a time sync and not blow up.
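Roughly, the check compares the time implied by the cycle counter with the world time that actually elapsed. The sketch below is an illustration under assumptions, not the fnic_clock implementation: `counter_time_ns`, `needs_forced_sync` and the 1000 ppm tolerance are all hypothetical.

```python
NOMINAL_HZ = 312.5e6  # nominal counter frequency


def counter_time_ns(cycles: int, freq_hz: float = NOMINAL_HZ) -> float:
    """Time the counter claims has passed, assuming it runs at freq_hz."""
    return cycles * 1e9 / freq_hz


def needs_forced_sync(cycles_elapsed: int, world_elapsed_ns: float,
                      tolerance_ppm: float = 1000.0) -> bool:
    """True if the counter disagrees with world time by more than tolerance_ppm."""
    if world_elapsed_ns <= 0:
        return True  # no usable reference interval, resync to be safe
    claimed_ns = counter_time_ns(cycles_elapsed)
    error_ppm = abs(claimed_ns - world_elapsed_ns) / world_elapsed_ns * 1e6
    return error_ppm > tolerance_ppm


# One second of world time with the counter at nominal rate: no sync needed.
ok = needs_forced_sync(312_500_000, 1e9)
# Counter only advanced 300M cycles in one second (clock running 4% slow).
bad = needs_forced_sync(300_000_000, 1e9)
```

A clock that is not really running at 312.5mhz shows up as a large ppm error here, so the code can force a time set instead of feeding the bad value into the iterative adjustment.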

fmadio commented 6 years ago

Summary of this bug:

1) The QSFP transceiver ends up in some weird state, resulting in the 40G PCS being unable to link up.

2) As the transceiver has not linked up, the output 312.5mhz clock is not actually 312.5mhz, which causes the 312.5mhz timestamp to drift significantly.

3) The QSFP transceiver is unable to clear the error; the only way is to power cycle the system.

4) Have added link uptime + link up counter to the HDL so this situation can be debugged more easily in the future.

closing.

fmadio commented 6 years ago

Updated port status info

image

fmadio commented 6 years ago

continuing this... qsfp*_reset_n pins were getting toggled by the IIC controller on the fpga. after removing this there have not been any link issues.

suspect the quick < 1usec reset toggle was not appreciated by the qsfp transceivers and led them to enter some kind of unusual state.

HDL update, will keep an eye out for this again.
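For spotting this kind of glitch in a captured trace, something like the sketch below could flag reset_n low pulses shorter than a chosen minimum. The trace format, the function name and the 2000 nsec threshold are assumptions for illustration; the real minimum reset pulse width should come from the transceiver datasheet.

```python
def short_low_pulses(edges, min_low_ns):
    """Return (fall_time_ns, width_ns) for each low pulse shorter than min_low_ns.

    edges: list of (time_ns, level) samples of reset_n in time order,
    where level is 1 (deasserted) or 0 (asserted, active low).
    """
    pulses = []
    fall_time = None
    prev_level = 1  # assume the pin idles high before the trace starts
    for time_ns, level in edges:
        if prev_level == 1 and level == 0:
            fall_time = time_ns                      # falling edge: reset asserted
        elif prev_level == 0 and level == 1 and fall_time is not None:
            width = time_ns - fall_time              # rising edge: reset released
            if width < min_low_ns:
                pulses.append((fall_time, width))
            fall_time = None
        prev_level = level
    return pulses


# Hypothetical trace: a 500 nsec glitch, then a proper 45 usec reset.
trace = [(0, 1), (1000, 0), (1500, 1), (5000, 0), (50000, 1)]
glitches = short_low_pulses(trace, min_low_ns=2000)
```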

fmadio commented 6 years ago

total power cycles 168, total failures 0.

previously failure occurred within ~ 20-30 power cycles.
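As a rough sanity check on how convincing 168 clean cycles is: assuming each power cycle used to fail independently with probability around 1 in 25 (a loose reading of "within ~20-30 power cycles", not a measured rate), the chance of seeing zero failures in 168 cycles if nothing had changed is small.

```python
def prob_no_failures(cycles: int, p_fail: float) -> float:
    """Probability of zero failures in `cycles` independent trials."""
    return (1.0 - p_fail) ** cycles


# Assumed old per-cycle failure rate: 1 in 25 (hypothetical, see note above).
p_clean = prob_no_failures(168, 1 / 25)
print(f"chance of 168 clean cycles with the old failure rate: {p_clean:.4%}")
```

Under that assumption the probability is on the order of 0.1%, so the clean run is very unlikely to be luck.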