fmadio closed this issue 6 years ago
It's happening because NTP is slewing the clock aggressively, causing the 250MHz clock to be ~550usec behind, while the 322MHz clock is slower to adapt and is ~27msec behind.
What's worse is the 322MHz clock is slewing in the opposite direction to the 250MHz clock, as the time jump update is based on the 250MHz clock.
fnic_clock was not checking the diff on the 322MHz domain. What it also needs to do, when there is a forced time set, is reset the default clock phase and reset the EMA averages.
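A minimal sketch of that fix, not the actual fnic_clock code: every name here (`DomainTracker`, `needs_forced_set`, `forced_time_set`), the EMA factor and the 1msec limit are assumptions for illustration. It checks the offset on every clock domain rather than only the 250MHz one, and a forced time set puts the phase back to its default and clears the EMA state.

```python
from dataclasses import dataclass


@dataclass
class DomainTracker:
    """Hypothetical per-clock-domain state (not the real fnic_clock struct)."""
    alpha: float = 0.1           # EMA smoothing factor (assumed value)
    default_phase: float = 0.0   # nominal phase restored after a forced set
    phase: float = 0.0
    ema_offset_ns: float = 0.0
    primed: bool = False         # False until the first sample is seen

    def update(self, offset_ns: float) -> float:
        """Fold one offset measurement into the exponential moving average."""
        if not self.primed:
            self.ema_offset_ns = offset_ns
            self.primed = True
        else:
            self.ema_offset_ns += self.alpha * (offset_ns - self.ema_offset_ns)
        return self.ema_offset_ns

    def reset(self) -> None:
        """Drop history so stale pre-jump samples cannot steer the servo."""
        self.phase = self.default_phase
        self.ema_offset_ns = 0.0
        self.primed = False


def needs_forced_set(trackers, offsets_ns, limit_ns):
    """True if ANY domain is outside the limit, not just the 250MHz one."""
    return any(abs(offsets_ns[name]) > limit_ns for name in trackers)


def forced_time_set(trackers):
    """After stepping the time, reset phase and EMA on every domain."""
    for tracker in trackers.values():
        tracker.reset()
```

With the offsets reported above (~550usec on the 250MHz domain, ~27msec on the 322MHz domain) and an assumed 1msec limit, checking only the 250MHz domain would not trigger a forced set, while checking both does.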
have reproduced the issue locally. The top logfile shows the correct phase of the 322MHz clock, whereas the problems show up when the clock phase is way off, 250ps off.
In addition, the phantom packet sizes show up. It should be all 8192-9216 with nothing else.
what actually gets written is kind of interesting too
not exactly sure where the 01000000 payload is being sourced from.
Seems this is clearly a 40G PCS clock generation/stability issue, most likely in the reset sequence.
soft reset does not clear the error
clock check after weirdness. bitstream has been re-programmed
edit: tried on a different run when it fails to link up. the result is the same, ref clocks are good
qsfp rx clocks are pretty strange; they should be roughly in sync with the Tx clock.
qsfp0 rx clock is clearly slower than it should be. seems like this is a problem with the QSFP transceiver reset sequence.
Appears it just fails to link up on port0
using GTH Loopback mode, clocks are stable and it links up.
highly suspect this is a QSFP module bug. even though the QSFP reset pin goes high during link initialization.. QSFP module still refuses to link up.
Reference good counters
Transceiver information: Port 1 has the old fb transceiver. It's not clear if the problem is on the Tx or Rx side.
fmadio@fmadio80v1-095:/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0$ cat qsfp0_module
Present : yes
LinkState : up
Temperature : 44.000 C
Voltage : 3.207 V
Vendor : fmadio
PartNo : QSFP-SR4-40G
Mode : 40G-SR4 VCSEL
Wavelength : 850nm
PCSStatus : 3000000003fffc43
fmadio@fmadio80v1-095:/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0$ cat qsfp1_module
Present : yes
LinkState : up
Temperature : 34.936 C
Voltage : 3.276 V
Vendor :
PartNo :
Mode : 40G CR
Wavelength : Active
PCSStatus : 3000000003fffc43
added a 312.5MHz cycle counter world time check to fnic_clock. If that is out of whack it will force a time sync and not blow up.
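Roughly what such a check could look like, as a sketch only: the function names and the tolerance are hypothetical, and the real fnic_clock logic may differ. It converts elapsed cycles to nanoseconds at the nominal 312.5MHz rate and compares that against elapsed world time.

```python
NOMINAL_HZ = 312_500_000  # nominal cycle counter rate (312.5MHz)


def cycles_to_ns(cycles: int, hz: int = NOMINAL_HZ) -> int:
    """Convert a cycle count to nanoseconds, assuming the clock runs at `hz`."""
    return cycles * 1_000_000_000 // hz


def check_cycle_clock(cycles_elapsed: int, world_elapsed_ns: int,
                      tolerance_ns: int):
    """Return (force_sync, error_ns) for one measurement interval.

    error_ns is counter-derived time minus world time; a clock running
    slower than nominal gives a negative error.
    """
    error_ns = cycles_to_ns(cycles_elapsed) - world_elapsed_ns
    return abs(error_ns) > tolerance_ns, error_ns
```

For example, if the counter only advances 300,000,000 cycles over one second of world time (a clock that is not really 312.5MHz), the counter-derived time is 40msec short, and any tolerance below that forces a sync instead of feeding the error into the iterative adjustment.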
Summary of this bug:
1) QSFP transceiver ends up in some weird state, resulting in the 40G PCS being unable to link up.
2) as the transceiver has not linked up, the output 312.5MHz clock is not actually 312.5MHz, which causes the 312.5MHz timestamp to drift significantly.
3) QSFP transceiver is unable to clear the error; the only way is to power cycle the system.
4) have added link uptime + link up counter to the HDL so this situation can be debugged more easily in the future.
closing.
Updated port status info
continuing this... qsfp*_reset_n pins were getting toggled by the IIC controller on the FPGA. After removing this there has not been any link issue.
suspect the quick < 1usec reset toggle was not appreciated by the QSFP transceivers and led them to enter some kind of unusual state.
HDL update, will keep an eye out for this again.
total power cycles 168, total failures 0.
previously failure occurred within ~ 20-30 power cycles.
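As a rough sanity check on those numbers (a sketch that assumes failures are independent with a constant per-cycle probability, and reads "within ~20-30 power cycles" as roughly 1 failure in 20 to 30 cycles), the chance of 168 clean cycles if nothing had changed can be estimated like this:

```python
def prob_all_pass(cycles: int, fail_rate_per_cycle: float) -> float:
    """Probability of `cycles` consecutive passes at a fixed failure rate."""
    return (1.0 - fail_rate_per_cycle) ** cycles


# Old behaviour assumed to fail about once every 20 to 30 power cycles.
for mean_cycles_to_failure in (20, 30):
    p = prob_all_pass(168, 1.0 / mean_cycles_to_failure)
    print(f"1 in {mean_cycles_to_failure}: P(168 clean cycles) = {p:.5f}")
```

Under those assumptions both cases come out well below 1%, so 168 clean power cycles is unlikely to be luck.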
Seems if there's a discrete clock jump via fnic_clock, the 322MHz clock completely blows up. Guess it's not synchronizing the data as it crosses from the 250MHz to the 322MHz domain, causing the time to get way off, which blows up the iterative adjustment.