cisco / exanic-software

ExaNIC drivers, utilities and development libraries

exanic-clock-sync have 300ns bias #76

Open hchechao2 opened 1 year ago

hchechao2 commented 1 year ago

exanic-software-2.7.3 on rhel9.0-5.14-70.13.1.0.3: with either exanic-clock-sync or ptp4l, there is a 300ns bias between the HW timestamp and the system clock. Does anybody have a clue about this?

bai-jian commented 1 year ago

I hit the same bug and have no clues either.

miland-magmio commented 1 year ago

How do you verify or check that there's a 300ns bias between the HW timestamp and the system clock?

If you're using exanic-clock-check, please note that this utility isn't precise. It takes the system clock in microseconds, so the difference you're seeing there is just a rounding error.

If you have another method for verifying the FPGA clock and host clock are in sync, please let me know.

EDIT: an earlier version of this post said exanic-clock-sync is not precise; it should have said exanic-clock-check.

vient commented 1 year ago

If you use exanic-clock-check, note that it is really simple and does not account for PCI latency which may very well be around 300ns.

Alexxstud commented 1 year ago

> If you're using exanic-clock-sync, please note that this utility isn't precise. It takes the system clock in microseconds, so the difference you're seeing there is just a rounding error.

@miland-magmio, could you please provide more details on this statement? If ptp4l or ptp2d is running, then the OS clock should have ns precision (offset from master ~ O(10 ns)), so why does exanic-clock-sync take micros? There is almost no documentation on exanic-clock-sync; did you look in the code?

> How do you verify or check that there's a 300ns bias between the HW timestamp and the system clock?

OP is probably referring to the results of exanic-clock-check after exanic-clock-sync --daemon exanic$n:sys.

miland-magmio commented 1 year ago

I asked Cisco support, and they confirmed that exanic-clock-check uses microseconds for the host clock. Also, if you look carefully at the exanic-clock-check output, you can see that the host timestamp is in microseconds, and the reported difference is always the sub-microsecond part (the last three digits) of the FPGA clock in nanoseconds:

Device exanic0: 545692285137547128 ticks (1693299828270678372 ns since epoch)
Host clock: 1693299828270678 us since epoch
Difference: 372 ns

Notice the exanic0 time ends in 372 ns, and the reported difference is 372 ns. It's always like that. But it doesn't really say anything about the offset between the exanic clock and the host clock.
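The arithmetic can be checked directly with the numbers from the output above. This is a minimal Python sketch that only models the subtraction; the function name is hypothetical and it is not taken from the exanic-clock-check source:

```python
def reported_difference_ns(device_ns: int, host_us: int) -> int:
    """Subtract a microsecond-resolution host time from a nanosecond device time."""
    return device_ns - host_us * 1000


# Values copied from the exanic-clock-check output quoted above.
device_ns = 1693299828270678372
host_us = 1693299828270678

diff = reported_difference_ns(device_ns, host_us)
print(diff)  # 372

# Hypothetical case: the host clock equals the device clock exactly,
# but is truncated to whole microseconds before the subtraction.
host_us_if_in_sync = device_ns // 1000
print(reported_difference_ns(device_ns, host_us_if_in_sync))  # 372
```

Both cases print the same value, so under this model a reading of 372 ns cannot distinguish a perfectly synchronized clock from one that is off by a few hundred nanoseconds.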

If you run exanic-clock-sync in foreground, it actually reports the offset. You can also enable syslog logging of the difference with --syslog.

exanic0: Starting clock discipline using system clock
exanic0: Current TAI offset is 0
exanic0: Clock offset from system: -43.528 us  drift: 0.000 ppm
exanic0: Clock offset from system: -9.245 us  drift: -9.277 ppm
exanic0: Clock offset from system: -0.015 us  drift: -9.244 ppm
exanic0: Clock offset from system: -0.009 us  drift: -9.236 ppm
exanic0: Clock offset from system: -0.003 us  drift: -9.246 ppm

That said, we have a client reporting a difference of about 300-400ns between the HW timestamps from a Solarflare card and an ExaNIC card for the same packet received from an L1 switch. That would suggest the ExaNIC is actually 300-400ns off from the PTP clock, but I'm not sure how to debug that (exanic-clock-sync reports single-digit ns differences, as in the log above).
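To track the offsets that exanic-clock-sync prints in the foreground, the log lines can be converted to nanoseconds. The sketch below assumes the line format is exactly what was pasted above (it may differ between versions), and the function name is hypothetical:

```python
import re
from decimal import Decimal

# Pattern based only on the log lines quoted in this thread (an assumption).
OFFSET_LINE = re.compile(
    r"^(?P<dev>\S+): Clock offset from system: "
    r"(?P<offset_us>-?\d+(?:\.\d+)?) us\s+"
    r"drift: (?P<drift_ppm>-?\d+(?:\.\d+)?) ppm"
)


def parse_offsets_ns(lines):
    """Return (device, offset in ns, drift in ppm) for each matching log line."""
    results = []
    for line in lines:
        match = OFFSET_LINE.match(line.strip())
        if match is None:
            continue  # skip startup and TAI offset messages
        # Decimal avoids float rounding when scaling microseconds to nanoseconds.
        offset_ns = int(Decimal(match["offset_us"]) * 1000)
        results.append((match["dev"], offset_ns, Decimal(match["drift_ppm"])))
    return results


log = """\
exanic0: Starting clock discipline using system clock
exanic0: Current TAI offset is 0
exanic0: Clock offset from system: -43.528 us  drift: 0.000 ppm
exanic0: Clock offset from system: -9.245 us  drift: -9.277 ppm
exanic0: Clock offset from system: -0.015 us  drift: -9.244 ppm
exanic0: Clock offset from system: -0.009 us  drift: -9.236 ppm
exanic0: Clock offset from system: -0.003 us  drift: -9.246 ppm
""".splitlines()

for dev, offset_ns, drift_ppm in parse_offsets_ns(log):
    print(dev, offset_ns, drift_ppm)
```

Note that these offsets are what the daemon itself measures, so a constant read-latency bias in that measurement would not show up here.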

miland-magmio commented 1 year ago

Sorry, I actually meant exanic-clock-check in my first post, not exanic-clock-sync. The exanic-clock-sync daemon should indeed use nanoseconds, as it also reports the drift in nanoseconds.

Alexxstud commented 1 year ago

Thanks, now it is clear.

hchechao2 commented 1 year ago

1. By default, exanic-clock-check uses gettimeofday, which only has microsecond resolution. I changed it to use clock_gettime, and then it shows the 300ns bias.
2. But exanic-clock-check is not actually the reason I suspected a problem. I found the bias by using an optical splitter to compare the ExaNIC's HW timestamp with that of another reference NIC.
3. exanic-clock-sync and ptp4l both use the kernel interface, and this "300ns" is related to PCIe performance, which varies between server platforms.
4. I have used some tricks to apply a correction in the exanic-ptp kernel module.
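One common way to reduce a bus-latency bias of this kind is to bracket the device read with two host clock reads and compare against the midpoint. The sketch below is a toy simulation, not the correction used in the exanic-ptp module (which is not shown in this thread). All names are hypothetical, and it assumes the device latches its time halfway through a 600 ns read round trip while the two clocks are otherwise perfectly in sync:

```python
class SimulatedBus:
    """Toy model: host and device share one timeline; a device read takes time."""

    def __init__(self, round_trip_ns: int = 600):
        self.now_ns = 1_000_000_000
        self.round_trip_ns = round_trip_ns

    def read_host_ns(self) -> int:
        # Host clock reads are treated as free in this model (an assumption).
        return self.now_ns

    def read_device_ns(self) -> int:
        # ASSUMPTION: the device latches its time halfway through the round trip.
        latched = self.now_ns + self.round_trip_ns // 2
        self.now_ns += self.round_trip_ns
        return latched


def naive_offset_ns(read_host_ns, read_device_ns) -> int:
    """Read host, then device, and subtract: includes the one-way read latency."""
    host = read_host_ns()
    device = read_device_ns()
    return device - host


def midpoint_offset_ns(read_host_ns, read_device_ns, samples: int = 8):
    """Bracket each device read with host reads; keep the tightest window."""
    best_window = None
    best_offset = None
    for _ in range(samples):
        before = read_host_ns()
        device = read_device_ns()
        after = read_host_ns()
        window = after - before
        if best_window is None or window < best_window:
            best_window = window
            best_offset = device - (before + window // 2)
    return best_offset, best_window


bus = SimulatedBus(round_trip_ns=600)
print(naive_offset_ns(bus.read_host_ns, bus.read_device_ns))     # 300
print(midpoint_offset_ns(bus.read_host_ns, bus.read_device_ns))  # (0, 600)
```

In this model the naive measurement shows a 300 ns bias even though the clocks agree, and the midpoint estimate removes it. On real hardware the read path may not be symmetric, so the midpoint would only approximate the true latch time.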