NXP / isochron

Tool for Time Sensitive Networking testing
GNU General Public License v2.0

Bizarre Uplink Latency measurements #19

Open stefan-ramdhan opened 11 months ago

stefan-ramdhan commented 11 months ago

Hello,

I'm using isochron to measure the uni-directional latency (from master to slave, and slave to master). I am doing this over a 5G network, so the synchronization quality is not very good. Sometimes the latency is measured as negative (as in, the tx timestamp is later than the rx timestamp). I assume this happens because isochron treats the devices as synchronized when they really are not, since the Time Error is large, so I figured I just needed to adjust isochron's latency measurement by the Time Error (measured by PPS).

When looking at the uplink and downlink latencies (calculated by just taking ts_rx - ts_tx), I noticed that the uplink latency was very bizarre. It looks like this:

image

We see in the picture above that the latency not only goes negative, but is inversely correlated with the PPS output. Even if I were to adjust by the PPS Time Error, we would still see latencies at certain times where they dip down drastically from one second to the other. Do you have any idea why this would be? Why we see those drastic dips in latency?

I adjusted the latency calculation based on the Time Error measured by the PPS:

t_ms = ts_rx - PPS Time Error - ts_tx
t_sm = ts_rx + PPS Time Error - ts_tx

where t_ms is the master-to-slave latency and t_sm is the slave-to-master latency, both computed from isochron's timestamps, ts_rx is the HW timestamp on receive, and ts_tx is the HW timestamp on transmit. The adjusted result looks like this:

image
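
For reference, the Time Error adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the "ms"/"sm" labels, and all numbers are hypothetical, and the Time Error is assumed to mean (slave clock - master clock), which is the sign convention implied by the two formulas.

```python
def adjust_one_way_latency(ts_tx_ns, ts_rx_ns, time_error_ns, direction):
    """Return RX - TX corrected by the measured clock offset (all values in ns).

    Assumption: time_error_ns = slave clock - master clock.
    direction is "ms" (master to slave) or "sm" (slave to master).
    """
    raw = ts_rx_ns - ts_tx_ns
    if direction == "ms":
        # RX is stamped by the slave clock, so the raw value contains +time_error.
        return raw - time_error_ns
    if direction == "sm":
        # RX is stamped by the master clock, so the raw value contains -time_error.
        return raw + time_error_ns
    raise ValueError("direction must be 'ms' or 'sm'")


# Hypothetical scenario: true delay is 5 ms each way, slave clock is 6 ms ahead.
TRUE_DELAY_NS = 5_000_000
TIME_ERROR_NS = 6_000_000

# Downlink: sent at master time 0, received at master time 5 ms = slave time 11 ms.
raw_down = 11_000_000 - 0
down = adjust_one_way_latency(0, 11_000_000, TIME_ERROR_NS, "ms")

# Uplink: sent at slave time 14 ms (master time 8 ms), received at master time 13 ms.
raw_up = 13_000_000 - 14_000_000
up = adjust_one_way_latency(14_000_000, 13_000_000, TIME_ERROR_NS, "sm")

print(raw_down, raw_up)  # raw uplink is negative although the true delay is positive
print(down, up)          # both corrected values equal the true delay
```

In this toy case the raw uplink reading is negative purely because the offset exceeds the path delay, and the correction recovers the same delay in both directions. It only works if the Time Error sample is taken close enough in time to the packet it corrects.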

We still see those large dips in latency. Just to be clear, the reason I want to measure uni-directional latency is that I want to measure the delay asymmetry as a function of time. But I can't trust these measurements, as they don't look right.

For reference, the command I'm using for isochron on the sender side in this example is:

sudo isochron send -i enp2s0 -s 64 --client 10.10.10.2 -c 1.0 -t 1 -w 1.0 -F isochron.dat -n 300 -o -O 37 --cpu-mask $((1 << 1)) -4 -J 10.10.10.2 -S 0.0

For practical reasons, I can't use VLAN interfaces, so isochron doesn't have its own traffic class.

vladimiroltean commented 11 months ago

What synchronization protocol is there between the master and slave? PTP?

vladimiroltean commented 11 months ago

The hardware timestamps are taken on radio interfaces? Do you know what is the timestamping point for these packets, and if that could explain a fundamental delay asymmetry?

stefan-ramdhan commented 11 months ago

Yes, I am using ptp4l and phc2sys.

The hardware timestamps are taken on Intel I210 NICs. Effectively, the 5G system is completely unaware that it is transmitting PTP packets. The master is a standard Ubuntu PC with an Intel I210 NIC. The slave is the same. The NICs connect to a modem which is the interface to the 5G system, but the timestamping happens at the NIC.

There is a fundamental delay asymmetry in the 5G system; I'm just trying to measure it. meanPathDelay from linuxptp is the average of the master->slave and slave->master path delays. I want to find the individual path delays, master->slave and then slave->master, using isochron. The master->slave latency measurement I'm taking with isochron looks very similar to the meanPathDelay from linuxptp.

Here is linuxptp's meanPathDelay:

image

and here is isochron master->slave latency:

image

Does isochron adjust any of its values based on what the offset between master/slave is? I'm guessing even if it did, it can't do that when I use the -o flag. So, is my assumption that I need to adjust isochron measurements by the Time Error correct?

Also, I tried doing this over a simple wireline network (no 5G, just the master directly connected to a slave over Ethernet). I'm seeing negative measurements there too. This time the trend looks reasonable, but latency obviously can't be negative:

image

Weirdly, I'm also seeing negative meanPathDelay measurements coming from linuxptp:

image

This might be an issue with the igb driver. Any ideas what might be causing negative/bizarre latencies?

vladimiroltean commented 11 months ago

Does isochron adjust any of its values based on what the offset between master/slave is? I'm guessing even if it did, it can't do that when I use the -o flag. So, is my assumption that I need to adjust isochron measurements by the Time Error correct?

No adjustment being done, and no adjustment should be necessary. The RX - TX should be the one-way path delay.

The igb driver however does adjust the reported timestamps. For TX: https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_ptp.c#L959 And for RX: https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_ptp.c#L1024

stefan-ramdhan commented 11 months ago

No adjustment being done, and no adjustment should be necessary. The RX - TX should be the one-way path delay.

Isn't that only true when the two devices are synchronized to 0 ns (i.e. the master and slave are on the exact same time base)? Realistically, downlink rx-tx = offset + path_delay, so in order to get the path delay you have to subtract the current offset, especially in a network where ptp4l isn't able to synchronize the devices very well.

Reading the igb_ptp code, it looks like they just adjust it to account for the MAC to PHY delay? (amount of time it takes for the frame to be sent out onto the wire.) I could be wrong. I have seen online that some people have encountered negative path delays while using the igb driver.

Nonetheless, I'm not sure the adjustment made by the igb driver is what causes those drastic changes in the latency measured by isochron, but it's possible.

vladimiroltean commented 11 months ago

Not for the drastic fluctuations, no (since the adjustment value is constant, the error comes from somewhere else). Just for the negative part of it.

To see how, let's refer to any diagram of the Pdelay calculation, and apply (asymmetric) timestamp corrections: https://blog.meinbergglobal.com/wp-content/uploads/2013/09/peer-to-peer-messages1-1.jpg

((t2 - t1) + (t4 - t3)) / 2 = Pdelay_1
(t2 + B_rx - t1 - A_tx + t4 + A_rx - t3 - B_tx) / 2 = Pdelay_2

Pdelay_2 - Pdelay_1 = (B_rx - A_tx + A_rx - B_tx) / 2

Assuming both stations are the same (i210): A_tx = B_tx, and A_rx = B_rx, so

Pdelay_2 - Pdelay_1 = A_rx - A_tx

In our case, A_rx is IGB_I210_RX_LATENCY_1000 (448) and A_tx is IGB_I210_TX_LATENCY_1000 (178). So Pdelay_2 - Pdelay_1 is 270 ns.

If the uncorrected Pdelay is smaller than 270 ns, then the corrected one can as well be negative.
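
The arithmetic above can be checked with a short Python sketch. It follows the sign convention of the formulas as written in this comment, with each correction as an additive term on its timestamp. Whether the driver actually adds or subtracts each value is not modeled here, so pass a negative number for a correction that is subtracted. The function names and the raw timestamps are hypothetical; only 448 and 178 come from the thread.

```python
def pdelay(t1, t2, t3, t4):
    """Peer delay from the four Pdelay timestamps (ns)."""
    return ((t2 - t1) + (t4 - t3)) / 2


def corrected_pdelay(t1, t2, t3, t4, a_tx, a_rx, b_tx, b_rx):
    """Peer delay with one additive correction per timestamp.

    Station A stamps t1 (TX) and t4 (RX); station B stamps t2 (RX) and t3 (TX).
    """
    return pdelay(t1 + a_tx, t2 + b_rx, t3 + b_tx, t4 + a_rx)


# Values quoted in this thread for the i210 at 1000 Mbps.
RX_LATENCY_NS = 448
TX_LATENCY_NS = 178

# Hypothetical raw timestamps: 100 ns each way, 1000 ns turnaround at station B.
t1, t2, t3, t4 = 0, 100, 1100, 1200

p1 = pdelay(t1, t2, t3, t4)
p2 = corrected_pdelay(t1, t2, t3, t4,
                      TX_LATENCY_NS, RX_LATENCY_NS, TX_LATENCY_NS, RX_LATENCY_NS)
shift = p2 - p1  # equals A_rx - A_tx when both stations are identical

# Same magnitudes with the opposite sign on every correction.
p3 = corrected_pdelay(t1, t2, t3, t4,
                      -TX_LATENCY_NS, -RX_LATENCY_NS, -TX_LATENCY_NS, -RX_LATENCY_NS)

print(p1, p2, shift, p3)
```

With the corrections entered as positive terms the result moves up by 270 ns. With the signs reversed it moves down by the same amount, and a raw 100 ns delay comes out negative. Either way the shift is a constant, so it cannot explain fluctuations over time.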

vladimiroltean commented 11 months ago

Isn't that only true when the two devices are synchronized to 0 ns?

Yeah, I was assuming precise sync.

stefan-ramdhan commented 11 months ago

If the uncorrected Pdelay is smaller than 270 ns, then the corrected one can as well be negative.

Makes sense. So, there's still the drastic fluctuations I'm seeing in the uplink latency. Weirdly enough, this doesn't always happen. Sometimes, I see completely reasonable uplink and downlink latencies, but this is causing me to doubt whether those measurements are legitimate (before and after adjustment by Time Error).

I do think it's worth noting that I am also measuring latency in 3 ways. First, I am measuring pdelay from linuxptp (that's the average of uplink and downlink latencies), uplink latency from isochron, and downlink latency from isochron.

The downlink latency from isochron, pdelay, and PPS are nearly identical to one another in terms of trend. The confounding variable here is the fact that they are all based on the PTP HW Clock (PHC). I would expect the average of the adjusted uplink latency and the adjusted downlink latency to yield the linuxptp pdelay value, or something near it. But because the downlink latency is always drastically different from the uplink latency, the average of the two is nowhere near the linuxptp pdelay. Because of the asymmetry inherent in 5G networks, the uplink and downlink are fully expected to differ, but their average should still be the pdelay.
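
The consistency check described here (the average of the two one-way latencies should match pdelay) can be sketched as follows. The function names, tolerance, and sample values are hypothetical. The sketch assumes the uplink and downlink samples are taken close enough in time that the clock offset is the same for both. In that case the offset enters the raw downlink and raw uplink values with opposite signs and cancels in the mean.

```python
def mean_and_asymmetry(downlink_ns, uplink_ns):
    """Return (mean path delay, half the downlink-uplink difference), in ns."""
    mean = (downlink_ns + uplink_ns) / 2
    asymmetry = (downlink_ns - uplink_ns) / 2
    return mean, asymmetry


def consistent_with_pdelay(downlink_ns, uplink_ns, pdelay_ns, tolerance_ns):
    """True if the mean of the two one-way latencies is within tolerance of pdelay."""
    mean, _ = mean_and_asymmetry(downlink_ns, uplink_ns)
    return abs(mean - pdelay_ns) <= tolerance_ns


# Hypothetical sample: 7 ms downlink, 3 ms uplink, pdelay reported as 5.1 ms.
mean, asym = mean_and_asymmetry(7_000_000, 3_000_000)
ok = consistent_with_pdelay(7_000_000, 3_000_000, 5_100_000, tolerance_ns=200_000)

# Adding the same clock offset with opposite signs leaves the mean unchanged.
offset = 2_500_000
mean_raw, _ = mean_and_asymmetry(7_000_000 + offset, 3_000_000 - offset)

print(mean, asym, ok, mean_raw)
```

If the mean of the raw (unadjusted) values is already far from pdelay, the mismatch is unlikely to come from the Time Error adjustment alone.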

This is probably out of your area of expertise, so my main concern directed toward you is: do you have a theory as to why the uplink latency measured by isochron seems so wildly incorrect (i.e. negative values after adjustment, and drastic fluctuations)?

stefan-ramdhan commented 10 months ago

I mentioned earlier how I can't use VLAN interfaces due to limitations with my network, which prevents me from using the TAPRIO traffic shaper. Would this make my results meaningless, since isochron expects to have a different traffic class than PTP? Since I'm using this over a 5G network, the HW Timestamp latency is really the only thing I care much about. MAC latencies are negligible compared to the latency over the air.

What I'm doing right now is running isochron send and receive on both machines, and calculating the latency between HW Timestamps using R - T on both machines.

vladimiroltean commented 10 months ago

I'm thinking the apparent negative latency can be caused by PTP, which expects the path delay to be symmetric. But do you actually need to run PTP? You have that out-of-band mechanism of measuring the time error between the uncorrected clocks through PPS, so just use isochron with --omit-sync and collect raw timestamps, then correct one of them using the interpolated PPS time error at that point. The remaining component should be the unidirectional path delay.

Correlating the 2 streams of data might be a problem.
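
A minimal sketch of that correction step, assuming the PPS time error samples and the isochron receive timestamps can be put on a common time axis (which is exactly the correlation problem mentioned above). All names and numbers are hypothetical. The time error is assumed to mean (receiver clock - sender clock), and the sample times are assumed to be strictly increasing.

```python
from bisect import bisect_right


def interpolate_time_error(pps_times_ns, pps_errors_ns, t_ns):
    """Linearly interpolate the PPS time error at t_ns.

    Outside the sampled range, the nearest sample is returned.
    """
    if t_ns <= pps_times_ns[0]:
        return pps_errors_ns[0]
    if t_ns >= pps_times_ns[-1]:
        return pps_errors_ns[-1]
    i = bisect_right(pps_times_ns, t_ns)  # first sample strictly after t_ns
    t0, t1 = pps_times_ns[i - 1], pps_times_ns[i]
    e0, e1 = pps_errors_ns[i - 1], pps_errors_ns[i]
    return e0 + (e1 - e0) * (t_ns - t0) / (t1 - t0)


def corrected_one_way_delay(ts_tx_ns, ts_rx_ns, pps_times_ns, pps_errors_ns):
    """RX - TX minus the time error interpolated at the receive timestamp."""
    time_error = interpolate_time_error(pps_times_ns, pps_errors_ns, ts_rx_ns)
    return ts_rx_ns - time_error - ts_tx_ns


# Hypothetical PPS samples, one per second: (time, receiver - sender offset) in ns.
pps_times = [0, 1_000_000_000, 2_000_000_000]
pps_errors = [1_000, 3_000, 2_000]

te_mid = interpolate_time_error(pps_times, pps_errors, 500_000_000)
delay = corrected_one_way_delay(400_000_000, 500_000_000, pps_times, pps_errors)

print(te_mid, delay)
```

Linear interpolation is only one choice. With one PPS sample per second, any offset movement faster than that is invisible to this correction.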

vladimiroltean commented 10 months ago

Would this make my results meaningless, since isochron expects to have a different traffic class than PTP?

It depends on what you want to measure. The program doesn't necessarily expect to run alone on a traffic class.

vladimiroltean commented 10 months ago

do you have a theory as to why the uplink latency measured by isochron seems so wildly incorrect (i.e. negative values after adjustment, and drastic fluctuations)?

I am absolutely confused by the latency figures you've posted being so low (<= 2.5 ns), so there is some physical phenomenon I'm not understanding. But delay asymmetry breaks a lot of the math behind PTP and I simply don't know what to expect.