cnlohr / esp8266rawpackets

Raw Packet Experiments on the ESP8266

Dealing with Clock Wander #2

Open ddrown opened 8 years ago

ddrown commented 8 years ago

Copied from Youtube comment: I took toprecorder/data10.txt data and looked specifically at offset and frequency differences between all the clocks:

Using .241's broadcasts as the "master" clock:

Removing the average frequency differences and the two clock jumps, I get this graph, which shows the clock wander: https://dan.drown.org/clocks/data10.png
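A minimal sketch of that kind of de-trending, assuming the per-packet offsets have already been extracted into arrays (the function and variable names here are hypothetical, not from the actual tooling); the two clock jumps would be handled separately by splitting the series at the jumps before fitting:

```python
import numpy as np

def clock_wander(t_master, offset):
    """Remove the average frequency difference from an offset series.

    t_master: packet times on the .241 "master" clock (seconds)
    offset:   measured local-minus-master offset for each packet (seconds)
    Returns the residual offsets, i.e. the clock wander.
    """
    t = np.asarray(t_master, dtype=float)
    off = np.asarray(offset, dtype=float)
    # Linear fit of offset vs. time: the slope is the average frequency
    # difference, the intercept is the initial phase offset.
    slope, intercept = np.polyfit(t, off, 1)
    return off - (slope * t + intercept)
```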

Maybe insulating the ESP8266s from any airflow would lower their temperature changes, which should lower their clock wander.

Also, maybe using a PID control loop on each node would work to sync the frequencies. This is what I've done with NTP on the esp8266 along those lines: https://github.com/ddrown/Arduino_ClockPID
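For what it's worth, a toy sketch of that idea (a PI loop rather than a full PID, and not the actual Arduino_ClockPID code; the gains and units are made-up values):

```python
class ClockPI:
    """Toy proportional-integral loop that steers a local clock's frequency
    toward a reference, given repeated offset measurements."""

    def __init__(self, kp=0.5, ki=0.05):
        self.kp = kp          # proportional gain (assumed value)
        self.ki = ki          # integral gain (assumed value)
        self.integral = 0.0   # accumulated offset error

    def update(self, offset_ns):
        """offset_ns: measured local-minus-reference offset for this interval.
        Returns a frequency correction (ppb) to apply until the next update."""
        self.integral += offset_ns
        return -(self.kp * offset_ns + self.ki * self.integral)
```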

NTP uses round trip time to try to eliminate the phase offset due to one way latency. I'm not sure that would be needed for this application. Knowing the distances between the fixed points should make it possible to cancel out those terms in the equation.
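As an illustration of that point, assuming the node positions (and so the inter-node distances) are known, the one-way propagation term can simply be computed instead of measured with a round trip; this uses the rough ~1 ft/ns figure mentioned later in the thread, and the names are hypothetical:

```python
NS_PER_FOOT = 1.0  # rough approximation: light travels about 1 ft per ns

def one_way_offset(tx_time_master_ns, rx_time_local_ns, distance_ft):
    """Phase offset of the local clock vs. the master from a single one-way
    packet, using the known distance in place of a round-trip measurement."""
    prop_delay_ns = distance_ft * NS_PER_FOOT
    return rx_time_local_ns - tx_time_master_ns - prop_delay_ns
```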

Lastly, the rx and tx timestamp accuracy will add errors as well, but I haven't measured how accurate they are.

cnlohr commented 8 years ago

I tried simply finding the slew rate over time and then slowly adjusting to back that out, but I don't know if I'm doing it right. Do you think you could generalize your algorithm to work out how all of that goes?

Additionally, can you try verifying the send time using your algorithm, or rather, make sure that the send time doesn't have a great deal of jitter in it? Everything else you have here looks great.

ddrown commented 8 years ago

Ok, I have some updated data here: https://dan.drown.org/clocks/

The data and tools I used are here: https://github.com/ddrown/esp8266rawpackets-proc

Instead of a full PID controller, I'm just calculating rate differences and applying those. The remaining offsets are from one of: receiver jitter, transmitter jitter, or fast clock frequency changes. I'm feeding 32 samples at a time, which works out to about a second and a half worth of data at 22 packets per second.
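Roughly, the per-group rate correction could be sketched like this (hypothetical names, not the actual esp8266rawpackets-proc code):

```python
import numpy as np

def rate_difference(master_ts, local_ts):
    """Estimate the rate (frequency) difference between the local clock and
    the master over one group of samples, as a dimensionless ratio."""
    # Slope of local time vs. master time: 1.0 means the clocks run at the
    # same rate, 1.000001 means the local clock is 1 ppm fast.
    slope, _ = np.polyfit(master_ts, local_ts, 1)
    return slope

def corrected_offsets(master_ts, local_ts, group=32):
    """Apply a rate correction per group of samples and return the residuals,
    which are left over from rx jitter, tx jitter, and fast frequency changes."""
    residuals = []
    for i in range(0, len(master_ts) - group + 1, group):
        m = np.asarray(master_ts[i:i + group], dtype=float)
        l = np.asarray(local_ts[i:i + group], dtype=float)
        rate = rate_difference(m, l)
        residuals.extend(l - l[0] - rate * (m - m[0]))
    return np.array(residuals)
```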

The end result was: 50% of the time, all clocks were within +14ns -7ns (ignoring phase differences due to propagation delay). 98% of the time, all clocks were within +208ns -135ns.

+/-10ns is about +/-10ft, so that might be the accuracy limit.

cnlohr commented 8 years ago

Does the data seem centered around the expected locations of the target ESPs (and the differential receive times, i.e. diagonal nodes having a ~10' difference)? Or is that a coincidence? Additionally, can you zoom in on your last two graphs? The data looks /really/ good! It looks like, given enough data, it should center around the expected locations.

cnlohr commented 8 years ago

I just can't get over how good those last few graphs look, and really hope to be able to zoom in on them!

ddrown commented 8 years ago

Ok, I added a second series of graphs showing the 250ns..-250ns range. I also added a histogram series. - https://dan.drown.org/clocks/

Clock sync has two pieces: phase and frequency. This is just the frequency part, the phase differences aren't handled yet.

cnlohr commented 8 years ago

EDITED

Hmm... your results are much, much better than mine. I don't know how you got everything to match the skew so well. Considering light travels at ~1 ft/ns (which is why I use feet for this sort of thing), those results look /really/ good. What do you suppose causes the periodic groupings of several like packets? In all of my analysis I was seeing random meandering and many, many outliers. You still have outliers, but you also seem to have bunches of data within the 99th percentile yet outside the 25th percentile. Any idea what to attribute that bunching to?

I really can't wait to see what happens when you do start to correlate this, i.e. use each node as a master and start to correlate the time differences. Actually... that would give you a better time density, so there would be less drift/shift between time syncs. Right now it's 30-50 ms between packets being sent; if you use all the nodes' tx's, it could go down to ~10 ms between syncs to arbitrary nodes. I wonder if that would be much better?

Charles

ddrown commented 8 years ago

The grouping/bunching is probably an artifact of how I'm doing clock sync. I'm not limiting changes from one group of 32 to the next, so a high/low average can throw the whole group off.

The next thing I want to do is apply this clock sync to the data from the other transmitters and see if those offsets are the expected values. The change in distance should show as a straight translation up or down on these graphs (but remain as a straight horizontal line).

cnlohr commented 8 years ago

That would be awesome. Any way you can "window" the groups, i.e. each one calculates over the next 32? And if you get outlier syncs, throw them out? But yes! Keep going!
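Something along those lines, as a purely illustrative sketch (the names and the 3-sigma cut are assumptions, not anything from this thread):

```python
import numpy as np

def windowed_rate(master_ts, local_ts, window=32, sigma=3.0):
    """For each position, fit the clock rate over the next `window` samples,
    discarding samples more than `sigma` standard deviations from the group's
    linear fit before re-fitting."""
    m = np.asarray(master_ts, dtype=float)
    l = np.asarray(local_ts, dtype=float)
    rates = []
    for i in range(len(m) - window + 1):
        mw, lw = m[i:i + window], l[i:i + window]
        slope, intercept = np.polyfit(mw, lw, 1)
        resid = lw - (slope * mw + intercept)
        keep = np.abs(resid) <= sigma * resid.std()
        if keep.sum() >= 2:            # need at least 2 points to re-fit
            slope, _ = np.polyfit(mw[keep], lw[keep], 1)
        rates.append(slope)
    return np.array(rates)
```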

NeuralSpaz commented 8 years ago

You guys rock, I like your stuff. Thought I would leave this here. No time to code it up at the moment, but these analysis techniques would be applicable, if not just interesting reads. It might be even better if applied with some clustering and/or a Kalman filter on the estimated position/clock drift.

https://www.cs.umd.edu/class/spring2010/cmsc818g/slides/2010-03-25-TimeBasedLocation.pdf
http://kilyos.ee.bilkent.edu.tr/~gezici/papers/2013_TCOM.pdf

ddrown commented 8 years ago

Ok, here's another set of graphs: https://dan.drown.org/clocks/index2.html

I used the time and frequency data from the first set, which is using .241 as the phase and frequency reference. I applied those corrections to each module's local clock and calculated the offsets of the other transmitters.

An interesting pattern shows up in this data: .241's offsets are around 38 microseconds higher (twice as large) than the other modules'. I believe this is due to the tx and rx delays.

The local timestamps on each module are relative to:

.241 = 0
.179 = .241 + 25ns + txdelay + rxdelay
.147 = .241 + 25ns + txdelay + rxdelay
.213 = .241 + 26.925ns + txdelay + rxdelay
.169 = .241 + 35.355ns + txdelay + rxdelay

The rebroadcast timestamps (these graphs) can be calculated as: tx_timeref + rf_delay + txdelay + rxdelay - rx_timeref

So, for the .179 transmitter this looks like:

.179->.241 = (.241 + 25ns + txdelay + rxdelay) + 25ns + txdelay + rxdelay - 0
.179->.169 = (.241 + 25ns + txdelay + rxdelay) + 25ns + txdelay + rxdelay - (.241 + 35.355ns + txdelay + rxdelay)
.179->.147 = (.241 + 25ns + txdelay + rxdelay) + 35.355ns + txdelay + rxdelay - (.241 + 25ns + txdelay + rxdelay)
.179->.213 = (.241 + 25ns + txdelay + rxdelay) + 37.165ns + txdelay + rxdelay - (.241 + 26.925ns + txdelay + rxdelay)

This leads to .179->.241 having 2 * (txdelay + rxdelay) while the other paths cancel out one set of txdelay+rxdelay (on average as txdelay + rxdelay isn't a static number).

So I believe txdelay + rxdelay ~= 38 microseconds
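A quick numeric sanity check of that cancellation, mirroring the formulas above (the 38 µs value folded into `TXRX` and the variable names are assumptions for illustration only):

```python
TXRX = 38_000.0  # assumed txdelay + rxdelay, in ns (~38 us)

# Local-clock origins of each module relative to .241, per the list above.
# Each non-master module's clock already carries one txdelay+rxdelay from
# syncing against .241's broadcasts.
local_ref = {
    ".241": 0.0,
    ".179": 25.0 + TXRX,
    ".147": 25.0 + TXRX,
    ".213": 26.925 + TXRX,
    ".169": 35.355 + TXRX,
}

def rebroadcast_offset(tx, rx, rf_delay_ns):
    """tx_timeref + rf_delay + txdelay + rxdelay - rx_timeref"""
    return local_ref[tx] + rf_delay_ns + TXRX - local_ref[rx]

# .179 -> .241 keeps two sets of txdelay+rxdelay ...
print(rebroadcast_offset(".179", ".241", 25.0))   # ~ 50 ns + 2*TXRX
# ... while the other paths cancel one set (on average).
print(rebroadcast_offset(".179", ".169", 25.0))   # ~ 14.6 ns + TXRX
```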

cnlohr commented 8 years ago

I can believe that's about the right number. My fear is that tx can't be trusted AT ALL. It sounds like you've confirmed those fears.

cnlohr commented 8 years ago

I am bookmarking it and will read it more tomorrow.