catapult-project / catapult

Deprecated Catapult GitHub. Please use the http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

StartSlice of some flow events is wrong because of floating point resolution in JavaScript #1046

Closed catapult-bot closed 9 years ago

catapult-bot commented 9 years ago

Issue by yuhao Wednesday Jun 24, 2015 at 00:44 GMT Originally opened as https://github.com/google/trace-viewer/issues/1046


Trace: https://github.com/natduca/measurmt-traces/blob/master/mobile/googlemaps_ipc_newbinding.html

The Scheduler::ScheduleBeginImplFrameDeadline slice that starts at 4,315.215 ms should have an outgoing PostTask flow event. However, the start slice of that flow event currently resolves to the top-level MessageLoop::RunTask.

The reason is that the end timestamp of Scheduler::BeginFrame (which is a subslice of the top-level MessageLoop::RunTask) is miscalculated. The start timestamp is 326461495.395 and the duration is 0.122, so the end timestamp should be 326461495.517; instead it is computed as 326461495.51699996, which makes BeginFrame appear to end before the flow event's start timestamp.
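
For reference, the mismatch is easy to reproduce in a JavaScript console with the numbers above:

```js
// Reproducing the mismatch with the values reported above (in milliseconds).
const start = 326461495.395;    // Scheduler::BeginFrame start timestamp
const duration = 0.122;         // its reported duration
console.log(start + duration);                     // 326461495.51699996
console.log(start + duration === 326461495.517);   // false
```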

@natduca @dj2

catapult-bot commented 9 years ago

Comment by randomascii Wednesday Jun 24, 2015 at 00:54 GMT


It's not "mis-calculated as 326461495.51699996". Instead, that is the closest double to the requested value. Doubles are accurate to about one part in 10^15, and cannot exactly display most fractions. Only multiples of powers of two (1/4, 1/64th, etc.) can be exactly stored. 0.1 cannot be, for instance.

Ideally the same calculation should be used to calculate the point in both contexts (end timestamp and flow-event). If that is not practical then a small epsilon is needed - but use with caution.

Or, convert everything to integers (multiply by an appropriate number) and then do the math in doubles or int64.

The precision will be much better if the numbers are smaller. Why are the numbers in the 326 million range?

catapult-bot commented 9 years ago

Comment by yuhao Wednesday Jun 24, 2015 at 00:55 GMT


Indeed I could temporarily get around that by calculating the end timestamp as (this.start * 1000 + this.duration * 1000) / 1000.
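
With the numbers from this bug, that workaround does produce the expected value, because both products land exactly on integer microsecond counts:

```js
const start = 326461495.395, duration = 0.122;   // ms, as parsed from the trace
console.log(start * 1000);                        // 326461495395 -- an exact integer
console.log(duration * 1000);                     // 122
console.log((start * 1000 + duration * 1000) / 1000 === 326461495.517);   // true
```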

catapult-bot commented 9 years ago

Comment by yuhao Wednesday Jun 24, 2015 at 00:56 GMT


The flow-event timestamp wasn't calculated, but provided in the trace; however, the end timestamp of a slice was calculated by doing start + duration.

catapult-bot commented 9 years ago

Comment by yuhao Wednesday Jun 24, 2015 at 00:57 GMT


Those numbers are in the range of 326 million as a direct result of Chrome's timestamp system, I think.

catapult-bot commented 9 years ago

Comment by randomascii Wednesday Jun 24, 2015 at 01:12 GMT


Well, as long as we use doubles we still have lots of precision. 326 million uses up 29 bits, which still leaves 23 bits for the fractional part -- which is lots.

A small epsilon may be necessary. If the numbers are printed to 0.001 precision then an epsilon of 0.001 may be needed. But beware, that way leads to dragons.

catapult-bot commented 9 years ago

Comment by yuhao Friday Jun 26, 2015 at 03:39 GMT


Another similar issue is that when we call endSlice, we calculate the duration of a slice by doing slice.duration = ts - slice.start, which makes the duration inexact. Later on when we call the getter to get the end timestamp, it becomes inexact also.
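
Concretely, with the timestamps from this trace:

```js
// The computed duration picks up the parse error of both endpoints,
// so it is no longer the 0.122 that was recorded in the trace.
const start = 326461495.395, ts = 326461495.517;   // ms
const duration = ts - start;                        // endSlice-style calculation
console.log(duration);                              // 0.12200003862380981
```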

I was wondering why we always divide the original timestamp by 1000 before storing it. It's obviously a unit conversion, but are we worried about overflow?

catapult-bot commented 9 years ago

Comment by randomascii Friday Jun 26, 2015 at 18:34 GMT


I double checked my math and here is what is actually happening:

.122 also can't be represented exactly but its error is shifted much further down and we can ignore it.

When .122 is added to 326461495.395 the rounding works out such that the result is slightly less than the desired answer instead of slightly greater. The two doubles are adjacent -- off by one unit in the last place (ULP).

The best fix is to acknowledge that all of the numbers are supposed to have micro-second (???) precision, and round all of them to that precision. The numbers coming from the .json are already rounded to that precision. All that is needed is to round the results of additions (time + interval) and subtractions (time - time) to that precision. The answer will not be exactly the desired time, but it will be the closest double to that time, so it will be consistent, which is all that we desire.
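
A minimal sketch of that rounding step, assuming the model keeps times in milliseconds and the underlying data has microsecond precision (the helper name is illustrative, not existing trace-viewer code):

```js
// Round a millisecond value to the nearest microsecond so that computed
// times land on the same doubles as values parsed directly from the trace.
function roundToMicros(ms) {
  return Math.round(ms * 1000) / 1000;
}

const start = 326461495.395, duration = 0.122;
console.log(start + duration === 326461495.517);                  // false
console.log(roundToMicros(start + duration) === 326461495.517);   // true
```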

Using an epsilon in the comparisons (a single ULP will do in this case) is also possible, but we may encounter a case where a series of operations leads to slightly greater error, so this answer feels less sustainable than rounding to exactly the desired precision after time calculations.

I feel duty bound to link to my favorite (okay, I wrote it) article on comparing floats: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/

This is my quick-and-dirty program for checking what is going on. I paused in the debugger to examine the doubles and their underlying integer representations:

```cpp
union double_t { double d; unsigned long long i; };

int _tmain(int argc, _TCHAR* argv[]) {
  double_t x; x.d = 326461495.395;
  double_t y; y.d = 326461495.517;
  double_t z; z.d = x.d + .122;
  double_t d; d.d = .122;
  return 0;
}
```
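
A rough JavaScript equivalent (an illustrative sketch, not trace-viewer code) that inspects the underlying bit patterns directly:

```js
// View each double's 64-bit pattern to confirm that the parsed value and the
// computed sum are adjacent doubles -- off by exactly one unit in the last place.
function bitsOf(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0);
}

const parsed = bitsOf(326461495.517);
const computed = bitsOf(326461495.395 + 0.122);
console.log(parsed - computed);   // 1n
```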

catapult-bot commented 9 years ago

Comment by natduca Tuesday Jun 30, 2015 at 17:30 GMT


I think our direction here is to come up with a new flow event api that is based on explicit id binding, instead of the timestamp binding. @vmpstr @yuhao is that correct? If so, what should we do with this bug?

catapult-bot commented 9 years ago

Comment by yuhao Tuesday Jun 30, 2015 at 18:38 GMT


Yes, but that would not solve the issue that the end timestamps of some slices are calculated imprecisely because of how we compute them, as I mentioned two posts earlier.

Is there a fundamental reason we convert the timestamp to microseconds first before we do any timing calculation?

But from the perspective of correctly binding flow events, this doesn't really matter too much. I just worry that this might lead to issues in the future.

catapult-bot commented 9 years ago

Comment by natduca Wednesday Jul 01, 2015 at 01:26 GMT


First, I'm confused. We don't convert to microseconds. The model is in milliseconds.

We get timestamps in as micros, and the first thing we do is convert to milliseconds. Then, we do our grouping logic. We shift to zero at the end of import.

I have proposed that we change this computation around in #856. With that change done [which is real work] we would go from micros to zero-based ms immediately, and then do grouping logic in milliseconds. I believe this has better precision properties, no?
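
One possible reading of that ordering, sketched with illustrative names (the actual import code in trace-viewer is more involved):

```js
// Subtract the large offset while timestamps are still integer microseconds
// (exact), then convert to milliseconds (a single rounding per value).
function importTimestampsUs(timestampsUs) {
  const originUs = Math.min(...timestampsUs);
  return timestampsUs.map(us => (us - originUs) / 1000);
}

console.log(importTimestampsUs([326461495395, 326461495517]));   // [ 0, 0.122 ]
```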

catapult-bot commented 9 years ago

Comment by randomascii Wednesday Jul 01, 2015 at 18:22 GMT


Converting to zero-based (subtracting off the large offset, 326461495 in the examples above) does give more precision for subsequent calculations. However, this will not resolve the issue above. The problem is that x and y round in different directions in the sample code above. That rounding happens when converting from text to double and is unavoidable. Since the subtraction of 326461495 happens after that rounding, it is too late.

In fact, one impact of subtracting 326461495 is that it makes the rounding error more prominent. It is initially about one part in 10^16. After subtracting to make it zero-based the error is more like one part in 10^7! The absolute error is unchanged at ~3.8e-8. I've verified all this by modifying the test program I posted above.
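
A quick check of that, using the start timestamp from this bug:

```js
// The subtraction itself is exact here, but the value it exposes already
// carries the error introduced when the text was parsed into a double.
const parsed = 326461495.395;           // stored as ~326461495.39499998
const zeroBased = parsed - 326461495;
console.log(zeroBased);                 // 0.39499998092651367, not 0.395
// The absolute error is unchanged, but relative to 0.395 it is now roughly
// one part in 10^7 instead of one part in 10^16.
```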

The increased error ratio isn't necessarily a problem, but it does show that zero-based is not a panacea.

Rounding to the nearest millisecond or microsecond or nanosecond or something after each addition or subtraction should work. That should probably be done after changing to zero-based to avoid propagating the one part in 10^7 error.

Note that if we want to support nanosecond resolution we need to make sure that our starting point is never too large. To have nanosecond precision we need to be sure that the times we store are never more than 2^23 seconds (about 8 million seconds). This is true whether our units are seconds/ms/us/ns. Beyond 8 million seconds our precision will gradually degrade below the ns level and avoiding rounding errors will get harder.
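
A back-of-the-envelope check of that bound:

```js
// A double has 53 significand bits; nanosecond resolution needs about 30 bits
// below one second (2^-30 s ≈ 0.93 ns), leaving roughly 53 - 30 = 23 bits of
// whole seconds.
console.log(2 ** 23);                  // 8388608 seconds, about 8 million
console.log(2 ** 23 / (3600 * 24));    // ≈ 97 days of trace time
```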