apache / trafficcontrol

Apache Traffic Control is an Open Source implementation of a Content Delivery Network
https://trafficcontrol.apache.org/
Apache License 2.0

TM integration tests random enough to fail sometimes #5975

Open ocket8888 opened 3 years ago

ocket8888 commented 3 years ago

I'm submitting a ...

Traffic Control components affected ...

Current behavior:

The Traffic Monitor integration tests include a check of its /api/bandwidth-kbps API endpoint. This endpoint returns the sum of the bandwidths of the polled cache servers as of the last poll. The endpoint is not mocked; its data comes from the testcaches/fakesrvr tool that populates the mock ATS caches used by the tests. The bandwidth data is calculated by dividing the difference between the current and last-measured value of a field returned by an astats (or stats_over_http) request by the amount of time that passed between polls, multiplied by a constant of proportionality. Thus we have

$$\text{bandwidth} = k \sum_{n=1}^{N} \frac{x_n - x_n'}{t_n}$$

where $N$ is the number of servers polled, $t_n$ is the time elapsed between polls for cache $n$, $x_n$ is the current value of the astats/stats field, $x_n'$ is the last-measured value of said field, and $k$ is a proportionality constant.
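To make that concrete, here is a minimal Go sketch of the calculation as just described. The type and function names are made up for illustration and don't reflect Traffic Monitor's actual code:

```go
package main

import "fmt"

// pollSample is a hypothetical pair of readings for one cache server: the
// current and previous values of the raw astats/stats_over_http counter,
// plus the seconds elapsed between the two polls. The real Traffic Monitor
// types are different; this only mirrors the formula above.
type pollSample struct {
	current, previous float64 // x_n and x_n'
	elapsedSec        float64 // t_n
}

// reportedKbps applies the formula above: k * sum((x_n - x_n') / t_n),
// with the proportionality constant k taken to be 125.
func reportedKbps(samples []pollSample) float64 {
	const k = 125.0
	total := 0.0
	for _, s := range samples {
		total += (s.current - s.previous) / s.elapsedSec
	}
	return k * total
}

func main() {
	// Two caches that each added 147 to the counter over a 6-second poll
	// (an average of 24.5 per second).
	samples := []pollSample{
		{current: 147, previous: 0, elapsedSec: 6},
		{current: 147, previous: 0, elapsedSec: 6},
	}
	fmt.Println(reportedKbps(samples)) // prints 6125
}
```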

The field being measured is calculated from a "/proc/net/dev" line, which essentially boils down to this: the field is initially zero, but every second that passes, the fake server adds a random amount, drawn from a certain interval, for each configured "remap". The random interval itself is defined by the number of "remaps". Specifically, the interval minimum is always hard-coded to 0, but the maximum for the $i$th "remap" $r_i$ is given by

*(formula from the fakesrvr source giving $\max(r_i)$ in terms of $i$ and $N_r$)*

where $N_r$ is the total number of "remaps", giving the upper bound of the full addition per second to $x_n$ as:

$$\sum_{i=1}^{N_r} \max(r_i)$$

The number of remaps used in the GHA is hard-coded to 2, so this can be simplified:

$$\sum_{i=1}^{2} \max(r_i) = 50$$

So basically, as $t$ is in seconds, this adds a random value on $[0, 50)$ every second to the "outBytes" used to determine bandwidth. The polling interval for Traffic Monitor in these tests is 6 seconds, so we can reasonably approximate that $x_n' = x_n(t-6)$ and $t_n = 6$ for all $n$. Averaging those per-second uniform draws over the polling interval then gives us a roughly normal distribution for each cache's rate, with a lower bound of 0, an expectation value of 24.5, and a maximum possible reported value of:

$$49 \cdot N \cdot k = 49 \cdot 2 \cdot 125 = 12250$$

... since $N$ is hard-coded to 2, and the proportionality constant $k$ is the number of bytes in a kilobit, which is 125.
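A rough Go sketch of that increment scheme under the numbers above (the even 25/25 split between the two "remaps" is an assumption made here purely for illustration; the real per-remap maxima live in the fakesrvr source):

```go
package main

import (
	"fmt"
	"math/rand"
)

// remapMaxes holds assumed per-remap upper bounds. The write-up above only
// pins down that the two per-second draws together land on [0, 50); an even
// 25/25 split is a guess for illustration, not the actual fakesrvr values.
var remapMaxes = []int64{25, 25}

// tickOutBytes models one second of the fake server: each "remap" adds a
// random amount on [0, max) to the counter reported as "outBytes".
func tickOutBytes(outBytes int64, rng *rand.Rand) int64 {
	for _, m := range remapMaxes {
		outBytes += rng.Int63n(m)
	}
	return outBytes
}

func main() {
	rng := rand.New(rand.NewSource(1))
	var outBytes int64
	for sec := 0; sec < 6; sec++ { // one 6-second Traffic Monitor poll
		outBytes = tickOutBytes(outBytes, rng)
	}
	// Approximate reported value, assuming a second cache behaves the same:
	// k * N * (delta / t) with k = 125, N = 2, t = 6.
	fmt.Println(125.0 * 2 * float64(outBytes) / 6.0)
}
```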

The test in kbps_test.go checks that the value received is between 5000 and 20000, which corresponds to an emitted rate on the interval from 20 to 49 (the upper bound of the check actually exceeds the upper bound of possible values). That puts the check's lower bound near the 25th percentile of a roughly normal distribution (source: https://www.wolframalpha.com/input/?i=normal+distribution+mean%3D24.5+standard+deviation%3D4.94) with a mean of 24.5, although I'm not good at finding standard deviations or whatever, so that might not be exactly right. The point is, the probability that the test will just randomly fail is not statistically insignificant.
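For a rough sense of how often that happens, here is a Monte Carlo sketch under the same simplified model; whatever number it prints reflects only these approximations, not the real fakesrvr increment distribution:

```go
package main

import (
	"fmt"
	"math/rand"
)

// A rough Monte Carlo estimate of how often the reported value lands outside
// the test's [5000, 20000] window, under the simplified model above: each of
// 2 caches adds a uniform integer on [0, 50) to its counter every second,
// Traffic Monitor polls every 6 seconds, and the reported value is
// k * sum(delta/6) with k = 125. These are this write-up's approximations,
// not the real fakesrvr behavior.
func main() {
	const trials = 1_000_000
	rng := rand.New(rand.NewSource(42))
	failures := 0
	for i := 0; i < trials; i++ {
		kbps := 0.0
		for cache := 0; cache < 2; cache++ {
			var delta int64
			for sec := 0; sec < 6; sec++ {
				delta += rng.Int63n(50)
			}
			kbps += 125.0 * float64(delta) / 6.0
		}
		if kbps < 5000 || kbps > 20000 {
			failures++
		}
	}
	fmt.Printf("estimated flake rate under this model: %.2f%%\n",
		100*float64(failures)/float64(trials))
}
```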

Expected behavior:

Tests should not rely on checking that extremely random data falls within an expected range. The test should figure out what it's actually testing (marshalling the data? accurate reporting of known data?) and test exactly that, to avoid random failures.
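As a sketch of the kind of deterministic check that would avoid this (the names below are hypothetical, not the actual integration test's API), the test could feed the calculation known deltas and assert an exact value:

```go
package kbps_test

import "testing"

// computeKbps stands in for whatever calculation the integration test
// actually exercises; the name and signature here are hypothetical.
func computeKbps(deltaBytes []float64, elapsedSec float64) float64 {
	const k = 125.0
	total := 0.0
	for _, d := range deltaBytes {
		total += d / elapsedSec
	}
	return k * total
}

// With fixed, known inputs the expected value is exact, so the assertion
// cannot flake the way a range check over random data can.
func TestKbpsFromKnownDeltas(t *testing.T) {
	got := computeKbps([]float64{600, 600}, 6)
	want := 25000.0 // 125 * (100 + 100)
	if got != want {
		t.Fatalf("got %v, want %v", got, want)
	}
}
```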

Minimal reproduction of the problem with instructions:

Try running the TM integration tests a bunch of times.

ocket8888 commented 3 years ago

Actually, in order to fall as low as 5000, I think both caches need to report a rate under 20, so I think the real probability of a fake failure drops to between $5^{-2}$ and $4^{-2}$, or between about 5% and 6.25%.

mitchell852 commented 3 years ago

ok, that was a LOT of math. none of which i understood except for this part

> between about 5% and 6.25%.

so basically 1 out of 20 times or so the TM integration tests fail incorrectly...

ocket8888 commented 3 years ago

basically, yeah