Open rob05c opened 6 years ago
I'm assigning myself, but I likely won't have time soon, if anyone else wants to fix this, feel free to take it from me.
@rob05c - can you put a severity label on this?
Update
We have a PR to fix #1738 - PR #3104. But that PR does not fix this bug. The two bugs are:
monitoring.json
data is not snapshotted.monitoring.json
and CRConfig
match.This case is for number 2.
How to fix
The code needs changed to request both the CRConfig and monitoring.json immediately one after the other. But that doesn't completely fix it, because there's still a small race, if someone were to Snapshot in-between the Monitor's requests.
So, the requests actually need to happen in a loop, where each are requested, their timestamps are compared, and if one is older, it's fetched again, in a loop, until both timestamps match.
The pseudocode looks something like:
crConfig = GetCRConfig()
monitoring = GetMonitoring()
for {
if crConfig.SnapshotTime < monitoring.SnapshotTime {
crConfig = GetCRConfig()
continue
}
if monitoring.SnapshotTime < crConfig.SnapshotTime {
monitoring = GetMonitoring()
continue
}
break
}
@rob05c - can this be closed?
No, this isn't fixed. It's still a dangerous race condition in the Monitor.
The related issue https://github.com/apache/trafficcontrol/issues/1738 was fixed. That synchronizes the snapshots themselves in Traffic Ops.
But, even though the Snapshots are created in Traffic Ops at the same time, the Monitor has 2 separate polling threads, which aren't synchronized.
Therefore, for example, if a Profile is renamed, if the Monitor might poll the CRConfig, which has the new name, but it might be several seconds before the Monitor polls the Monitoring.json, which has the polling locations. During that time, the Monitor will be unable to find the polling location URL for the cache with the renamed Profile, and will be unable to poll that cache.
The Monitor's internal polls for those 2 objects need to happen at the same time, to fix this race condition.
The Monitor has two separate polls, to get new
CRConfig
, and to get newmonitoring.json
.Even if you make your changes and snapshot immediately, the Monitor might poll the CRConfig, get those updates, and it might be 5 or 30 seconds later before it polls
monitoring.json
and gets the updates from there.For example, if those changes include renaming a Profile, and the
monitoring.json
is polled first, the Monitor will be missing the old profile and be unable to poll caches with that profile, until it polls the CRConfig again and gets the change.It’s a bug, the Monitor needs to not apply changes from one until it gets the other. Whoever fixes this also needs to determine which should be first (until #1738 is fixed, which will make them the same [I'd suggest fixing #1738 first, but it's likely considerably more work than this]). I think the CRConfig should be first, but I may be wrong, and in fact, there may be no right answer, #1738 may have to be fixed first.
This is related to but distinct from #1738 . Symptoms will often be the result of some combination of both bugs.