apache / trafficcontrol

Apache Traffic Control is an Open Source implementation of a Content Delivery Network
https://trafficcontrol.apache.org/
Apache License 2.0
1.03k stars 339 forks source link

Traffic Monitor CRConfig and monitoring.json polls aren't synchronized #1739

Open rob05c opened 6 years ago

rob05c commented 6 years ago

The Monitor has two separate polls, to get new CRConfig, and to get new monitoring.json.

Even if you make your changes and snapshot immediately, the Monitor might poll the CRConfig, get those updates, and it might be 5 or 30 seconds later before it polls monitoring.json and gets the updates from there.

For example, if those changes include renaming a Profile, and the monitoring.json is polled first, the Monitor will be missing the old profile and be unable to poll caches with that profile, until it polls the CRConfig again and gets the change.

It’s a bug, the Monitor needs to not apply changes from one until it gets the other. Whoever fixes this also needs to determine which should be first (until #1738 is fixed, which will make them the same [I'd suggest fixing #1738 first, but it's likely considerably more work than this]). I think the CRConfig should be first, but I may be wrong, and in fact, there may be no right answer, #1738 may have to be fixed first.

This is related to but distinct from #1738 . Symptoms will often be the result of some combination of both bugs.

rob05c commented 6 years ago

I'm assigning myself, but I likely won't have time soon, if anyone else wants to fix this, feel free to take it from me.

mitchell852 commented 5 years ago

@rob05c - can you put a severity label on this?

rob05c commented 5 years ago

Update

We have a PR to fix #1738 - PR #3104. But that PR does not fix this bug. The two bugs are:

  1. The monitoring.json data is not snapshotted.
  2. The Traffic Monitor doesn't ensure the versions of monitoring.json and CRConfig match.

This case is for number 2.

How to fix

The code needs changed to request both the CRConfig and monitoring.json immediately one after the other. But that doesn't completely fix it, because there's still a small race, if someone were to Snapshot in-between the Monitor's requests.

So, the requests actually need to happen in a loop, where each are requested, their timestamps are compared, and if one is older, it's fetched again, in a loop, until both timestamps match.

The pseudocode looks something like:

crConfig  = GetCRConfig()
monitoring = GetMonitoring()
for {
  if crConfig.SnapshotTime < monitoring.SnapshotTime {
    crConfig = GetCRConfig()
    continue
  }
  if monitoring.SnapshotTime < crConfig.SnapshotTime {
    monitoring = GetMonitoring()
    continue
  }
  break
}
mitchell852 commented 4 years ago

@rob05c - can this be closed?

rob05c commented 4 years ago

No, this isn't fixed. It's still a dangerous race condition in the Monitor.

The related issue https://github.com/apache/trafficcontrol/issues/1738 was fixed. That synchronizes the snapshots themselves in Traffic Ops.

But, even though the Snapshots are created in Traffic Ops at the same time, the Monitor has 2 separate polling threads, which aren't synchronized.

Therefore, for example, if a Profile is renamed, if the Monitor might poll the CRConfig, which has the new name, but it might be several seconds before the Monitor polls the Monitoring.json, which has the polling locations. During that time, the Monitor will be unable to find the polling location URL for the cache with the renamed Profile, and will be unable to poll that cache.

The Monitor's internal polls for those 2 objects need to happen at the same time, to fix this race condition.