cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

{GCBytes,Intent}Age calculations are broken #3234

Closed changrui82 closed 8 years ago

changrui82 commented 8 years ago

In MergeMVCCStats function, we have:

    diffSeconds := nowNanos/1E9 - rs.LastUpdateNanos/1E9
    ms.LastUpdateNanos = nowNanos
    ms.IntentAge += rs.IntentCount * diffSeconds
    ms.GCBytesAge += engine.MVCCComputeGCBytesAge(rs.KeyBytes+rs.ValBytes-rs.LiveBytes, diffSeconds)
    rs.MVCCStats.Add(ms)

And in rs.MVCCStats.Add(ms) we have:

...
    if oms.LastUpdateNanos > ms.LastUpdateNanos {
        ms.LastUpdateNanos = oms.LastUpdateNanos
    }

So when updating ms.LastUpdateNanos, we check whether oms.LastUpdateNanos is greater than ms.LastUpdateNanos, but when calculating diffSeconds we don't perform any such check. (rs.LastUpdateNanos and ms.LastUpdateNanos in rs.MVCCStats.Add(ms) are the same variable.) As a result, I often see a negative diffSeconds in my tests, and then ms.IntentAge becomes negative (it can drop below -10000 on a busy range, because rs.IntentCount * diffSeconds amplifies the negative diffSeconds again and again).

I'm not sure whether this is a real problem, because after a long time the value can turn positive again as elapsed seconds accumulate. But if I want intent GC to start after a short time rather than a very long one (currently 24 hours), it will be an obstacle.

tbg commented 8 years ago

You're right, this isn't good. I think the nowNanos used ultimately comes from a BatchRequest.Header.Timestamp, so there's no guarantee that it will be increasing. You'd have to be non-increasing on the order of 1s, but with transaction timestamps around that's likely to happen, at least in busy tests (or real-world applications). I think the most straightforward solution is to use diffSeconds only if it turns out positive.
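
That "use diffSeconds only if it turns out positive" fix can be sketched as follows. This is a simplified illustration with a hypothetical helper name, not the actual CockroachDB patch:

```go
package main

import "fmt"

// ageDeltaSeconds is a hypothetical helper returning the wall-clock delta
// in seconds to apply to the age counters. If the incoming timestamp is
// older than the last update (which can happen when timestamps come from
// BatchRequest.Header.Timestamp), it returns 0 so that IntentAge and
// GCBytesAge never move backwards.
func ageDeltaSeconds(nowNanos, lastUpdateNanos int64) int64 {
	diff := nowNanos/1e9 - lastUpdateNanos/1e9
	if diff < 0 {
		return 0
	}
	return diff
}

func main() {
	fmt.Println(ageDeltaSeconds(5e9, 2e9)) // normal case: 3 seconds elapsed
	fmt.Println(ageDeltaSeconds(2e9, 5e9)) // out-of-order timestamp: clamped to 0
}
```

Note that clamping silently drops the out-of-order contribution rather than accounting for it, so the accumulated age becomes a lower bound in that case.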

tbg commented 8 years ago

just pinging this since it's still happening:

  "gcbytesage": {
    "value": -1590486839237365
  },
petermattis commented 8 years ago

A little spelunking today: the most straightforward way we end up with a negative GCBytesAge value is for the BatchRequest.Header.Timestamp to be 0, which seems to happen for every batch containing a LeaderLease request. There may be more subtle problems with the {GCBytes,Intent}Age calculations involving non-monotonically increasing timestamps. I'm not at all confident that the way these values are updated is correct.
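
To illustrate the magnitude of a zero batch timestamp: diffSeconds becomes the negation of the entire elapsed epoch. A small worked example with an illustrative (not actual) last-update time:

```go
package main

import "fmt"

func main() {
	// Suppose the stats were last updated at a realistic wall-clock time
	// (nanoseconds since the Unix epoch) and a batch arrives with
	// Header.Timestamp == 0, as described for LeaderLease batches.
	var lastUpdateNanos int64 = 1.45e18 // roughly late 2015 in Unix nanos (illustrative)
	var nowNanos int64 = 0

	diffSeconds := nowNanos/1e9 - lastUpdateNanos/1e9
	fmt.Println(diffSeconds) // -1450000000: about 46 years of negative age,
	// which, multiplied by the byte counts, easily yields values on the
	// order of the negative gcbytesage reported earlier in this thread.
}
```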

petermattis commented 8 years ago

@tschottdorf PTAL and confirm that this adjustment to the GCBytesAge calculation seems copacetic.

I spent a little time understanding the GCBytesAge computation (same idea applies to IntentAge). Here is my understanding:

GCBytesAge is computed as the delta in seconds from the current time to the time when the gc-able bytes were written (gc-able data is data that is not part of the most recent version). The naive computation would track the bytes written at each time and then compute the sum of a series of time deltas multiplied by the number of bytes written at that time. The naive computation is impractical as it would require storing an entry every time data was written at a particular timestamp. Instead, we keep track of a cumulative GCBytesAge value and the time at which it was computed:

type MVCCStats struct {
  GCBytes    int64
  GCBytesAge int64
  LastUpdate int64
}

The value of GCBytesAge at now is computed as GCBytesAge + GCBytes * (now - LastUpdate). When gc-able data is created (usually by the creation of a new version), the value of GCBytes is incremented, GCBytesAge is updated using the previous value of GCBytes and LastUpdate is set to the current time.
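
The incremental scheme described above can be sketched as a simplified model (seconds instead of nanoseconds, and only the three fields involved; the real MVCCStats has many more):

```go
package main

import "fmt"

// mvccStats is a simplified model of the fields involved in the
// GCBytesAge bookkeeping.
type mvccStats struct {
	GCBytes    int64 // bytes eligible for GC
	GCBytesAge int64 // cumulative byte-seconds as of LastUpdate
	LastUpdate int64 // seconds (the real code tracks nanoseconds)
}

// ageTo advances GCBytesAge to nowSecs using the current GCBytes.
func (s *mvccStats) ageTo(nowSecs int64) {
	s.GCBytesAge += s.GCBytes * (nowSecs - s.LastUpdate)
	s.LastUpdate = nowSecs
}

// addGCBytes records newly gc-able bytes at nowSecs: first age forward
// with the previous GCBytes, then bump the byte count, exactly as the
// comment above describes.
func (s *mvccStats) addGCBytes(bytes, nowSecs int64) {
	s.ageTo(nowSecs)
	s.GCBytes += bytes
}

func main() {
	var s mvccStats
	s.addGCBytes(10, 0) // 10 gc-able bytes written at t=0
	s.addGCBytes(5, 4)  // 5 more at t=4; the first 10 have now aged 4s
	s.ageTo(10)         // at t=10: 10 bytes aged 10s plus 5 bytes aged 6s
	fmt.Println(s.GCBytesAge)
}
```

This prints 130, matching the naive per-write computation (10*10 + 5*6) without storing an entry per write timestamp.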

One wrinkle to throw into the mix: data can be written out of order because data that is written within a transaction is assigned the transaction's timestamp. Consider two transactions, A and B, which both write a key to a range, each creating a single byte of gc-able data. There are 2 scenarios to consider:

Parting thought: This GCBytesAge calculation is subtle and I'm not convinced it is a flexible approach. {GCBytes,Intent}Age are computed solely to drive the gcQueue and delete old versions and resolve old intents. But the computations are currently targeted at the existing GC heuristics (a TTL value). What if we wanted a GC heuristic which was to keep the most recent 10 versions for a key, no matter how old they are? Or the most recent N versions not exceeding some data size? I don't think we need to design for these heuristics right now, but want to point out how GCBytesAge is (over) optimized for our current heuristic.

tbg commented 8 years ago

I agree with your analysis.

tbg commented 8 years ago

working on this today, and updating the comments wildly while I'm there.

tbg commented 8 years ago

I'm still chowing on this. Hopefully when I'm done we'll have code that's more transparent, more maintainable, less scattered and better tested than what currently exists. This stuff is tricky (and has been used incorrectly in some places).

tbg commented 8 years ago

ping #3524 (since it's related)