changrui82 closed this issue 8 years ago
You're right, this isn't good. I think the `nowNanos` used ultimately comes from a `BatchRequest.Header.Timestamp`, so there's no guarantee that it will be increasing. You'd have to be non-increasing on the order of 1s, but with transaction timestamps around that's likely to happen, at least in busy tests (or real-world applications). I think the most straightforward solution is to use `diffSeconds` only if it turns out positive.
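A minimal sketch of that guard, using hypothetical field and method names modeled on this discussion (`LastUpdateNanos`, `GCBytesAge`, `IntentAge`, `ageTo`), not the actual cockroach definitions:

```go
package main

import "fmt"

// MVCCStats here is an illustrative stand-in for the real struct.
type MVCCStats struct {
	GCBytes         int64
	GCBytesAge      int64
	IntentCount     int64
	IntentAge       int64
	LastUpdateNanos int64
}

// ageTo accrues age only when the clock actually moved forward,
// guarding against batch timestamps that lag LastUpdateNanos.
func (ms *MVCCStats) ageTo(nowNanos int64) {
	diffSeconds := (nowNanos - ms.LastUpdateNanos) / 1e9
	if diffSeconds > 0 {
		ms.GCBytesAge += ms.GCBytes * diffSeconds
		ms.IntentAge += ms.IntentCount * diffSeconds
		ms.LastUpdateNanos = nowNanos
	}
}

func main() {
	ms := &MVCCStats{GCBytes: 10, IntentCount: 2, LastUpdateNanos: 5e9}
	ms.ageTo(3e9) // stale timestamp: a no-op instead of subtracting age
	fmt.Println(ms.GCBytesAge, ms.IntentAge)
	ms.ageTo(8e9) // 3 seconds forward: ages accrue normally
	fmt.Println(ms.GCBytesAge, ms.IntentAge)
}
```

With the guard, the stale timestamp leaves both ages at 0; the forward step then adds 10·3 and 2·3 respectively.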
just pinging this since it's still happening:

```
"gcbytesage": {
  "value": -1590486839237365
},
```
A little spelunking today: the most straightforward way we end up with a negative `GCBytesAge` value is for the `BatchRequest.Header.Timestamp` to be 0, which seems to happen for every batch containing a `LeaderLease` request. There may be more subtle problems with the `{GCBytes,Intent}Age` calculations involving non-monotonically increasing timestamps. I'm not at all confident that the way these values are updated is correct.
@tschottdorf PTAL and confirm that this adjustment to the `GCBytesAge` calculation seems copacetic.
I spent a little time understanding the `GCBytesAge` computation (the same idea applies to `IntentAge`). Here is my understanding:

`GCBytesAge` is computed as the delta in seconds from the current time to the time when the gc-able bytes were written (gc-able data is data that is not part of the most recent version). The naive computation would track the bytes written at each time and then compute the sum of a series of time deltas multiplied by the number of bytes written at that time. The naive computation is impractical as it would require storing an entry every time data was written at a particular timestamp. Instead, we keep track of a cumulative `GCBytesAge` value and the time at which it was computed:
```go
type MVCCStats struct {
	GCBytes    int64
	GCBytesAge int64
	LastUpdate int64
}
```
The value of `GCBytesAge` at `now` is computed as `GCBytesAge + GCBytes * (now - LastUpdate)`. When gc-able data is created (usually by the creation of a new version), the value of `GCBytes` is incremented, `GCBytesAge` is updated using the previous value of `GCBytes`, and `LastUpdate` is set to the current time.
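As a concrete sketch of that projection (field names mirror the struct above; times are in plain seconds here for readability, and `ageAt` is a hypothetical helper, not the actual cockroach code):

```go
package main

import "fmt"

// MVCCStats mirrors the struct described in this comment.
type MVCCStats struct {
	GCBytes    int64 // bytes currently eligible for GC
	GCBytesAge int64 // cumulative byte-seconds accrued as of LastUpdate
	LastUpdate int64 // time (seconds) at which GCBytesAge was last computed
}

// ageAt projects the cumulative age forward to now: every gc-able byte
// accrues one unit of age per second since LastUpdate.
func ageAt(s MVCCStats, now int64) int64 {
	return s.GCBytesAge + s.GCBytes*(now-s.LastUpdate)
}

func main() {
	// 2 gc-able bytes, 1 byte-second already accrued as of t=2.
	s := MVCCStats{GCBytes: 2, GCBytesAge: 1, LastUpdate: 2}
	fmt.Println(ageAt(s, 5)) // 1 + 2*(5-2) = 7
}
```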
One wrinkle to throw into the mix: data can be written out of order, because data that is written within a transaction is assigned the transaction's timestamp. Consider two transactions, A and B, which both write a key to a range, each creating a single byte of gc-able data. There are two scenarios to consider:

1. `TxnTime(A) < TxnTime(B)` and `WriteTime(A) < WriteTime(B)`: This is the natural ordering and the only one the code currently handles. Let's assume `TxnTime(A)` is 1 and `TxnTime(B)` is 2. `Txn-A` creates its gc-able byte: `GCBytes` is set to 1, `GCBytesAge` is set to 0 and `LastUpdate` is set to 1. `Txn-B` creates its gc-able byte: `GCBytes` increases to 2, `GCBytesAge` is set to 1 (the previous `GCBytes` value times the time delta) and `LastUpdate` is set to 2.
2. `TxnTime(B) < TxnTime(A)` and `WriteTime(A) < WriteTime(B)`: The write-time ordering differs from the transaction-time ordering. Let's assume `TxnTime(B)` is 1 and `TxnTime(A)` is 2. `Txn-A` creates its gc-able byte: `GCBytes` is set to 1, `GCBytesAge` is set to 0 and `LastUpdate` is set to 2. `Txn-B` creates its gc-able byte: `GCBytes` is set to 2 as in the first scenario. But the computation of `GCBytesAge` is problematic as the byte is being written in the past (`TxnTime(B) < LastUpdate`). When this occurs, `LastUpdate` is left alone (it is defined as the high-water mark of times that have been seen) and `GCBytesAge` is incremented by `BytesWritten * (LastUpdate - TxnTime(B))`.
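The two scenarios above can be sketched with a toy model (illustrative names and a seconds-based clock, not the actual cockroach code):

```go
package main

import "fmt"

// mvccStats is a hypothetical stand-in for the bookkeeping described above.
type mvccStats struct {
	GCBytes    int64 // bytes eligible for GC
	GCBytesAge int64 // cumulative byte-seconds as of LastUpdate
	LastUpdate int64 // high-water mark of times seen, in seconds
}

// ageAt projects GCBytesAge forward to time now.
func (s *mvccStats) ageAt(now int64) int64 {
	return s.GCBytesAge + s.GCBytes*(now-s.LastUpdate)
}

// addGCBytes records bytes of gc-able data written at time t, handling
// writes both at/after and before the LastUpdate high-water mark.
func (s *mvccStats) addGCBytes(t, bytes int64) {
	if t >= s.LastUpdate {
		// Natural ordering: accrue age for existing bytes, advance the clock.
		s.GCBytesAge += s.GCBytes * (t - s.LastUpdate)
		s.LastUpdate = t
	} else {
		// Write in the past: LastUpdate stays put; the new bytes have
		// already aged (LastUpdate - t) seconds.
		s.GCBytesAge += bytes * (s.LastUpdate - t)
	}
	s.GCBytes += bytes
}

func main() {
	// Scenario 1: writes arrive in timestamp order (t=1, then t=2).
	s1 := &mvccStats{}
	s1.addGCBytes(1, 1)
	s1.addGCBytes(2, 1)

	// Scenario 2: writes arrive out of order (t=2, then t=1).
	s2 := &mvccStats{}
	s2.addGCBytes(2, 1)
	s2.addGCBytes(1, 1)

	// Both orderings yield the same age at t=2.
	fmt.Println(s1.ageAt(2), s2.ageAt(2)) // 1 1
}
```

Both orderings converge on an age of 1 byte-second at t=2, which is what the high-water-mark treatment of `LastUpdate` buys us.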
Parting thought: This `GCBytesAge` calculation is subtle and I'm not convinced it is a flexible approach. `{GCBytes,Intent}Age` are computed solely to drive the `gcQueue`, which deletes old versions and resolves old intents. But the computations are currently targeted at the existing GC heuristic (a TTL value). What if we wanted a GC heuristic that kept the most recent 10 versions of a key, no matter how old they are? Or the most recent N versions not exceeding some data size? I don't think we need to design for these heuristics right now, but I want to point out how `GCBytesAge` is (over-)optimized for our current heuristic.
I agree with your analysis.
working on this today, and updating the comments wildly while I'm there.
I'm still chewing on this. Hopefully when I'm done we'll have code that's more transparent, more maintainable, less scattered and better tested than what currently exists. This stuff is tricky (and has been used incorrectly in some places).
ping #3524 (since it's related)
In the `MergeMVCCStats` function, we have: […] And in `rs.MVCCStats.Add(ms)` we have: […] So, when updating `ms.LastUpdateNanos`, we check whether `oms.LastUpdateNanos` is greater than `ms.LastUpdateNanos`. But when calculating `diffSeconds`, we don't make any check. (`rs.LastUpdateNanos` and `ms.LastUpdateNanos` in `rs.MVCCStats.Add(ms)` are the same variable.) As a result, I often see a negative `diffSeconds` in my tests, and then `ms.IntentAge` goes negative (it can be smaller than -10000 in a busy range, because `rs.IntentCount * diffSeconds` amplifies the negative `diffSeconds` again and again). I'm not sure whether it is a problem, because after a long time it can turn back into a positive number based on `elapsedSeconds`. But if I want intent GC to start in a short time rather than a very long time (24 hours currently), it will be an obstacle.
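A toy reproduction of the amplification described here (illustrative names and a simplified update; not the actual `MergeMVCCStats` code):

```go
package main

import "fmt"

// stats is a minimal stand-in for the merged MVCC stats.
type stats struct {
	IntentCount     int64
	IntentAge       int64
	LastUpdateNanos int64
}

// addUnguarded mimics an update that checks LastUpdateNanos before
// advancing it, but applies diffSeconds without any sign check.
func (rs *stats) addUnguarded(nowNanos int64) {
	diffSeconds := (nowNanos - rs.LastUpdateNanos) / 1e9
	rs.IntentAge += rs.IntentCount * diffSeconds // can go negative
	if nowNanos > rs.LastUpdateNanos {
		rs.LastUpdateNanos = nowNanos
	}
}

func main() {
	rs := &stats{IntentCount: 100, LastUpdateNanos: 10e9}
	// A busy range repeatedly merging stats stamped 5 seconds in the past:
	// each merge subtracts IntentCount * 5 from IntentAge, since
	// LastUpdateNanos never moves backward to absorb the stale timestamp.
	for i := 0; i < 20; i++ {
		rs.addUnguarded(5e9)
	}
	fmt.Println(rs.IntentAge) // prints -10000
}
```

With 100 intents and a timestamp 5s stale, twenty merges drive `IntentAge` to -10000, matching the magnitudes reported above; clamping `diffSeconds` at zero would make every iteration a no-op instead.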