Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License
187 stars 47 forks source link

Major bug in allocating memory for SST counters #148

Closed sagarjha closed 5 years ago

sagarjha commented 5 years ago

Lorenzo found that the sequence of counter values observed at a receiver node was not monotonic. This happened very rarely. One such non-monotonic sub-sequence, he found is, 19967, 20223 and 19968. The binary representation of these numbers are:

19967 (0100 1101 1111 1111) 20223 (0100 1110 1111 1111) 19968 (0100 1110 0000 0000)

20223 is a combination of the first 8 bits of 19968 and the last 8 bits of 19967. This either implies that the cache line read/write is not atomic (impossible) or the counter crosses the cache-line boundary.

We realized that the row space allocation for the SST had incorrect logic for padding the SST fields (in order to always have a field start at a boundary of 8 bytes). As a result, an SST field of 8 bytes could be split across two cache lines. This bug existed in the Derecho code for 3+ years, right from the time we redesigned the SST to allow for runtime fixed length vectors.

The fix is to fix the bug in the padding logic. Surprisingly, Derecho is very robust against such inconsistencies because the bits are probably written left to right, in which case, such an inconsistency can only increase the value of the observed counter, not decrease it. And since most of Derecho's algorithms involve taking minimum of a column, the probability that all the values of the column will artificially inflate to inflate the minimum is almost 0.

sagarjha commented 5 years ago

Fixed in ec4b9e0.