Rethinking the bounds calculations

vmurali commented 3 weeks ago

There's a different algorithm to calculate the exponent and B(ase), T(op) given a base and length that has the following properties:

Maintains a distinct representable but non dereferencable region of size $2^e$, where $e$ is the (decoded) exponent (same as current algorithm)
Simpler hardware circuit
Tighter bounds for any base, length compared to current algorithm
Size of bound regions double all the way till $2^{32}$ (not true in the current algorithm: there's a jump from $2^{14}$ to $2^{32}$)
Unique canonical representation of any cap(not true in the current algorithm: for instance, B = 1, T = 2, e = 5 is the same as B = 2, T = 4, e = 4)

Here's the algorithm. Given a base $b$ and length $l$, calculate the following: $S$ = mantissa size ($S = 9$ for CHERIoT; $S \ge 1$ should hold for the proof below) $F$ = full size ($F = 32$ for CHERIoT; $F \ge S$ should hold for the proof below) $e = \left\lceil lg(\left\lfloor \frac{l}{2^S} \right\rfloor) \right\rceil$ (This can be implemented similar to how it is done in sail specs using countLeadingZeros and subtracting from $F-S$ and a correction subtracting 1 if countOnes is 1) $d = \left\lfloor \frac{l}{2^e} \right\rfloor$ (This is just truncToMSB_variable where $e$ LSB bits are removed) $i = \left\lceil \frac{l\mod{2^e} + b\mod{2^e}}{2^e} \right\rceil$ (Uses an adder and a couple ofbitSelect's ($\frac{x \mod{2^{e+1}}}{2^e}$ is just the 2 MSB bits of $x[e+1:0]$))

The quantity $(d + i) \cdot 2^e$ is such that $(b - (b \mod{2^e})+ ((d + i) \cdot 2^e) \ge b + l)$, that is $e$ and $(d+i)$ gives (decoded) exponent and the required "difference between T and B". (We are still not done because we need the mantissa a.k.a difference to be $\lt 2^S$; but let's come back to this mantissa problem in a bit.)

Lemma 1: $(b - (b\mod{2^e}) + ((d + i) \cdot 2^e)) \ge (b + l)$ Proof:

Lemma is true if $(d+i)\cdot 2^{e} \ge l + (b\mod{2^e})$.
Substituting the equation for $d$, we get $l = d\cdot 2^{e} + (l\mod{2^e})$.
Now, lemma is true if $i \cdot 2^{e} \ge (l \mod{2^e}) + (b\mod{2^e})$.
Substituting the equation for $i$, we need to prove $\left\lceil \frac{(l\mod{2^e}) + (b\mod{2^e})}{2^e} \right\rceil \cdot 2^{e} \ge (l\mod{2^e}) + (b\mod{2^e})$.
But we know that $\left\lceil \frac{x}{y} \right\rceil \cdot y \ge x$ (Proof hint: let $x = p\cdot y - q$). $\square$

Now let's get the bounds for $d$ and $i$.

Lemma 2: if $e \ge 1$, $d \ge 2^{S-1}$. (This means the MSB of the difference between T and B is already guaranteed to be 1 when $e \ge 1$, so no need to do any more shifts to the left.) Proof:

Let $l = x \cdot 2^S + y$ where $0 \le y \lt 2^S, 0 \le x \le 2^{F-S}$.
Substituting for $e$, we have $e = \left\lceil lg(x) \right\rceil$. If $x \le 1$ then $e = 0$; so $x \ge 2$.
From $\left\lceil lg \right\rceil$ aka $lgCeil$, we have $2^e \ge x \gt 2^{e-1}$.
Let $x = 2^{e-1} + k, 1 \le k \le 2^{e-1}$. Substituting for $l, x$ in equation for $d$ we have:
$d = \left\lfloor \frac{(2^{e-1} + k) \cdot 2^S + y}{2^e} \right\rfloor$ $= 2^{S-1} + \left\lfloor \frac{k \cdot 2^S + y}{2^e} \right\rfloor$ $\ge 2^{S-1}$. $\square$

Lemma 3: $d \lt 2^{S+1}$ (This bounds the number of shifts we need to do to the right; we also need to know the bound for $i$ for the maximum number of shifts for $d + i$.) Proof:

Let $l = x \cdot 2^S + y$ where $0 \le y \lt 2^S, 0 \le x \le 2^{F-S}$.
If $x = 0$ then $e = 0$, so $d = \left\lfloor l \right\rfloor = \left\lfloor y \right\rfloor \lt 2^S$ $\square$
If $x = 1$ then $e = 0$, so $d = \left\lfloor l \right\rfloor = \left\lfloor 2^S + y \right\rfloor \lt 2^S + 2^S = 2^{S+1}$ $\square$
If $x \ge 2$ then from $\left\lceil lg \right\rceil$ aka $lgCeil$, we have $2^e \ge x \gt 2^{e-1}$
Thus, $\frac{1}{2^e} \le \frac{1}{x} \rightarrow \frac{l}{2^e} \le \frac{l}{x} = 2^S + \frac{y}{x}$
Thus, $d = \left\lfloor \frac{l}{2^e} \right\rfloor \le 2^S + \left\lfloor \frac{y}{x} \right\rfloor \lt 2^S + 2^S = 2^{S+1}$ $\square$

Lemma 4: $i \le 2$ Proof:

If $e = 0$, $i = 0$ $\square$
If $e = 1$, $i = \left\lceil \frac{(l\mod 2) + (b\mod 2)}{2} \right\rceil \le \left\lceil \frac{2}{2} \right\rceil = 1$ $\square$
If $e \ge 2$, $i = \left\lceil \frac{(l\mod{2^e}) + (b\mod{2^e})}{2^e} \right\rceil \le \left\lceil \frac{2^e-1 + 2^e-1}{2^e} \right\rceil = \left\lceil \frac{2^{e+1}-2}{2^e} \right\rceil = 2$ $\square$

Lemma 5: $(d + i) \le 2^{S+1} + 1$. (So we need to shift $(d + i)$ right by at most 3 to keep it $\le 2^9 - 1$; you need 1 shift to bring from 11 bits to 10 bits, and you need two shifts to bring from 10 bits to 9 bits (for example, when $(d+i) = 2^{10}-1$)) Proof: From Lemmas 3 and 4 $\square$

Lemma 6: $x \cdot 2^y \le (\left\lfloor \frac{x}{2} \right\rfloor + x \mod{2}) \cdot 2^{y+1}$. This gives us a way to right shift $(d+i)$ to keep it $\le 2^S - 1$) Proof:

Let $x = 2p + q, 0 \le q \le 1$.
Substituting, we need to prove $(2p + q) \cdot 2^y \le (p + q) \cdot 2^{y+1}$.
That is to prove $(p \cdot 2^{y+1} + q \cdot 2^y) \le p \cdot 2^{y+1} + q \cdot 2^{y + 1}$. $\square$

Let the final value (after right shifts) of mantissa be $m$ and final value of exponent be $E$. $m$ represents T - B. B can be calculated using $\left\lfloor \frac{b}{2^{E}} \right\rfloor$, which is essentially truncToMSB_variable. T can be calculated by adding $m$ to B. (I am ignoring the distinction between MSB bits and the "mid" bits here. $\left\lfloor \frac{b}{2^E}\right\rfloor$ is split into MSB and "mid" bits exactly as in the current algorithm and $c_b$, $c_t$ are calculated in the same manner as the current algorithm.)

Since $m =$T - B is guaranteed to have MSB set to 1 (when $E \ge 1$), we don't have to represent T with 9 bits, and instead we can use 8 bits to represent T (This part is Jon Woodruff's optimization suggestion; we need an adder to calculate the MSB of T. Alternatively, we can just store $m$, i.e. the difference, instead of T, which will avoid the addition.)

Note that for all values of $E \lt (F-S)$ any expanded cap will always have a canonical representation as a compressed cap since the MSB of $m = 1$. So, any 64-bit value loaded during a memcpy can be expanded as a cap and stored in the registers canonically if $E \lt (F-S)$. For $E \ge (F-S)$ however, the left shifting will push the B and T or $m$ out of 32-bit range, and so we need additional 9 bits to hold the shifted values. In fact, the additional 9 bits will hold exactly the B and T or $m$. So, if we store 41 bits for base and 41 for length (or top) in the registers, recompressing them will give back the original compressed encoding always, even if the original value was not a cap.

vmurali commented 3 weeks ago

Note that the encoding in Appendix D of the spec actually does not work. In particular, the requirement for the representable region having $2^e$ bytes outside of the dereferencable region essentially requires the same logic that increments T twice in the current algorithm. However, if one is okay with having just 1 byte in the representable region outside of the dererferencable region, then one can just get the bounds for the requested length $+ 1$. That might be acceptable too. The above algorithm has to then be tweaked accordingly.

jonwoodruff commented 3 weeks ago

@vmurali Looks thorough! I haven't digested the whole thing, but would like to ask about one detail which might help elucidate the rest.

For E ≥ 23 however, the left shifting will push the B and T or m out of 32-bit range, and so we need additional 9 bits to hold the shifted values.

Is there value in allowing Es that push B and T beyond the 32-bit range? We have typically capped valid Es at the value that puts the implied msb of l at the 33rd bit, that is, the entire address space (0x00000000-0xFFFFFFFF).

vmurali commented 3 weeks ago

Yes, I mean, all E's are now valid, including 24 to 31. It makes it seamless to use memcpy using capability-load and capability-store while storing only expanded caps in the registers (where top and base are both 41 bits)

jonwoodruff commented 3 weeks ago

Is there a problem with interpreting everything above 23 as 23 when encoding/decoding while storing the original E? I'm not sure what all the tradeoffs are, but there should be fewer muxes into the upper bits of expanded T & B, if that's worth anything.

jonwoodruff commented 3 weeks ago

See issue https://github.com/CHERIoT-Platform/cheriot-sail/issues/45 Would be nice to understand how this proposal compares to that one, and what improvement it gives on top.

vmurali commented 3 weeks ago

Is there a problem with interpreting everything above 23 as 23 when encoding/decoding while storing the original E? I'm not sure what all the tradeoffs are, but there should be fewer muxes into the upper bits of expanded T & B, if that's worth anything.

You still need to store some version of the compressed caps' T and B if you want to use cap-load and cap-store for memcpy's (which is necessary because in a struct, you don't know which fields are caps). This encoding (storing 41 bits) eliminates the need for special logic to decide if E is $\le 23$ and treats everything uniformly. You can still use just 32-bit adders for dealing with actual caps.

vmurali commented 3 weeks ago

See issue https://github.com/CHERIoT-Platform/cheriot-sail/issues/45 Would be nice to understand how this proposal compares to that one, and what improvement it gives on top.

This incorporates the first suggestion you gave in #45 , namely reusing a bit from T - B to use for exponent. The rest of the algorithm is different from what happens in CHERIoT right now

jonwoodruff commented 3 weeks ago

I think I understand why you're storing the 41 bits. We have previously solved the same problem by changing the interpretation of E >= 23 to being the same as 23 to avoid needing to shift T & B higher than 33. This also works with memcpy; you preserve all the bits of T & B; you just leave them at the same place as you would if E==23.

vmurali commented 3 weeks ago

I think I understand why you're storing the 41 bits. We have previously changed the interpretation of E >= 23 to being the same as 23 to avoid needing to shift T & B higher than 33. This also works with memcpy; you preserve all the bits of T & B; you just leave them at the same place as you would if E==23.

Right. But you still need to store the T and B of the compressed meta data somewhere to restore that on a memcpy. Just storing 41 bits and not storing the compressed bounds will achieve the same result (with the same number of bits but without special case handling, thereby reducing the circuit complexity)

jonwoodruff commented 3 weeks ago

How does "storing 41 bits" for each bound use "the same number of bits" as storing 33 bits for each bound? This would be at the cost of "special case handling", as you say, but I don't know what the complexity of that is after boolean optimization. (The logic that feeds each layer of muxing in the shifter would be different, though I really know if it would be more complex. It would likely be simpler as you go up to higher bits.)

vmurali commented 3 weeks ago

"the same number of bits" as storing 33 bits for each bound?

If you just have 33 bits, when a cap-load instruction is used, you get 64 bits and a tag bit. Let's say the tag is not set (i.e. this is part of memcpy of a struct). Now, you need to preserve the 64 bits when writing it back. How will you do that unless you recover the T and B part of the metadata unless they are stored somewhere? One way is to store T and B in the 33-bit bounds, and then use whether E is $\ge 23$ to decide if the bounds area stores T and B or the expanded bounds. I guess that's what you are talking about. Instead, if you were storing the compressed cap as well as the expanded bounds, then we don't need extra storage, as we would be needing 41 bits anyway.

Anyway, this overall is a minor point. The algorithm (and the circuit decomplexity that comes with it) is more pertinent than whether a few bits are saved.

jonwoodruff commented 3 weeks ago

If I understood your proposal, you keep 41 bits per bound with ~9 bits potentially non-zero, and the rest zero. When you store it out, you select the ~9 bits back out of the 41-bit value to derive the original T & B for the 64-bit format.

I'm suggesting that if you decode the T and B into 33 bits instead, interpreting all E>=23 as 23, you will also preserve T and B in all cases and can select them out when re-deriving the original 64-bit format. For example, if E happens to be 30, you shift T and B up to the top of their 33-bit decompressed bounds, preserving E=30 in the register. When storing back out, you interpret E=30 as 23 so that you correctly select the upper bits to put in back in the T and B fields of the 64-bit format.

vmurali commented 3 weeks ago

Right, I think that works too.

vmurali commented 3 weeks ago

I implemented the algorithm in sail in #76 . I also changed the encoding to reflect base (B) and difference (M) rather than base (B) and top (T). M is 8 bits, while B is still 9 bits, and there's an extra ME field of 1 bit used either for exponent (when exponent is not zero) or the MSB of M.

As expected, all tests and examples in cheriot-rtos pass.

CHERIoT-Platform / cheriot-sail

Rethinking the bounds calculations #75