Open blckngm opened 2 years ago
@sopium we've done practically no optimization work, so I'm glad you're thinking about it.
Inline assembly is fine, but it would need to be feature-gated due to the higher MSRV (we can eventually consider bumping it, but our last bump was already painful).
For something like `[T; N]`, though, it would be good to ensure this doesn't monomorphize to `N` distinct blobs of ASM.
It boils down to LLVM not being able to optimize `write_volatile` for `u8` pointers: the bytes have to be written individually.
I think monomorphization for 16/32/64-byte arrays should be fine. These are also commonly used key sizes in cryptography, so the optimization can be justified.
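To illustrate the monomorphization concern, one way to keep `[u8; N]` from expanding into `N` distinct blobs of code is to funnel every size through a single non-generic helper. This is my own sketch with hypothetical names (`zeroize_bytes`, `zeroize_array`), not code from the crate:

```rust
use core::ptr;

// Single non-generic routine: only one blob of machine code is emitted,
// regardless of how many array sizes forward to it.
fn zeroize_bytes(x: &mut [u8]) {
    for b in x.iter_mut() {
        // Volatile stores so the compiler cannot elide the wipe.
        unsafe { ptr::write_volatile(b, 0) };
    }
}

// Thin generic shim over arrays; after inlining, all sizes share
// the slice implementation above.
#[inline]
fn zeroize_array<const N: usize>(x: &mut [u8; N]) {
    zeroize_bytes(x.as_mut_slice());
}
```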
For a `u8` slice with an unknown but largish size you can also use `align_to_mut`:
```rust
pub fn zeroize_slice(x: &mut [u8]) {
    unsafe {
        // Use __m128 on x86/x86_64.
        // `align_to_mut` splits the slice into an unaligned prefix,
        // an aligned middle of u64s, and an unaligned suffix.
        let (prefix, middle, suffix) = x.align_to_mut::<u64>();
        for b in prefix {
            core::ptr::write_volatile(b, 0);
        }
        for w in middle {
            core::ptr::write_volatile(w, 0);
        }
        for b in suffix {
            core::ptr::write_volatile(b, 0);
        }
    }
}
```
The use of `write_volatile` was something of a hack to ensure that the writes would not be elided by the compiler. But inline assembly should ensure that as well, and permit a highly performant implementation.
> The use of `write_volatile` was something of a hack to ensure that the writes would not be elided by the compiler. But inline assembly should ensure that as well, and permit a highly performant implementation.
Yes, I understand this. If it's a “normal” write in a loop, LLVM would be able to recognize it and generate efficient instructions or a call to `memset`.
Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile `memset`, typically realized by using some kind of compiler fence, often implemented as invoking `memset` in conjunction with an assembly-based dataflow optimization barrier.
While Rust has bindings to LLVM's volatile memset intrinsics, they are perma-unstable, and no work has been done on a stable API AFAIK: https://doc.rust-lang.org/std/intrinsics/fn.volatile_set_memory.html
But perhaps inline assembly could provide the next best thing, albeit on an architecture-by-architecture basis.
We cannot use `write_volatile` on a `*mut __m128` that we got from a `[u8; 32]`, since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the `[u8; 32]` into aligned and unaligned parts. In theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.
> We cannot use `write_volatile` on a `*mut __m128` that we got from a `[u8; 32]`, since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the `[u8; 32]` into aligned and unaligned parts. In theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.
I was suggesting using `[__m128; N]` from the start, and transmuting to `&[u8; N * 16]` when necessary.

As for splitting into aligned and unaligned parts, that's exactly what `align_to_mut` does.
> Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile `memset`, typically realized by using some kind of compiler fence, often implemented as invoking `memset` in conjunction with an assembly-based dataflow optimization barrier.
Oh, this reminds me: now that we have stable inline assembly, this can be done too, and it's not architecture-dependent:
```rust
pub fn zeroize(x: &mut [u8]) {
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        // Empty asm block that "reads" the pointer, so the writes
        // above cannot be optimized away.
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) x.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

pub fn f() {
    let mut x = [0u8; 32];
    zeroize(&mut x);
}
```
Playground: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=32fc07b8173bd2ce233eaa9bf0d8d81a

`zeroize` in `f` is optimized to two `movaps` stores, but is not optimized out.
@sopium I have a question about the use of `readonly` in the assembly options. (This is mostly for my own understanding.)
I watched a talk by Chandler Carruth (a major contributor to LLVM) a long time ago, where he described creating similar functions in order to defeat the optimizer for benchmarking purposes: https://www.youtube.com/watch?v=nXaxk27zwlk&t=2476s
```cpp
static void escape(void* p) {
    // Empty volatile asm that "uses" p and clobbers memory.
    asm volatile("" : : "g"(p) : "memory");
}
```
I realize that this is a C++ talk and not Rust, but they both go through LLVM in the end, so there's a lot of overlap.
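That C++ barrier can be sketched in stable Rust roughly as follows. This is my own sketch, not code from the talk or the thread; the choice of `asm!` options is an assumption:

```rust
use core::arch::asm;

// Rough Rust analogue of the C++ `escape`: an empty asm block that takes the
// pointer as an input. Because the block is NOT marked `pure`, `nomem`, or
// `readonly`, the optimizer must assume it may read or write any memory
// reachable through `p`, so earlier writes cannot be elided as dead.
fn escape<T>(p: *mut T) {
    unsafe {
        asm!("/* {0} */", in(reg) p, options(nostack, preserves_flags));
    }
}
```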
In his talk he seems to say that it's important for the optimizer to believe that:

(1) This asm has side effects that the optimizer cannot know about. In C++ that's done with `volatile` asm, and in Rust I think that's done by not marking your asm `pure`.

(2) This asm potentially clobbers all memory. So any reads from memory after this point have to actually go back to memory, and any values cached in registers are dirty.
In your block you use the `readonly` flag, and I'm trying to understand what this does and why it is good.
So, the optimizer has to believe that (1) the asm has side effects it cannot know about, (2) it may read any reachable memory (so earlier writes can't be elided as dead), and (3) it does not write memory, change flags, or clobber registers.

So (3) is good because it will help the efficiency of code using this block.
But it's hard for me to understand: if we're promising the optimizer that flags are not changed, registers are not changed (only the `in(reg)` for `ptr` is read from), and memory is not changed, what is the side effect that the compiler still has to be worried about, which prevents it from dropping this block? Or is the model just that it still has to assume there might be something beyond this that it doesn't understand?
@cbeck88
AFAIK `readonly` applies only to memory of the current process; from the compiler's PoV the asm block may in theory perform a syscall, and that would still be well within the bounds of the `asm!` options contract.
I see, thank you
It looks to me that one barrier to actually creating a patch that does this is the need for some form of the specialization language feature.
We currently have this impl:

```rust
/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].
impl<Z, const N: usize> Zeroize for [Z; N]
where
    Z: Zeroize,
{
    fn zeroize(&mut self) {
        self.iter_mut().zeroize();
    }
}
```
If we wanted to also add impls for the special case `Z = u8`, these would be overlapping impls, and we obviously cannot remove the existing impls.
I had this same problem in a hashing library: https://github.com/mobilecoinfoundation/mobilecoin/blob/b07c17fd2e1946c15b3a38c30b0fb1dc4ab516e3/crypto/digestible/src/lib.rs#L218
What I decided to do there was simply not support my trait on the primitive type `u8`, so that I could have my nice impl for `[T]` and a nice impl for `[u8]`. Since it's extremely rare that someone actually needs to put a single `u8` in a struct, I never actually had a problem, and I was able to do this kind of optimization without waiting for specialization to land.
In your case you already shipped a stable API which supports `u8`, so you probably don't want to take this approach.
`u8` is pretty commonly used as a carrier for masked boolean values when writing constant-time code.
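For illustration, masked booleans in constant-time code look something like the following. This is a sketch with hypothetical names (`ct_eq`, `ct_select`); real code would typically use a vetted crate such as `subtle`:

```rust
// Expand an equality test into an all-ones (0xFF) or all-zeros (0x00) byte
// mask, without branching on secret data.
fn ct_eq(a: u8, b: u8) -> u8 {
    let x = a ^ b; // 0 iff a == b
    // MSB of (x | -x) is set iff x != 0, so this is 1 if unequal, 0 if equal.
    let nonzero = (x | x.wrapping_neg()) >> 7;
    // Map 0 -> 0xFF (equal), 1 -> 0x00 (unequal).
    (nonzero ^ 1).wrapping_neg()
}

// Select between two values using a mask, again without branching.
fn ct_select(mask: u8, if_set: u8, if_clear: u8) -> u8 {
    (mask & if_set) | (!mask & if_clear)
}
```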
Good point
So do you see a path forward for this? Or just wait for specialization?
It seems to me we could make an optimized `zeroize_bytes` function based on the approach here, and people could either implement `Zeroize` manually and call out to it, or we could make something like a `#[zeroize(with = ...)]` attribute for the `zeroize-derive` proc macro, similar to what they do in serde. Is that something you would support?
A free function works, although it is somewhat inelegant.
It might be possible to optimize the impl on `[T]` generically: pack as many copies of `<T as Default>::default()` as can fit into a quadword, write quadwords with `movups` until the remaining space in the slice is less than a quadword, then fall back to writing elements one at a time to fill the remaining space.
Something like this:

```rust
// Pseudocode sketch: `movups` and `slow_path` are stand-ins, and the `const`
// items depending on `T` would need const generics to actually compile.
const QUADWORD_SIZE: usize = 8;
const SIZE_OF_T: usize = mem::size_of::<T>();

if (SIZE_OF_T < QUADWORD_SIZE) && (QUADWORD_SIZE % SIZE_OF_T == 0) {
    const COPIES_IN_QUADWORD: usize = QUADWORD_SIZE / SIZE_OF_T;
    let quadword = [T::default(); COPIES_IN_QUADWORD];
    // `by_ref` so the remainder can still be taken after the loop.
    let mut iter = slice.chunks_exact_mut(COPIES_IN_QUADWORD);
    for chunk in iter.by_ref() {
        movups(chunk, &quadword as *const T as *const u64);
    }
    slow_path(iter.into_remainder());
} else {
    slow_path(slice);
}
```
Interesting idea, maybe this is a good use-case for https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut
Looking more at the code, it looks to me that we already have this impl (maybe not directly relevant to OP):
```rust
/// Impl [`Zeroize`] on slices of types that can be zeroized with [`Default`].
///
/// This impl can eventually be optimized using a memset intrinsic,
/// such as [`core::intrinsics::volatile_set_memory`]. For that reason the
/// blanket impl on slices is bounded by [`DefaultIsZeroes`].
///
/// To zeroize a mut slice of `Z: Zeroize` which does not impl
/// [`DefaultIsZeroes`], call `iter_mut().zeroize()`.
impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,
{
    fn zeroize(&mut self) {
        assert!(self.len() <= isize::MAX as usize);

        // Safety:
        //
        // This is safe, because the slice is well aligned and is backed by a single allocated
        // object for at least `self.len()` elements of type `Z`.
        // `self.len()` is also not larger than an `isize`, because of the assertion above.
        // The memory of the slice should not wrap around the address space.
        unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };
        atomic_fence();
    }
}
```
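For context, the `volatile_set` and `atomic_fence` helpers referenced above might look roughly like this. This is a hypothetical sketch following the names in the snippet; the crate's actual internals may differ:

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Fill `count` elements starting at `dst` with `src` using volatile stores,
/// which the compiler may not elide or coalesce away.
///
/// Safety: `dst` must be valid and properly aligned for `count` writes of `T`.
#[inline]
unsafe fn volatile_set<T: Copy>(dst: *mut T, src: T, count: usize) {
    for i in 0..count {
        ptr::write_volatile(dst.add(i), src);
    }
}

/// Compiler fence preventing subsequent memory accesses from being
/// reordered before the wipe.
#[inline]
fn atomic_fence() {
    compiler_fence(Ordering::SeqCst);
}
```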
It's too bad that `Vec<T>`, `Box<[T]>`, and `[T; N]` don't seem to call into this impl. I guess they can't, because they don't have the `DefaultIsZeroes` bound? And that's where we'd need negative trait bounds or specialization or something in order to let them do it in the cases when that bound does hold.

But maybe this is the place to do this kind of optimization work, and people can call `foo.as_mut_slice().zeroize()` or whatever when the perf matters, for now.
The impls on slice types call `volatile_set`, so that seems like the proper place to provide an optimized implementation for filling slices.
Here's the thing: I bet that `volatile_set` is already going to be well understood by the compiler, and that splitting this up for alignment is going to be done by LLVM, so I feel like writing out the `align_to_mut` stuff is probably going to be wasted effort. (Could look at the code gen and see.) I know I've seen LLVM generate all of this before for `memset` calls.
Here's an idea: maybe we can refactor things so that we don't have special-casing based on the `DefaultIsZeroes` trait?
What if we made `Zeroize` like this:
```rust
pub trait Zeroize {
    /// The bit pattern that we will set objects of this type to when zeroizing them
    fn zero_pattern() -> Self;

    /// Write the zero pattern over an object; can't be optimized away
    fn zeroize(&mut self) {
        volatile_write(self, Self::zero_pattern());
        atomic_fence();
    }
}
```
Then the idea is we could have `impl<Z> Zeroize for [Z]` without any additional bounds, using the same code as above, except it would use `Z::zero_pattern()` rather than `Z::default()`. And we would similarly be able to implement `Zeroize` for `Vec<Z>`, `Box<[Z]>`, etc. in a way that calls into the optimized implementation for `[Z]`.
I think we might not need to wait for specialization if we did that, or something along these lines.
First, you're proposing breaking changes to a post-1.0 crate which exposes a trait-oriented API. Those would only be accepted as a last resort.
But also, `zeroize` is designed to work on structured types that don't necessarily implement `Copy`. That's why `zeroize` takes `&mut self`: so it can mutate data in place.
`DefaultIsZeroes` exists to provide a strategy for zeroizing `Copy` types by replacing them with the `Default` value, but that approach doesn't work for non-`Copy` types, and `zeroize` is very much designed to support both.
Yeah, I see that. It's going to get more complicated when you want the implementation for `[Z]` not to assume that `Z` is `Copy` anymore, so that it can be parametric.
Intuitively, what I'd like to say is: what if we try to implement the general case, where `Z` might have destructors, putting calls to `drop_in_place` etc. as appropriate, which become no-ops when `Z` is `Copy`? We could try to be parametric over all types. At least it's not obvious to me that "that approach doesn't work for non-`Copy` types". Maybe it's obvious to you; you've thought about it more.
I understand it's a breaking change, just trying to think about directions that might make things simpler or better.
The big problem with any breaking change is that, like `serde`, `zeroize` contains traits which compose across many crates. It has ~500 downstream dependencies, and any kind of upgrade needs to be coordinated across all of them.
So really breaking changes need to be reserved for things that absolutely must make them. And I don't see any reason that mandates such a drastic change.
> it's going to get more complicated when you want the implementation for `[Z]` not to assume that `Z` is copy anymore
That's not an issue. The impl of `Zeroize` for `[Z]` already bounds on `DefaultIsZeroes`, which in turn bounds on `Copy`, `Default`, and `Sized`, so we can write an optimized implementation which does a simple copy. It was designed that way from the start.
The impl on `IterMut` can be used to zeroize `[Z]` where `Z` is non-`Copy`, i.e. `non_copy_slice.iter_mut().zeroize()`.
I'm just trying to understand: even if we did optimize the `impl<Z: DefaultIsZeroes> Zeroize for [Z]`, what would I, as a user, have to do to actually get it?
Because if I have code like the OP's:

```rust
let mut x = [7u8; 32];
x.zeroize();
```

it's not going to call the zeroize impl for slices; it's going to call the `[T; N]` one.
Suppose I have code like:

```rust
#[derive(Zeroize)]
struct A {
    a: Vec<u8>,
}
```

This is not going to call the `impl<Z: DefaultIsZeroes> Zeroize for [Z]` impl that we're proposing to optimize, which is kind of unfortunate. So optimizing this stuff feels kind of like wasted effort until this is somehow improved.
In the meantime, from experiments it looks like

```rust
let mut x = [7u8; 32];
(&mut x as &mut [u8]).zeroize();
```

will convince the compiler to pick the faster implementation, so maybe we could do the optimization we're talking about and document that this is what you have to do as a user if zeroizing one byte at a time isn't good enough?
And for a struct like `A` you probably have to implement `Zeroize` by hand, and you probably should only do that if you're sure that your vec is never getting resized? Or you would have to marry the impl for `Vec<Z>` with the optimizations we have for slices with `DefaultIsZeroes`.

It would be nice if we could make this easier for the user of the crate somehow, but I guess I don't see a way if we're not interested in discussing zeroize 2.0 here. Thanks.
> what as a user would I have to do to actually get it.
Borrow the value as a mutable slice, then call `zeroize`. It's a bit of a pain, but the best we can do without specialization.
> guess I don't see a way if we're not interested in discussing zeroize 2.0 here
I still don't see anything worth making breaking changes over until specialization lands, and even then it would probably still make sense to have a `specialization` feature which adds the specialized impls.
I inspected the generated assembly code and benchmarked `zeroize` for `[u8; 32]` on x86_64, and found it quite inefficient, storing one byte at a time: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=3f44f4b90e6af0eac0dcb1f649390329

On my Ryzen CPU, it takes ~7.8324 ns, or ~1 cycle per byte. Binary code size is also quite large.
Using inline assembly (just stabilized in 1.59) and SSE2, zeroing a `[u8; 32]` takes just 3 instructions and ~492.87 ps (~16 bytes per cycle). So it might be something worth optimizing/documenting.
If you do not want to use inline assembly, maybe you should encourage using larger types or SIMD types, e.g., `[u64; 4]` or `[__m128; 2]` instead of `[u8; 32]`. Using `write_volatile` on `*mut __m128` generates code as compact and efficient as the assembly above.
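As a sketch of that suggestion, here is a portable variant using `u128` in place of `__m128` so it isn't x86-specific. The `AlignedKey` wrapper and `zeroize_key` are hypothetical names; the `#[repr(align(16))]` wrapper is needed because a bare `[u8; 32]` only guarantees byte alignment, which would make the wide volatile stores UB:

```rust
use core::ptr;

// Force 16-byte alignment so the buffer can legally be written as two u128s.
#[repr(align(16))]
struct AlignedKey([u8; 32]);

fn zeroize_key(key: &mut AlignedKey) {
    // Valid cast: the wrapper guarantees alignment >= align_of::<u128>().
    let p = key.0.as_mut_ptr() as *mut u128;
    unsafe {
        // Two volatile 16-byte stores that the compiler may not elide.
        ptr::write_volatile(p, 0);
        ptr::write_volatile(p.add(1), 0);
    }
}
```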