Open blckngm opened 2 years ago
@sopium we've done practically no optimization work, so I'm glad you're thinking about it.
Inline assembly is fine, but it would need to be feature-gated due to the higher MSRV (we can eventually consider bumping it, but our last bump was already painful).
For something like `[T; N]`, though, it would be good to ensure this doesn't monomorphize to `N` distinct blobs of ASM.
It boils down to LLVM not being able to optimize `write_volatile` for `u8` pointers: the bytes have to be written individually.
I think monomorphization for 16/32/64-byte arrays should be fine. These are also commonly used key sizes in cryptography, so the optimization can be justified.
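To illustrate the monomorphization concern, one way to keep `[u8; N]` from expanding into `N` distinct blobs of code is to funnel every size through a single non-generic helper. This is my own sketch with hypothetical names (`zeroize_bytes`, `zeroize_array`), not code from the crate:

```rust
use core::ptr;

// Single non-generic routine: only one blob of machine code is emitted,
// regardless of how many array sizes forward to it.
fn zeroize_bytes(x: &mut [u8]) {
    for b in x.iter_mut() {
        // Volatile stores so the compiler cannot elide the wipe.
        unsafe { ptr::write_volatile(b, 0) };
    }
}

// Thin generic shim over arrays; after inlining, all sizes share
// the slice implementation above.
#[inline]
fn zeroize_array<const N: usize>(x: &mut [u8; N]) {
    zeroize_bytes(x.as_mut_slice());
}
```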
For a `u8` slice with an unknown but largish size you can also use `align_to_mut`:
```rust
pub fn zeroize_slice(x: &mut [u8]) {
    unsafe {
        // Use __m128 on x86/x86_64.
        // `align_to_mut` splits the slice into an unaligned prefix,
        // an aligned middle of u64s, and an unaligned suffix.
        let (prefix, middle, suffix) = x.align_to_mut::<u64>();
        for b in prefix {
            core::ptr::write_volatile(b, 0);
        }
        for w in middle {
            core::ptr::write_volatile(w, 0);
        }
        for b in suffix {
            core::ptr::write_volatile(b, 0);
        }
    }
}
```
The use of `write_volatile` was something of a hack to ensure that the writes would not be elided by the compiler. But inline assembly should ensure that as well, and permit a highly performant implementation.
> The use of `write_volatile` was something of a hack to ensure that the writes would not be elided by the compiler. But inline assembly should ensure that as well, and permit a highly performant implementation.
Yes, I understand this. If it's a “normal” write in a loop, LLVM would be able to recognize it and generate efficient instructions or a call to `memset`.
Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile `memset`, typically realized by using some kind of compiler fence, often implemented as invoking `memset` in conjunction with an assembly-based dataflow optimization barrier.
While Rust has bindings to LLVM's volatile memset intrinsics, they are perma-unstable, and no work has been done on a stable API AFAIK: https://doc.rust-lang.org/std/intrinsics/fn.volatile_set_memory.html
But perhaps inline assembly could provide the next best thing, albeit on an architecture-by-architecture basis.
We cannot use `write_volatile` on a `*mut __m128` that we got from a `[u8; 32]`, since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the `[u8; 32]` into aligned and unaligned parts. In theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.
> We cannot use `write_volatile` on a `*mut __m128` that we got from a `[u8; 32]`, since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the `[u8; 32]` into aligned and unaligned parts. In theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.
I was suggesting using `[__m128; N]` from the start, and transmuting to `&[u8; N * 16]` when necessary.

As for splitting into aligned and unaligned parts, that's exactly what `align_to_mut` does.
> Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile `memset`, typically realized by using some kind of compiler fence, often implemented as invoking `memset` in conjunction with an assembly-based dataflow optimization barrier.
Oh, this reminds me: now that we have stable inline assembly, this can be done too, and it's not architecture-dependent:
```rust
pub fn zeroize(x: &mut [u8]) {
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        // Empty asm block that "reads" the pointer, so the writes
        // above cannot be optimized away.
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) x.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

pub fn f() {
    let mut x = [0u8; 32];
    zeroize(&mut x);
}
```
Playground: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=32fc07b8173bd2ce233eaa9bf0d8d81a

`zeroize` in `f` is optimized to two `movaps` stores, but is not optimized out.
@sopium I have a question about the use of `readonly` in the assembly options. (This is mostly for my own understanding.)
I watched a talk by Chandler Carruth (a major contributor to LLVM) a long time ago, where he described creating similar functions in order to defeat the optimizer for benchmarking purposes: https://www.youtube.com/watch?v=nXaxk27zwlk&t=2476s
```cpp
static void escape(void* p) {
    // Empty volatile asm that "uses" p and clobbers memory.
    asm volatile("" : : "g"(p) : "memory");
}
```
I realize that this is a C++ talk and not Rust, but they both go through LLVM in the end, so there's a lot of overlap.
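That C++ barrier can be sketched in stable Rust roughly as follows. This is my own sketch, not code from the talk or the thread; the choice of `asm!` options is an assumption:

```rust
use core::arch::asm;

// Rough Rust analogue of the C++ `escape`: an empty asm block that takes the
// pointer as an input. Because the block is NOT marked `pure`, `nomem`, or
// `readonly`, the optimizer must assume it may read or write any memory
// reachable through `p`, so earlier writes cannot be elided as dead.
fn escape<T>(p: *mut T) {
    unsafe {
        asm!("/* {0} */", in(reg) p, options(nostack, preserves_flags));
    }
}
```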
In his talk he seems to say that it's important for the optimizer to believe that:

(1) This asm has side effects that the optimizer cannot know about. In C++ that's done with `volatile` asm, and in Rust I think that's done by not marking your asm `pure`.

(2) This asm potentially clobbers all memory. So any reads from memory after this point have to actually go back to memory, and any values cached in registers are dirty.
In your block you use the `readonly` flag, and I'm trying to understand what this does and why it is good.
So, the optimizer has to believe that (1) the asm has side effects it cannot know about, (2) it may read any reachable memory (so earlier writes can't be elided as dead), and (3) it does not write memory, change flags, or clobber registers.

So (3) is good because it will help the efficiency of code using this block.
But it's hard for me to understand: if we're promising the optimizer that flags are not changed, registers are not changed (only the `in(reg)` for `ptr` is read from), and memory is not changed, what is the side effect that the compiler still has to be worried about, which prevents it from dropping this block? Or is the model just that it still has to assume there might be something beyond this that it doesn't understand?
@cbeck88
AFAIK `readonly` applies only to memory of the current process; from the compiler's PoV the asm block may in theory perform a syscall, and that would still be well within the bounds of the `asm!` options contract.
I see, thank you
It looks to me that one barrier to actually creating a patch that does this is the need for some form of the specialization language feature.
We currently have this impl:

```rust
/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].
impl<Z, const N: usize> Zeroize for [Z; N]
where
    Z: Zeroize,
{
    fn zeroize(&mut self) {
        self.iter_mut().zeroize();
    }
}
```
If we wanted to also add impls for the special case `Z = u8`, these would be overlapping impls, and we obviously cannot remove the existing impls.
I had this same problem in a hashing library: https://github.com/mobilecoinfoundation/mobilecoin/blob/b07c17fd2e1946c15b3a38c30b0fb1dc4ab516e3/crypto/digestible/src/lib.rs#L218
What I decided to do there was simply not support my trait on the primitive type `u8`, so that I could have my nice impl for `[T]` and a nice impl for `[u8]`. Since it's extremely rare that someone actually needs to put a single `u8` in a struct, I never actually had a problem, and I was able to do this kind of optimization without waiting for specialization to land.
In your case you already shipped a stable API which supports `u8`, so you probably don't want to take this approach.
`u8` is pretty commonly used as a carrier for masked boolean values when writing constant-time code.
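For illustration, masked booleans in constant-time code look something like the following. This is a sketch with hypothetical names (`ct_eq`, `ct_select`); real code would typically use a vetted crate such as `subtle`:

```rust
// Expand an equality test into an all-ones (0xFF) or all-zeros (0x00) byte
// mask, without branching on secret data.
fn ct_eq(a: u8, b: u8) -> u8 {
    let x = a ^ b; // 0 iff a == b
    // MSB of (x | -x) is set iff x != 0, so this is 1 if unequal, 0 if equal.
    let nonzero = (x | x.wrapping_neg()) >> 7;
    // Map 0 -> 0xFF (equal), 1 -> 0x00 (unequal).
    (nonzero ^ 1).wrapping_neg()
}

// Select between two values using a mask, again without branching.
fn ct_select(mask: u8, if_set: u8, if_clear: u8) -> u8 {
    (mask & if_set) | (!mask & if_clear)
}
```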
Good point
So do you see a path forward for this? Or just wait for specialization?
It seems to me we could make an optimized `zeroize_bytes` function based on the approach here, and people could either implement `Zeroize` manually and call out to it, or we could make something like a `#[zeroize(with = ...)]` attribute for the `zeroize-derive` proc macro, similar to what they do in serde. Is that something you would support?
A free function works, although it is somewhat inelegant.
It might be possible to optimize the impl on `[T]` generically: pack as many copies of `<T as Default>::default()` as can fit into a quadword, write quadwords with `movups` until the remaining space in the slice is less than a quadword, then fall back to writing elements one at a time to fill the remaining space.
Something like this:

```rust
// Pseudocode sketch: `movups` and `slow_path` are stand-ins, and the `const`
// items depending on `T` would need const generics to actually compile.
const QUADWORD_SIZE: usize = 8;
const SIZE_OF_T: usize = mem::size_of::<T>();

if (SIZE_OF_T < QUADWORD_SIZE) && (QUADWORD_SIZE % SIZE_OF_T == 0) {
    const COPIES_IN_QUADWORD: usize = QUADWORD_SIZE / SIZE_OF_T;
    let quadword = [T::default(); COPIES_IN_QUADWORD];
    // `by_ref` so the remainder can still be taken after the loop.
    let mut iter = slice.chunks_exact_mut(COPIES_IN_QUADWORD);
    for chunk in iter.by_ref() {
        movups(chunk, &quadword as *const T as *const u64);
    }
    slow_path(iter.into_remainder());
} else {
    slow_path(slice);
}
```
Interesting idea, maybe this is a good use-case for https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut
Looking more at the code, it looks to me that we already have this impl (maybe not directly relevant to OP):
```rust
/// Impl [`Zeroize`] on slices of types that can be zeroized with [`Default`].
///
/// This impl can eventually be optimized using a memset intrinsic,
/// such as [`core::intrinsics::volatile_set_memory`]. For that reason the
/// blanket impl on slices is bounded by [`DefaultIsZeroes`].
///
/// To zeroize a mut slice of `Z: Zeroize` which does not impl
/// [`DefaultIsZeroes`], call `iter_mut().zeroize()`.
impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,
{
    fn zeroize(&mut self) {
        assert!(self.len() <= isize::MAX as usize);

        // Safety:
        //
        // This is safe, because the slice is well aligned and is backed by a single allocated
        // object for at least `self.len()` elements of type `Z`.
        // `self.len()` is also not larger than an `isize`, because of the assertion above.
        // The memory of the slice should not wrap around the address space.
        unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };
        atomic_fence();
    }
}
```
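For context, the `volatile_set` and `atomic_fence` helpers referenced above might look roughly like this. This is a hypothetical sketch following the names in the snippet; the crate's actual internals may differ:

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Fill `count` elements starting at `dst` with `src` using volatile stores,
/// which the compiler may not elide or coalesce away.
///
/// Safety: `dst` must be valid and properly aligned for `count` writes of `T`.
#[inline]
unsafe fn volatile_set<T: Copy>(dst: *mut T, src: T, count: usize) {
    for i in 0..count {
        ptr::write_volatile(dst.add(i), src);
    }
}

/// Compiler fence preventing subsequent memory accesses from being
/// reordered before the wipe.
#[inline]
fn atomic_fence() {
    compiler_fence(Ordering::SeqCst);
}
```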
It's too bad that `Vec<T>`, `Box<[T]>`, and `[T; N]` don't seem to call into this impl. I guess they can't, because they don't have the `DefaultIsZeroes` bound? And that's where we'd need negative trait bounds or specialization or something in order to let them do it in the cases when that bound does hold.

But maybe this is the place to do this kind of optimization work, and people can call `foo.as_mut_slice().zeroize()` or whatever when the perf matters, for now.
The impls on slice types call `volatile_set`, so that seems like the proper place to provide an optimized implementation for filling slices.
Here's the thing: I bet that `volatile_set` is already going to be well understood by the compiler, and that splitting this up for alignment is going to be done by LLVM, so I feel like writing out the `align_to_mut` stuff is probably going to be wasted effort. (Could look at the code gen and see.) I know I've seen LLVM generate all of this before for `memset` calls.
Here's an idea: maybe we can refactor things so that we don't have special-casing based on the `DefaultIsZeroes` trait?
What if we made `Zeroize` like this:
```rust
pub trait Zeroize {
    /// The bit pattern that we will set objects of this type to when zeroizing them
    fn zero_pattern() -> Self;

    /// Write the zero pattern over an object; can't be optimized away
    fn zeroize(&mut self) {
        volatile_write(self, Self::zero_pattern());
        atomic_fence();
    }
}
```
Then the idea is we could have `impl<Z> Zeroize for [Z]` without any additional bounds, using the same code as above, except it would use `Z::zero_pattern()` rather than `Z::default()`. And we would similarly be able to implement `Zeroize` for `Vec<Z>`, `Box<[Z]>`, etc. in a way that calls into the optimized implementation for `[Z]`.
I think we might not need to wait for specialization if we did that, or something along these lines.
First, you're proposing breaking changes to a post-1.0 crate which exposes a trait-oriented API. Those would only be accepted as a last resort.
But also, `zeroize` is designed to work on structured types that don't necessarily implement `Copy`. That's why `zeroize` takes `&mut self`: so it can mutate data in place.
`DefaultIsZeroes` exists to provide a strategy for zeroizing `Copy` types by replacing them with the `Default` value, but that approach doesn't work for non-`Copy` types, and `zeroize` is very much designed to support both.
Yeah, I see that. It's going to get more complicated when you want the implementation for `[Z]` not to assume that `Z` is `Copy` anymore, so that it can be parametric.
Intuitively, what I'd like to say is: what if we try to implement the general case, where `Z` might have destructors, putting calls to `drop_in_place` etc. as appropriate, which become no-ops when `Z` is `Copy`? We could try to be parametric over all types. At least it's not obvious to me that "that approach doesn't work for non-`Copy` types". Maybe it's obvious to you; you've thought about it more.
I understand it's a breaking change, just trying to think about directions that might make things simpler or better.
The big problem with any breaking change is that, like `serde`, `zeroize` contains traits which compose across many crates. It has ~500 downstream dependencies, and any kind of upgrade needs to be coordinated across all of them.
So really breaking changes need to be reserved for things that absolutely must make them. And I don't see any reason that mandates such a drastic change.
> it's going to get more complicated when you want the implementation for `[Z]` not to assume that `Z` is copy anymore
That's not an issue. The impl of `Zeroize` for `[Z]` already bounds on `DefaultIsZeroes`, which in turn bounds on `Copy`, `Default`, and `Sized`, so we can write an optimized implementation which does a simple copy. It was designed that way from the start.
The impl on `IterMut` can be used to zeroize `[Z]` where `Z` is non-`Copy`, i.e. `non_copy_slice.iter_mut().zeroize()`.
I'm just trying to understand: even if we did optimize the `impl<Z: DefaultIsZeroes> Zeroize for [Z]`, what would I, as a user, have to do to actually get it?
Because if I have code like the OP's:

```rust
let mut x = [7u8; 32];
x.zeroize();
```

it's not going to call the zeroize impl for slices; it's going to call the `[T; N]` one.
Suppose I have code like:

```rust
#[derive(Zeroize)]
struct A {
    a: Vec<u8>,
}
```

This is not going to call the `impl<Z: DefaultIsZeroes> Zeroize for [Z]` impl that we're proposing to optimize, which is kind of unfortunate. So optimizing this stuff feels kind of like wasted effort until this is somehow improved.
In the meantime, from experiments it looks like

```rust
let mut x = [7u8; 32];
(&mut x as &mut [u8]).zeroize();
```

will convince the compiler to pick the faster implementation, so maybe we could do the optimization we're talking about and document that this is what you have to do as a user if zeroizing one byte at a time isn't good enough?
And for a struct like `A` you probably have to implement `Zeroize` by hand, and you probably should only do that if you're sure that your vec is never getting resized? Or you would have to marry the impl for `Vec<Z>` with the optimizations we have for slices with `DefaultIsZeroes`.

It would be nice if we could make this easier for the user of the crate somehow, but I guess I don't see a way if we're not interested in discussing zeroize 2.0 here. Thanks.
> what as a user would I have to do to actually get it.
Borrow the value as a mutable slice, then call `zeroize`. It's a bit of a pain, but the best we can do without specialization.
> guess I don't see a way if we're not interested in discussing zeroize 2.0 here
I still don't see anything worth making breaking changes over until specialization lands, and even then it would probably still make sense to have a `specialization` feature which adds the specialized impls.
I inspected the generated assembly code and benchmarked `zeroize` for `[u8; 32]` on x86_64, and found it quite inefficient, storing one byte at a time: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=3f44f4b90e6af0eac0dcb1f649390329

On my Ryzen CPU, it takes ~7.8324 ns, or ~1 cycle per byte. Binary code size is also quite large.
Using inline assembly (just stabilized in 1.59) and SSE2, zeroing a `[u8; 32]` takes just 3 instructions and ~492.87 ps (~16 bytes per cycle). So it might be something worth optimizing/documenting.
If you do not want to use inline assembly, maybe you should encourage using larger types or SIMD types, e.g., `[u64; 4]` or `[__m128; 2]` instead of `[u8; 32]`. Using `write_volatile` on `*mut __m128` generates code as compact and efficient as the assembly above.
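As a sketch of that suggestion, here is a portable variant using `u128` in place of `__m128` so it isn't x86-specific. The `AlignedKey` wrapper and `zeroize_key` are hypothetical names; the `#[repr(align(16))]` wrapper is needed because a bare `[u8; 32]` only guarantees byte alignment, which would make the wide volatile stores UB:

```rust
use core::ptr;

// Force 16-byte alignment so the buffer can legally be written as two u128s.
#[repr(align(16))]
struct AlignedKey([u8; 32]);

fn zeroize_key(key: &mut AlignedKey) {
    // Valid cast: the wrapper guarantees alignment >= align_of::<u128>().
    let p = key.0.as_mut_ptr() as *mut u128;
    unsafe {
        // Two volatile 16-byte stores that the compiler may not elide.
        ptr::write_volatile(p, 0);
        ptr::write_volatile(p.add(1), 0);
    }
}
```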