PSeitz / lz4_flex

Fastest pure Rust implementation of LZ4 compression/decompression.
MIT License
460 stars 31 forks source link

Implementation of decompress_with_dict() is unsound #19

Closed Shnatsel closed 3 years ago

Shnatsel commented 3 years ago

The following code is unsound because it calls Vec::set_len() without initializing the vector first:

https://github.com/PSeitz/lz4_flex/blob/053483689099420c8576a378d8bf4e1d1fe2964f/src/block/decompress.rs#L484-L490

The documentation on set_len() states:

The elements at old_len..new_len must be initialized.

To make this operation sound, you need to treat the spare capacity of the Vec as MaybeUninit and only call .set_len() after all the elements have been initialized.

Sadly the convenience method to do so is nightly-only. If you roll your own, you will need to follow the stdlib implementation closely because deconstructing the Vec into pointers and especially having a &mut reference is tricky in terms of aliasing. Fortunately miri can verify its correctness for you.

arthurprs commented 3 years ago

The fact that this is being labeled as "unsound" is saddening. You are pointing to one of many usages of set_len here, it would require changing half the instances of the &[u8] in the library to be &[MaybeUninit<u8>]. That sounds silly. There must be a better way right?

Shnatsel commented 3 years ago

That sounds silly. There must be a better way right?

What makes you think so?

The documentation on .set_len() explicitly states that you are not allowed to use it that way, and that such use causes undefined behavior. I don't know what other proof do you need.

If the issue is pervasive, then all such code must be fixed.

The possible fixes are:

  1. View the capacity of the Vec as &mut [MaybeUninit<u8>] and only then call set_len
  2. Write to the capacity of the Vec via ptr::write and only then call set_len
  3. Use methods such as push, extend, or the newly stabilized extend_from_within and never call set_len
  4. Initialize the Vec before treating it as &mut [u8]. This can be zero-cost in some cases if lowered to a request for already-zeroed memory from the OS, e.g. using vec![0; len].
T-Dark0 commented 3 years ago

The fact that this is being labeled as "unsound" is saddening. You are pointing to one of many usages of set_len here, it would require changing half the instances of the &[u8] in the library to be &[MaybeUninit<u8>]. That sounds silly. There must be a better way right?

By the unsafe code guidelines, an API is unsound if it can cause UB without the user writing unsafe anywhere in their own code.

This API calls Vec::set_len, which is an unsafe method. Then, it violates the explicit contract of that method.

An unsafe method is, by definition, allowed to cause UB if an invariant is not upheld by the caller.

By putting these facts together, we can see that this code does something that can cause UB, and does not require the user to type unsafe. By definition of soundness in Rust, this code is Unsound.

arthurprs commented 3 years ago

Thanks for confirming :disappointed:

If the issue is pervasive, then all such code must be fixed.

Sadly it is, the "unsafe" encoder/decoder will try to skip initialization everywhere it can :disappointed:

PSeitz commented 3 years ago

Isn't this only an actual issue for structured data? Initializing a Vec<u8> with random bytes should be a valid initialization.

My understanding is that for invalid initialized data, it can be that the cpu accesses the data before you access it and interprets it, e.g. due to prefetching and reordering and this can cause chaos when the data is in an invalid state.

Shnatsel commented 3 years ago

Uninitialized memory is far trickier than "memory set to a single but arbitrary value". This is discussed in detail in https://www.ralfj.de/blog/2019/07/14/uninit.html

PSeitz commented 3 years ago

@Shnatsel thanks for the link, this is a really interesting read! Although I don't think this applies for data in a vec since the state machine can't possibly keep track of which parts of a vec are initialized and which not?

The example e.g. works with a vec https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=827b04cca344a35dba5b247e21acd43e

fn always_returns_true(x: u8) -> bool {
    x < 120 || x == 120 || x > 120
}

fn main() {
    let mut vecco:Vec<u8> = vec![];
    unsafe { vecco.set_len(1) };
    assert!(always_returns_true(vecco[0]));
}

This is addressed with 0.9.0 https://crates.io/crates/lz4_flex