der: Huge generic bloat due to NestedReader

dishmaker commented 1 year ago

Hi!

My .exe file using der crate for a huge protocol is too large (12 MB for a simple DER encoder/decoder) I've just used cargo bloat and cargo llvm-lines on my protocol crate and was surprised that der crate makes a lot of generic functions.

read_nested is actually the main cause!

Lines                  Copies               Function name
-----                  ------               -------------
34633 (2.6%,  2.6%)    187 (1.0%,  1.0%)    der::reader::Reader::read_nested

Because it is nesting itself recursively!

NestedReader<NestedReader<NestedReader<SliceReader>>>

You can easily check by adding field test: FailAsn1 inside your #[derive(Sequence)]

pub struct FailAsn1 {}

impl<'a> DecodeValue<'a> for FailAsn1 {
    fn decode_value<R: Reader<'a>>(
        reader: &mut R,
        header: Header,
    ) -> Result<Self> {
        // prints: reader type name: NestedReader<NestedReader<NestedReader<NestedReader<NestedReader<SliceReader>>>>>
        panic!("reader type name: {}", tynm::type_name::<R>())
    }
}

Not only that Reader gets generic-bloated but also all Decode and DecodeValue traits.

Really! Just look above, DecodeValue is generic over R: Reader so all objects implementing Decode also get bloated.

But the main cause is that:

NestedReader takes Self that can be another NestedReader:

https://github.com/RustCrypto/formats/blob/086ecd7d27aa3f6c5c97dc2a33adc7159008086b/der/src/reader.rs#L127-L131

tarcieri commented 1 year ago

This is an extremely immaturely phrased issue. Please act like an adult when reporting issues to this project. Knock off the caps lock.

My .exe file is too large (12 MB for a simple DER encoder/decoder)

What .exe? I can't help you without seeing the code.

Please post a complete example that illustrates the issue.

tarcieri commented 1 year ago

I'll also note that on Linux/Mac, the binaries produced by the test cases in this repo are around 1MB-1.5MB, even with debug symbols. This includes particularly complicated cases like x509-cert.

I have no idea what's bloating your particular binary to 12MB.

dishmaker commented 1 year ago

Thanks for responding!

This is an extremely immaturely phrased issue.

Ok, I've rephrased the issue.

The problem is simple: NestedReader<NestedReader<NestedReader<SliceReader>>> should become at most: NestedReader<SliceReader> and never nest itself deeper. I know it might be hard to solve without dyn Reader.

My main app lists every possible sequence in an enum for ease of use, GUI and encoding/decoding from and to JSON. But let's use your example: x509-cert

#[derive(Choice)]
pub enum AnyX509Sequence {
    Certificate(Certificate),
    TbsCertificateInner(TbsCertificateInner),
    PkiPath(PkiPath),
    CertReqInfo(CertReqInfo),
    CertReq(CertReq),
    TrustAnchorInfo(TrustAnchorInfo),
    CertPathControls(CertPathControls),
    // and so on...
}

Therefore, every single sub-sequence gets the DecodeValue trait implemented for:

NestedReader<NestedReader<NestedReader<SliceReader>>>
NestedReader<NestedReader<SliceReader>>
NestedReader<SliceReader>
SliceReader

dishmaker commented 1 year ago

Another huge problem with nesting is error propagation with offsets.

.at(inner.offset()))

Is .nested() in a right place?

https://github.com/RustCrypto/formats/blob/086ecd7d27aa3f6c5c97dc2a33adc7159008086b/der/src/reader.rs#L59 It should be called in .read_nested(), since that's where nesting gets done.

I think it doesn't work since sometimes i'm seeing twice or even 3x larger index than input binary file itself. Maybe because context-specific calls .nested() twice.

To solve:

generic nesting
offset in errors

i'd suggest using always a wrapped NestedReader<PemReader> with a stack:

stack: Vec<NestedOffsetLength>

Then, offset propagation won't be needed since self.position is always valid.

pub struct NestedReader<'i, R> {
    /// Inner reader type.
    inner: &'i mut R,

    /// Current nest bounds are at: stack.pop()
    stack: Vec<NestedOffsetLength>,
}

pub struct NestedOffsetLength {
    /// Allowed read bounds
    /// Actually, only bounds.end gets used
    bounds: Range<Length>,
}

And entire trickery with nesting becomes a bound check:

    /// Checks if current nest range allows reading at least len bytes.
    /// 
    /// Returns error if there's not enough remaining data in the nested input.
    fn advance_position(&mut self, len: Length) -> Result<()> {
        let nest = self.stack.last_mut().unwrap();
        let new_position = (self.inner.position() + len)?;

        // Just a simple bounds check
        // No need to do offset math
        if new_position > nest.bounds.end {
            Err(ErrorKind::Incomplete {
                expected_len: len,
                actual_len: self.remaining_len(),
            }
            .at(self.inner.position())
            .into())
        } else {
            Ok(())
        }
    }

tarcieri commented 1 year ago

We can look into improving NestedReader in the next breaking release to reduce the number of combinations that are monomorphized, however your original reported issue was it bloating your binary to 12MB, and I think it's unlikely to cause that.

Yes, it monomorphizes NestedReader for each nesting level, and that's multiplied by the number of underlying reader types you're using, but the total combinatorial explosion isn't that large.

Even in a program which uses both SliceReader and PemReader, and nests to 10 levels, it would be monomorphizing 20 times. The input LOC for the entire nested.rs module is 96. It's not a lot of code.

tarcieri commented 1 year ago

Regarding this suggestion:

i'd suggest using always a wrapped NestedReader with a stack:
stack: Vec<NestedOffsetLength>

The core functionality of der (including read_nested/NestedReader) supports "heapless" no_std targets which don't link liballoc, so it can't depend on Vec.

dishmaker commented 1 year ago

The input LOC for the entire nested.rs module is 96. It's not a lot of code.

It's not many LOC, but #[derive(Sequence)] generates a lot of code - impl for Decode trait, which is generic over that 10-times monomorhized reader.

tarcieri commented 1 year ago

I think the best way to address this would be to get rid of NestedReader and make handling nested length boundaries an internal concern of each of the reader types, i.e. instead of Reader::read_nested taking an F: FnOnce(&mut NestedReader<'n, Self>), it can just take F: FnOnce(&mut Self).

This would duplicate some logic but it would strictly limit monomorphization to bytes versus PEM.

dishmaker commented 10 months ago

Consider the following structure:

#[derive(Choice)]
pub enum Example {
    ChoiceA(Null),
    ChoiceB(Vec<Example>),

    /// ... may contain more choices
}

It won't compile:

error[E0275]: overflow evaluating the requirement `SliceReader<'_>: Sized`
  |
  = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`derdemo`)
  = note: required for `NestedReader<'_, SliceReader<'_>>` to implement `der::Reader<'_>`
  = note: 128 redundant requirements hidden
  = note: required for `NestedReader<'_, NestedReader<'_, NestedReader<'_, NestedReader<'_, ...>>>>` to implement `der::Reader<'_>`

dishmaker commented 10 months ago

I've made a refactor of der crate. https://github.com/dishmaker/formats/blob/0fd6ddfc1a3000b704eb6fe994de9ef49d3a470d/der/src/reader/nested.rs#L133-L148

/// Method of NestedDecoder, which no longer implements Reader
pub fn read_nested<T, F>(&mut self, len: Length, f: F) -> Result<T>
where
    F: FnOnce(&mut Self) -> Result<T>,
{
    // Save current position
    let old_end: Length = self.end_pos;

    // Swap end boundary with current nest
    self.end_pos = self.check_out_of_bounds(len)?;

    let ret = f(self);
    // Revert end position
    self.end_pos = old_end;

    // Return errors after resetting nested position
    self.finish(ret?)
}

It works (tests ok), but i had to use lifetimes everywhere. Especially lifetimes polluted the Decode and DecodeValue impls.

dishmaker commented 10 months ago

I've just found out that &mut NestedDecoder is a double-pointer That's sub-optimal for the CPU that de-references a &mut R.

A triple pointer would be with current der v0.7.8: &mut NestedReader<'_, NestedReader<'_, SliceReader<'_>>> just like &mut &mut &mut R

I suggest changing definition:

pub struct NestedDecoder<'i, R> {
    /// Inner reader type.
    inner: &'i mut R,

to:

pub struct NestedDecoder<R> {
    /// Inner reader type.
    inner: R,

That would also simplify lifetimes after my refactoring.

tarcieri commented 10 months ago

@dishmaker I already proposed simply getting rid of NestedReader, instead changing read_nested from a provided method to one each reader must impl (but they may be able to reuse parts of a common implementation).

It would be my preferred solution to this issue.

Unfortunately it's a breaking change and we're dealing with much more critical issues elsewhere in the project. I was planning on working on it in 2024.

dishmaker commented 10 months ago

I already proposed simply getting rid of NestedReader

Well, how would you track the outer nested boundary when entering another nest? Then the Reader trait needs to have something like:

fn set_nest_end(&mut self, end: Length)
fn get_nest_end(&self) -> Length

to keep track of outer bounds. I use the function call stack to remember the outer one in my NestedDecoder.

Also, i've just removed lifetimes from NestedDecoder and implemented:

#[derive(BitString)]
pub struct MyFlags {
    pub flag1: bool,
    pub flag2: bool,

https://github.com/dishmaker/formats/commit/0a344b2c2a27a94b0d2db78e665750e1e6a57662#diff-98e15f97ea64f72a4f09ecde11279914de7f6e362304268fd1d9c26b934378e6R467

tarcieri commented 10 months ago

Well, how would you track the outer nested boundary when entering another nest?

In the reader type itself.

Well, how would you track the outer nested boundary when entering another nest? [...]

I use the function call stack

That was my plan as well. It can store some state internally, and keep other state on the call stack in the read_nested method.

Then the Reader trait needs to have something like: set_nest_end/get_nest_end

Nothing like that is necessary, at least in the public API. It can all be an implementation detail of read_nested.

dishmaker commented 10 months ago

Nothing like that is necessary, at least in the public API. It can all be an implementation detail of read_nested.

It is necessary, because read_slice() and remaining_len() relies upon current nest bounds. Without wrapping reader inside another security/out-of-bounds structure, the reader may eat too many bytes.

Also, i've correctly implemented peeking by simplifying peek_byte into peek_bytes https://github.com/dishmaker/formats/blob/c5409e5a8d77a528b8810a43af1c3ed2545a1aa2/der/src/reader/nested.rs#L209

tarcieri commented 10 months ago

It is necessary, because read_slice() and remaining_len() relies upon current nest bounds.

Again, that can be stored in the reader type, which those methods are defined on.

tarcieri commented 10 months ago

Also, i've correctly implemented peeking by simplifying peek_byte into peek_bytes

I'm not sure your implementation is going to work on PemReader?

dishmaker commented 10 months ago

Again, that can be stored in the reader type

You can't store u32 in a trait. Forcing users to implement it on their own is painful. Eg. what if I want to implement my HexReader? All nesting checks in read_slice() are where then?

I'm not sure your implementation is going to work on PemReader?

I make sure there's always 8 bytes remaining to peek: https://github.com/dishmaker/formats/blob/c5409e5a8d77a528b8810a43af1c3ed2545a1aa2/der/src/reader/pem.rs#L119

tarcieri commented 10 months ago

You can't store u32 in a trait.

You store u32 (or rather, Length), in a type which impls the Reader trait, and query it via the accessor methods you named (e.g. remaining_len), which as trait methods have access to the type's internal state. I'm not sure why this is confusing?

Forcing users to implement it on their own is painful.

This isn't a trait which is really meant to have a whole lot of downstream impls by other users.

Eg. what if I want to implement my HexReader?

It would be good to add HexReader to the der crate itself. Otherwise, I'm not sure what other Reader impls would even make sense.

I make sure there's always 8 bytes remaining to peek

That's a similar approach used by peek_header

dishmaker commented 10 months ago

Otherwise, I'm not sure what other Reader impls would even make sense.

GzipReader, ZlibReader, ZstdReader, ZstdPemReader, AesDeflateReader etc..

And most importantly, my: DebugSliceReader which tracks a lot of things to make debugging huge structs/choices easier.

That's a similar approach used by peek_header

But having both peek_byte and peek_header isn't elegant. Another function forced onto the trait user's impl.

tarcieri commented 10 months ago

GzipReader, ZlibReader, ZstdReader, ZstdPemReader, AesDeflateReader etc..

If you actually have legitimate use cases for these, feel free to open a separate issue, but it seems more like you are inventing use cases for the sake of argument.

But having both peek_byte and peek_header isn't elegant. Another function forced onto the trait user's impl.

They accomplish two different purposes. I’m not sure why you would even say this. You want to foist complexity onto downstream consumers for no reason?

dishmaker commented 10 months ago

for the sake of argument

Of course AesDeflateReader is my imaginery reader, but someone might want to implement it in their crate.

They accomplish two different purposes.

Edit: moved to #1279

tarcieri commented 10 months ago

@dishmaker please open a separate issue for that, because it’s irrelevant to this problem

RustCrypto / formats

der: Huge generic bloat due to NestedReader #1228