briandorsey / wavrw

wavrw is a command line tool to read and write WAV files with a focus on visualizing the structure of files and parsing metadata.

[Discussion] Quick sketch for unified chunk id and size handling #25

Closed: kitlith closed this issue 6 months ago

kitlith commented 6 months ago

In short, the concept is to make a generic struct that handles the shell of a chunk (the Type/Length/Value portion) so that all you have to worry about for each new chunk is "what's inside the chunk?" and not "what if the chunk is actually too small/large?"

Here's an implementation of that concept:

use core::fmt::Debug;

use binrw::{binrw, BinRead, BinWrite, PosValue, io::{TakeSeekExt, Cursor, Seek}};

#[binrw]
#[brw(little)]
#[derive(Debug, PartialEq)]
// const generics don't support array types yet, so let's just encode it into a u32
struct KnownChunk<const MAGIC: u32, T: for<'a> BinRead<Args<'a>=()> + for<'a> BinWrite<Args<'a>=()>> {
    #[br(temp, assert(id == MAGIC))]
    #[bw(calc = MAGIC)]
    id: u32,

    // TODO: when writing, calc by querying content + extra_bytes.len(), or seek back and patch the size once it's known
    size: u32,

    #[br(temp)]
    #[bw(ignore)]
    begin_pos: PosValue<()>,
    // ensure that we don't read outside the bounds for this chunk
    #[br(try, map_stream = |r| r.take_seek(size as u64))]
    content: Option<T>,
    #[br(temp)]
    #[bw(ignore)]
    end_pos: PosValue<()>,

    // calculate how much was read, and read any extra bytes that remain in the chunk
    #[br(count = size as u64 - (end_pos.pos - begin_pos.pos))]
    extra_bytes: Vec<u8>
}

// const fn fourcc(magic: &[u8; 4]) -> u32 {
//     u32::from_le_bytes(*magic)
// }

// type Md5Chunk = Chunk<{fourcc(b"MD5 ")}, u128>;

macro_rules! Chunk {
    ($magic:expr, $content:ty) => { KnownChunk<{ u32::from_le_bytes(*$magic) }, $content> }
}

type Md5Chunk = Chunk!(b"MD5 ", u128);

The main thing I consider to be missing in this implementation is automatically setting the size when writing the chunk. Unfortunately, binrw doesn't really provide an easy way to do this right now. Something I remember considering in previous projects was manually creating a WriteLength/BinLength trait for the purpose, but there's no easy derive-driven tool as of yet.
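For illustration, a rough sketch of what that hand-rolled trait could look like (the BinLength name and the impl below are hypothetical; binrw itself provides no such trait or derive):

// Hypothetical trait: report how many bytes a value occupies when serialized.
// Each chunk body would implement this by hand until a derive exists.
trait BinLength {
    fn bin_length(&self) -> u64;
}

impl BinLength for u128 {
    fn bin_length(&self) -> u64 {
        core::mem::size_of::<u128>() as u64 // always 16 bytes on the wire
    }
}

// KnownChunk could then fill in its size field before writing, e.g.
// size = content.bin_length() + extra_bytes.len() as u64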

Here's some tests to go with it:

fn roundtrip<T: for<'a> BinRead<Args<'a>=()> + for<'a> BinWrite<Args<'a>=()> + PartialEq + Debug>(test: &T) -> T {
    let mut buffer = Cursor::new(Vec::<u8>::new());
    test.write_le(&mut buffer).unwrap();
    buffer.seek(std::io::SeekFrom::Start(0)).unwrap();
    T::read_le(&mut buffer).unwrap()
}

fn roundtrip_assert<T: for<'a> BinRead<Args<'a>=()> + for<'a> BinWrite<Args<'a>=()> + PartialEq + Debug>(test: T) {
    assert_eq!(test, roundtrip(&test));
}

#[test]
fn chunk_test() {
    // undersized chunk results in entire content ending up in extra_bytes
    roundtrip_assert(Md5Chunk { size: 4, content: None, extra_bytes: vec![0; 4]});
    // oversized chunk results in extra bytes ending up in extra_bytes
    roundtrip_assert(Md5Chunk { size: 20, content: Some(64), extra_bytes: vec![1; 4]});
    // trivial case
    roundtrip_assert(Md5Chunk { size: 16, content: Some(128), extra_bytes: vec![] });

    // I was too lazy to create byte buffers for these tests, so I used the fact that size isn't automatically generated and the extra_bytes field to craft the buffers I wanted.

    // "extra bytes" become content
    assert!(roundtrip(&Md5Chunk { size: 16, content: None, extra_bytes: vec![0; 16]}).content.is_some());
    // undersized chunk does not pull later bytes, if available
    assert_eq!(roundtrip(&Md5Chunk { size: 4, content: None, extra_bytes: vec![0; 20]}).extra_bytes.len(), 4);
}

Apologies for the random code dump. Let me know if you have any questions, or have interest in taking this concept further.

briandorsey commented 6 months ago

Thank you so much for continuing to think about this and taking the time to post here! (or maybe apologies for accidentally nerd-sniping (https://xkcd.com/356/) you? :)

In short, the concept is to make a generic struct that handles the shell of a chunk 
(the Type/Length/Value portion) so that all you have to worry about for each new chunk 
is "what's inside the chunk?" and not "what if the chunk is actually too small/large?"

I really like this line of thinking. I'll spend some time this evening with your idea.

briandorsey commented 6 months ago

@kitlith I spent some time this evening integrating this pattern and it works! Thanks again for posting it.

This pushed me into seriously considering not including the id and size in the various chunk structs at all, making them fully logical representations rather than a mix of logical and serialized info (especially size). And... since I'm already parsing chunks individually in a loop in metadata_chunks() (so that I can continue onward if parsing one chunk has an error)... I might as well remove the id and size from all of the chunk structs and directly parse each of them with take_seek in the loop. That will work without generics on the chunks, making them a bit easier to debug (the chunk type shows up directly instead of as a u32, due to the const generic limitations) and removing a layer of embedded structs. I think I'm going to give this approach a try next.
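Roughly the shape I have in mind, sketched out (the Md5 struct and parse_chunks below are illustrative only, not the final API):

use std::io::{Read, Seek, SeekFrom};

use binrw::{binread, BinRead, io::TakeSeekExt};

// A chunk body with no id/size fields: a purely logical representation.
#[binread]
#[br(little)]
#[derive(Debug)]
struct Md5 {
    md5: u128,
}

// Sketch of the loop: id and size are read here, and take_seek bounds each
// body so a buggy or malformed chunk can't consume its neighbors' bytes.
fn parse_chunks<R: Read + Seek>(reader: &mut R, end: u64) -> std::io::Result<()> {
    while reader.stream_position()? + 8 <= end {
        let mut id = [0u8; 4];
        reader.read_exact(&mut id)?;
        let size = u32::read_le(reader)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        let start = reader.stream_position()?;
        let result = match &id {
            b"MD5 " => Md5::read_le(&mut (&mut *reader).take_seek(size as u64))
                .map(|chunk| println!("{chunk:?}")),
            _ => Ok(()), // unknown chunk: skip it
        };
        if let Err(e) = result {
            eprintln!("failed to parse {:?}: {e}", String::from_utf8_lossy(&id));
        }
        // continue past this chunk whether or not parsing succeeded
        reader.seek(SeekFrom::Start(start + size as u64))?;
    }
    Ok(())
}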

Apologies for not using your solution directly (for now at least). Thank you so much for taking the time to write it up and share it, though! It inspired me to think more clearly about the data, and I think the end result will be much cleaner than it would have been otherwise.

kitlith commented 6 months ago

Apologies for not using your solution directly (for now at least).

No worries! As long as you found it thought-provoking and useful, I'm good. My goal in these kinds of designs is to model the data so that invalid/inconsistent states are unrepresentable. (But that may not always be the right approach if you want to be able to generate or read malformed files! So there's a trade-off there that I may not have made clear.)

And... since I'm already parsing chunks individually in a loop in metadata_chunks() (so that I can continue onward if parsing one chunk has an error)... I might as well just remove the id and size from all of the chunk structs and directly parse each of them with take_seek in the loop.

If you're making parsing of chunks rely on that loop, doesn't that reverse your earlier desire to be able to read just a single chunk of a specific type? Perhaps you don't want to use the generic struct I provided in your main logic, but you could still use it for reading a single chunk of a specific type.
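For example, keeping even a tiny helper like this around preserves that ability (a sketch reusing the Md5Chunk alias from above):

use std::io::{Read, Seek};

use binrw::BinRead;

// Minimal sketch: read one chunk of a known type from a reader that is
// already positioned at that chunk's id, via the KnownChunk shell above.
fn read_md5_chunk<R: Read + Seek>(reader: &mut R) -> binrw::BinResult<Md5Chunk> {
    Md5Chunk::read_le(reader)
}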

briandorsey commented 6 months ago

Ah, very good point. When parsing a single chunk, it will probably (and should) come with the chunk id and size. It's very easy to get overly focused. I think I'm probably overdue to create examples that use the API in a few different ways.

briandorsey commented 6 months ago

Thank you so much for the continued feedback and ideas @kitlith ! I created a branch & PR to iterate on these ideas. Please don't feel any obligation at all! But if you're so inclined, hopefully this makes it a bit easier.

So far, I've tried to consolidate a bunch of your suggestions into one macro-based implementation, and got that working for the MD5 chunk. Some things probably got a bit lost along the way, but the current state at least has a single point of truth for specifying each chunk id.

kitlith commented 6 months ago

Closing this issue for now in favor of consolidating future discussion in #30