JulianSchmid / etherparse

A rust library for parsing ethernet & ethernet using protocols.
Apache License 2.0
285 stars 54 forks source link

Thoughts on a PacketHeaders implementation without a &[u8] reference for payload? #40

Open robs-zeynet opened 1 year ago

robs-zeynet commented 1 year ago

I post this not necessarily as a bug or feature request, but for honest discussion. I'm not really a rust expert but I'm trying to learn and am curious if others have similar issues.

I commonly couple etherparse with the rust-pcap (https://docs.rs/pcap/latest/pcap/) library and write code similar to this:

fn main() {
    let mut cap = Device::lookup().unwrap().open().unwrap();

    let connection_tracker = ConnectionTracker::new(....); // custom TCP connection tracking lib

    while let Ok(packet) = cap.next_packet() {
        if let Ok(parsed) = ParsedHeaders::from_ethernet(packet.data) {
                  connection_tracker.process(parsed);
        }
    }
}

struct ConnectionTracker {
      cached_pkts: Vec<PacketHeaders>,
      // ....
}

impl ConnectionTracker {

     fn process(&mut self, headers: PacketHeaders) {
              if interesting(&headers) {
                     // cache the packet for later inspection
                     self.cached_pkts.push_back(headers);
              }
     }
}

... but of course this code doesn't work because PacketHeaders contains a payload member which is a &[u8] slice, which means that one has to declare the specific lifetime for anything that stores PacketHeaders (https://docs.rs/etherparse/0.13.0/etherparse/struct.PacketHeaders.html), which eventually forces ConnectionTracker to be in the 'static lifetime which is not what I want.

How do others work around this? I'd love tof this to be just a dumb thing that I don't get about Rust.

I'm currently hacking around this by declaring a new struct "SimplePacket" which is member for member the same as PacketHeaders, except that 'payload' is a Vec[u8] rather than a &[u8]. I understand that the &[u8] is for performance and prevents the code from needing a memcpy() under the covers, but I find it really hard to work around and was, ultimately - in a long winded way - if other people have the same issue, what they do, and if it's worth while considering either changing PacketHeaders to remove the &[u8] or maybe declaring a new (third!) packet holding struct simillar to the SimplePacket that I did in my code.

Feedback welcome and thank you again for the great library!

JulianSchmid commented 1 year ago

Hi @robs-zeynet ,

I agree there is potential to add support for this use case to the library. But I will first have to handle https://github.com/JulianSchmid/etherparse/issues/35 , which will require a bit of a redesign how "payloads" are handled. So I would takle https://github.com/JulianSchmid/etherparse/issues/35 first and this only after the payload re-design is done. Otherwise the chances are simply too high that things have to be redone as the structure has changed.

Initial Thoughts

I myself also ran into cases where I wished for some way to store packets for later processing. And if you have to store the packets, I agree then there is no other way than repackaging the data and using a Vec or other data-structure that owns the data. Offering a simple method to_owned() that allows you to transform a SlicedPacket or a PacketHeaders to a type that owns the data and that can be stored feels right to me.

But as I kind of tried to keep the library "allocation free", I would say a default feature should be added so that users of the crate can choose to "opt-out" of the allocating parts of the code.

Next I see the following use cases that could be supported:

Based on that I see the following options:

Option 1: Store the deserialized headers & payload

This is more or less what you are proposing. Adding a new struct that owns the data. Something along the lines of:

pub struct Packet {
    // Ethernet II header if present.
    pub link: Option<Ethernet2Header>,
    /// Single or double vlan headers if present.
    pub vlan: Option<VlanHeader>,
    /// IPv4 or IPv6 header and IP extension headers if present.
    pub ip: Option<IpHeader>,
    /// TCP or UDP header if present.
    pub transport: Option<TransportHeader>,
    /// Rest of the data that could not be decoded.
    pub payload: Vec<u8>,
}

Now I think the library should offer something like this but there is a thing I dislike: IpHeader actually has a very high memory footprint to support all possible IPv6 extension headers.

Version6(Ipv6Header, Ipv6Extensions)
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this variant is 9280 bytes

This means a there is a 9k memory price to pay for every packet that we store and it is not obvious to the user that this will be the case.

But if we want to allow for modification, I think there is currently no better option. At least not as long as there is no mutable slice implementation. And I am not quite yet ready to open up that bag of worms.

Option 2: Store the raw packet

Now this might sound incredibly stupid, but we can just store a Vec containing the complete raw original packet for later processing and then just "re-do" the decoding when needed. The obvious tradeoff is that we will have to re-do the unpacking later on when we want to do stuff with it.

I think getting out raw data form a SlicedPacket or a PacketHeaders with a method to_raw_owned() could be useful. But to be honest most users could already do that by calling to_owned() on the slice they used to decoded the message. So I see no priority right now to add methods like these to the library to resolve this ticket. Especially as the user can do it himself & as inconsistencies in PacketHeaders (e.g. contradicting data) is a problem I would like to takle separately.

Option 3: Store the raw packet together with the slices pointing to the different parts of the packet

Here the raw packet in Vec from would be stored together with a SlicedPacket. Something along the lines:

struct Packet {
    /// Raw packet
    data: Vec<u8>,
    /// Slices pointing into the data part.
    slice: SlicedPacket,
}

But there is a problem: The borrow checker really dislikes it if you store a slice to a Vec in the same struct as the Vec itself. And with good reason, as if you would dumbly "clone" struct you would get a slice pointing back to the original vec. Which would obviously be bad. We probably can get again get around that by implementing our own "clone" and doing a bunch of unsafe stuff, but I would instead go with the option 4 solution.

Option 4: Store the raw packet together with the offsets pointing to the different parts of the packet

In this option we would store the offsets the different headers start & end together with the raw packet data and offering a method to get a SlicedPacket out of it again:

struct Packet {
    data: Vec<u8>,
    offsets: HeaderOffsets, // offsets in data of the different headers
}

impl Packet {
    pub fn sliced<'a>(&'a self) -> SlicedPacket<'a> {
        // create a sliced packet from the offsets
        ...
    }
}

Converting the offsets back to slices might need some carefully tested unsafe code to be fast. But that is still better than actively bypassing the borrow checker.

Now this is the way I most probably would choose if I don't have to modify a packet. Compared to Option 1 it is less memory hungry and compared to option 2 it also preserves the decoded packet information. We are also not actively fighting the borrow checker like option 3 and potentially could even save some memory compared to option 3 as some headers have a fixed length and need no extra "end offset" or "lenght".

Final Thoughts

I would probably always go for option 4 but I think there is also a case to be made to support option 1 as well. Option 1 would be a good choice for people that want to modify the packet and option 4 is the right choice if you care more about the memory footprint.

So I guess the answer is that option 1 & 4 should be added.

robs-zeynet commented 1 year ago

Thanks for the detailed comments! Fwiw, in other projects what I've done is implemented your Option 1 and waited to implement Option 4 until there was a real need (which typically never came). In other words, you can either take this as me avoiding premature optimization or being lazy - your choice :-)

But certainly for my use-case, Option 1 is sufficient. In fact, reading through the code, the private struct PacketImpl already does 90+% of what I'm looking for, and would just need to be refactored to include the payload and made public.

How about this: if you think this is roughly inline with something you could accept, I can try to throw a pull-request up for you to consider. Feedback welcome... and thanks again for the great library.