genonullfree / stream-extractor

TCP Stream Extractor
BSD 3-Clause "New" or "Revised" License

High resource consumption on very large pcap files #14

Open Pommaq opened 4 months ago

Pommaq commented 4 months ago

My personal use case for this tool was to filter a very large pcap file into smaller files, one per TCP session, which matches the extract command. It did succeed in doing so, but I noticed a few issues along the way that can be resolved in a relatively simple manner.

  1. Per-packet processing time scales linearly (by my guesstimate) with the number of streams already seen in the input pcap file.
  2. High RAM consumption; in my use it reached about 80 GB.

Converting the per-packet linear scan to something closer to O(1) (guesstimate)

fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /*snip*/
    let mut output = Vec::<Stream>::new();
    /*snip*/
    'nextpkt: while let Some(pkt) = pcap_reader.next_packet() {
        /*snip*/
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);
        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            // Validate it is an IPv4 packet
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Validate it is a TCP packet and we have extracted it
                if let Some(si) = StreamInfo::new(&eth) {
                    if output.is_empty() {
                        /*snip*/
                        output.push(Stream::new(si, packet));
                        continue 'nextpkt;
                    } else {
                        // *This* is where the poor scaling comes from.
                        // Running this for each packet we receive, when we
                        // have 50000 streams, causes a noticeable slowdown.
                        for s in output.iter_mut() { 
                            if s.is_stream(&si) {
                                s.add_and_update(si, packet);
                                continue 'nextpkt;
                            }
                        }

                        output.push(Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /*snip*/

    if output.is_empty() {
        None
    } else {
        Some((header, output))
    }
}

The solution is to replace the vector with a HashMap, a HashSet, or similar: basically something with O(1) average lookup time. We can do this since "s.is_stream(&si)" simply compares StreamInfo, which is static data for the session.

Here is an example solution. Note, however, that we lose the order in which we saw each stream; the fix for that is to add an integer to "Stream" and increase it each time no matching Stream was found in the map.

fn read_pcap(input: &str) -> Option<(PcapHeader, Vec<Stream>)> {
    /*snip*/
    // Note that we replace "output" with a hashmap. StreamInfo is immutable
    // once a session has been identified, so this is A-O.K.

    let mut entries: HashMap<StreamInfo, Stream> = HashMap::new();
    /*snip*/
    while let Some(pkt) = pcap_reader.next_packet() {
        /*snip*/
        let pkt = pkt.unwrap();
        let packet = PPacket::new(&pkt);

        if let Some(eth) = EthernetPacket::new(&pkt.data) {
            if eth.get_ethertype() == EtherTypes::Ipv4 {
                // Decode the packet
                if let Some(si) = StreamInfo::new(&eth) {

                    // Just hash it and try to get it.
                    if let Some(s) = entries.get_mut(&si) {
                        // Stream already exists. Extend it
                        s.add_and_update(si, packet);
                    } else {
                        // No matching stream, create one.
                        entries.insert(si, Stream::new(si, packet));
                    }
                }
            }
        }
    }
    /*snip*/

    if entries.is_empty() {
        None
    } else {
        Some((header, entries.into_values().collect()))
    }
}

High RAM consumption

This stems from the tool retaining all seen pcap packets in an internal vector and only writing them to disk once everything has been extracted.

The solution is to not do that. In the previously proposed hashmap, simply map each StreamInfo to a channel Sender. Let the receiving end(s) write data into the corresponding pcap file, or perform tallying or scanning of the pcap file, depending on the subcommand issued. Adapting everything to this is simple since those ends would essentially iterate over a Receiver instead of a vector.

The reader can be implemented by:

  1. Starting a thread for each pcap file; this does not scale super well, but it's less resource hungry than today.
  2. Or having one reader thread that multiplexes over each Receiver, keeps track of which data source belongs to which file, and writes into it once data arrives.

I'd recommend the second option; it's slightly more complex but limits this tool to 2 threads (or 1 if it's implemented asynchronously, but that rewrite sounds annoying).

genonullfree commented 4 months ago

Thank you for this detailed issue as well. This tool was originally hacked together quickly and so this issue doesn't really surprise me. Just wanted to respond to say this may take me longer to address than the other issue since it's a bit more complicated.

Pommaq commented 4 months ago

No probs, I'll throw together a solution :) I think I made a neat one, although it slightly changes the prints from the tool.