BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.

heap_bytes is inaccurate #85

Closed FH0 closed 1 year ago

FH0 commented 3 years ago

I need to load a file containing 70,000 patterns.

Is there a way to reduce peak memory usage to run on resource-constrained devices?

BurntSushi commented 3 years ago

See https://docs.rs/aho-corasick/0.7.18/aho_corasick/struct.AhoCorasick.html#resource-usage

And: https://docs.rs/aho-corasick/0.7.18/aho_corasick/struct.AhoCorasick.html#method.heap_bytes

Here is where the main construction algorithm happens:

https://github.com/BurntSushi/aho-corasick/blob/4499d7fdb41c6279ea1367bccf3daca9cb06c36c/src/nfa.rs#L671

Patches that decrease memory usage are welcome, although it would be good to discuss them first.

I didn't go out of my way to use a gratuitous amount of memory, though.

Let's step through the code. Here is where a new state is created. In the worst case, a new state is created for every byte of input:

https://github.com/BurntSushi/aho-corasick/blob/4499d7fdb41c6279ea1367bccf3daca9cb06c36c/src/nfa.rs#L710-L728

Here is the implementation of add_state:

https://github.com/BurntSushi/aho-corasick/blob/4499d7fdb41c6279ea1367bccf3daca9cb06c36c/src/nfa.rs#L1189-L1195

This uses a "dense" (more memory-hungry) representation only when the depth (which is just the index of the current byte of the current pattern) is less than the configured dense depth. You could decrease this, but the default value is already pretty low at 2. So I think this means a maximum of 1 + 256 dense states can exist. The representation of a dense state tells us how much memory it uses: size_of::<S>() * 256, where S = usize by default. So on a 64-bit system, that's 8 * 256 = 2KB per state, and thus a worst case of 2KB * 257 = ~526KB just for the dense states alone. But that's somewhere around 1% of the memory usage that has you concerned, so let's continue.
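To make the arithmetic concrete, here is that worst-case dense-state calculation as a tiny standalone program (the constants mirror the discussion above; nothing here is the crate's actual code):

```rust
use std::mem::size_of;

fn main() {
    // S = usize is the default state ID type; 8 bytes on a 64-bit system.
    type S = usize;

    // A dense state holds one transition slot for each possible byte value.
    let dense_state = size_of::<S>() * 256; // 8 * 256 = 2048 bytes (2KB)

    // With the default dense depth of 2, only the root (depth 0) and at
    // most 256 depth-1 states can be dense.
    let max_dense_states = 1 + 256;

    println!("{} bytes", dense_state * max_dense_states); // 526336, ~526KB
}
```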

As the implementation of add_state shows, if a state isn't dense, then it's sparse. A sparse state has this representation:

https://github.com/BurntSushi/aho-corasick/blob/4499d7fdb41c6279ea1367bccf3daca9cb06c36c/src/nfa.rs#L391

If a sparse state has all 256 transitions, then it will actually use more memory than a dense state: 256 * (size_of::<u8>() + size_of::<S>()) = 2,304 bytes for S = usize on a 64-bit system. However, the premise of a sparse state is that this case is pretty rare, and the more common case is a very small number of transitions. So for, say, an average of 10 transitions per sparse state, you get memory usage of 10 * (size_of::<u8>() + size_of::<S>()) = 90 bytes.
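The same sanity check for sparse states, using the per-transition cost quoted above. One caveat worth flagging: if the pairs are stored inline in a single Vec<(u8, S)>, alignment padding rounds each pair up to 16 bytes on a 64-bit target, so the 9-byte figure is really a lower bound:

```rust
use std::mem::size_of;

fn main() {
    type S = usize;

    // The logical cost of one sparse transition, as computed above.
    let per_transition = size_of::<u8>() + size_of::<S>(); // 1 + 8 = 9 bytes

    // Degenerate case: a fully populated "sparse" state costs more
    // than a dense one.
    println!("{} bytes", 256 * per_transition); // 2304, ~2.3KB

    // The assumed common case of ~10 transitions per state.
    println!("{} bytes", 10 * per_transition); // 90

    // What a Vec<(u8, S)> element actually occupies after padding.
    println!("{} bytes", size_of::<(u8, S)>()); // 16
}
```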

In addition to this though, there is also the overhead of each State. Here is its representation:

https://github.com/BurntSushi/aho-corasick/blob/4499d7fdb41c6279ea1367bccf3daca9cb06c36c/src/nfa.rs#L290-L300

Assuming matches is empty in about 90% of states, the common-case overhead of a state is just its own size. Eyeballing it, with S = usize on a 64-bit system, it looks to be about 70 bytes or so?
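As a sanity check on that eyeball estimate, here is a hypothetical mirror of the layout linked above; the field and variant names are guesses for illustration, not the crate's actual definitions:

```rust
use std::mem::size_of;

// A made-up stand-in for the State representation sketched above,
// just to size-check the "about 70 bytes" estimate.
enum Transitions<S> {
    Sparse(Vec<(u8, S)>),
    Dense(Vec<S>),
}

struct State<S> {
    trans: Transitions<S>,
    fail: S,
    matches: Vec<(usize, usize)>,
    depth: usize,
}

fn main() {
    // On a typical 64-bit target this prints 72: a 32-byte enum (24-byte
    // Vec plus tag and padding), an 8-byte fail, a 24-byte Vec header for
    // matches, and an 8-byte depth. That agrees with the ~70-byte guess.
    println!("{} bytes", size_of::<State<usize>>());
}
```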

So let's bring all of this together. You unfortunately didn't really provide enough information to make a precise calculation (indeed, an executable reproduction would be ideal), so I'll assume that you have 70,000 patterns of length around 10. In the worst case (every byte gets its own state), you end up with around 700,000 states. Just looking at the overhead of State alone, we get 70 * 700,000 = 49 MB right there. At that point, it's probably easy to see how you could get to 89 MB just based on the size of your data. And this is just the size of the automaton itself. IIRC, there is some ancillary storage used at construction that is thrown away later, but since you didn't provide any code or any details about how you're constructing the automaton, I don't know whether that's applicable to your case.
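Putting the pieces together under those assumptions (70,000 patterns, ~10 bytes each, worst case of one new state per byte):

```rust
fn main() {
    // Back-of-the-envelope under the assumptions above.
    let states: u64 = 70_000 * 10; // ~700,000 states, worst case

    // ~70 bytes of fixed overhead per State (eyeballed above), before
    // counting the transitions each state owns.
    println!("{} MB", states * 70 / 1_000_000); // 49 MB
}
```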

One interesting bit here is that you claim that memory usage actually decreases after a period of time. Since the memory calculation above is mostly just about the NFA itself, it's not clear to me how memory usage could drop as much as you report without freeing the NFA itself.

Other than perhaps using a smaller S (although, I'm guessing your data might be too big to go lower than u32), I don't see any obvious changes to the code to decrease memory usage substantially. Perhaps there are better ways to build the trie itself though.
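For reference, here is a sketch of what "a smaller S" looks like with the 0.7 builder API: build_with_size selects the state ID type, and dense_depth controls how many dense states get created. Whether u32 state IDs actually fit depends on how many states your patterns produce; the patterns here are placeholders.

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let patterns = &["foo", "bar", "baz"];

    let ac = aho_corasick::AhoCorasickBuilder::new()
        .dense_depth(1) // default is 2; fewer dense states, slower lookups
        .build_with_size::<u32, _, _>(patterns)?;

    // State IDs are now u32 instead of usize, roughly halving the
    // per-state-ID cost on a 64-bit target.
    println!("{} heap bytes", ac.heap_bytes());
    Ok(())
}
```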

> resource-constrained

Have you considered using regex-automata to build a DFA offline and then embed it into your executable? If you can spend extra compute, you can even minimize it, which when combined with the various space saving techniques in that crate, will likely get you pretty dang close to the theoretically minimal memory usage needed to represent your data as a finite state machine. Of course, when using regex-automata, it will use much more memory during construction time, so this is only viable if you can do it in an offline manner. The bstr crate uses this technique internally if you want to look at an example.
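A rough sketch of that offline idea against regex-automata 0.1 (the version contemporary with this thread); treat the exact builder options as assumptions to verify against its docs. The expensive build/minimize step runs on a development machine, and only the finished DFA ships to the device:

```rust
use regex_automata::DFA; // trait providing is_match/find on the built DFA

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Join literal patterns into a single alternation. Real patterns
    // would need regex metacharacters escaped first.
    let patterns = vec!["foo".to_string(), "bar".to_string()];
    let joined = patterns.join("|");

    // Minimization costs a lot of time and memory up front, which is
    // fine offline, and shrinks the DFA toward its theoretical minimum.
    let dfa = regex_automata::dense::Builder::new()
        .minimize(true)
        .byte_classes(true)
        .build(&joined)?;

    // From here, the DFA's to_bytes_* methods produce a buffer that can
    // be embedded in the executable (e.g. with include_bytes!) and
    // deserialized cheaply at runtime, as bstr does internally.
    println!("{}", dfa.is_match(b"xxbarxx")); // true
    Ok(())
}
```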

FH0 commented 3 years ago

Thank you for your detailed answer.

I ran a lot of tests in WSL2, but I couldn't reproduce the drastic reduction in memory; that behavior is what I encountered on Windows.

Sorry I did not provide the corresponding test code earlier. Here it is:

```rust
use std::{thread::sleep, time::Duration};

fn main() {
    // Generate 70,000 distinct patterns, each 15 bytes long
    // ("{:0<15}" left-aligns the number and pads it with '0').
    let mut data_vec = Vec::new();
    for i in 0..70_000 {
        data_vec.push(format!("{:0<15}", i));
    }

    // Build the automaton and print its claimed heap usage.
    let ac = aho_corasick::AhoCorasick::new(data_vec);
    println!("{}", ac.heap_bytes());

    // Block forever so RES can be inspected externally (e.g. with top).
    loop {
        sleep(Duration::from_secs(1));
    }
}
```

Unfortunately, this test does not reproduce the behavior I described above. I don't want to test on Windows, because resource-constrained devices won't be running Windows anyway.

In my test, the program printed 12393232 (about 12MB) from heap_bytes, while occupying 118848KB (about 116MB) of RES.

BurntSushi commented 3 years ago

OK, so my analysis above looks correct then?

If there's a bug, it's that heap_bytes isn't accurate here. Indeed, it looks like it doesn't count the sparse state memory usage correctly.

(I cannot make sense of your comments about Windows, but they don't seem relevant here.)

BurntSushi commented 3 years ago

Using heaptrack, about 75MB comes from adding sparse states, and about 42MB comes from the sparse transitions themselves. That also seems in line with my analysis above.

FH0 commented 3 years ago

I am not sure whether you are waiting for a reply from me. It now seems the only thing left is the issue you mentioned: heap_bytes being inaccurate.