fkie-cad / fact_extractor

Standalone Utility for FACT-like extraction
GNU General Public License v3.0
81 stars 31 forks source link

Modify entropy calculation to use mmap instead of read_bytes() to reduce memory usage for large files #138

Open eclipsotic opened 5 months ago

eclipsotic commented 5 months ago

This is primarily relevant when extracting very large (e.g. 32GB+) firmware images.

There are other uses of read_bytes() and read() in the code that could be refactored to use mmap, but this is the only one that seemed strictly necessary. If you guys are interested in changing the other instances, I'm glad to look into it as part of this PR.

maringuu commented 1 month ago

Can you explain why this is an improvement? While in general I agree to you that memory mapping is a good idea (e.g. unblob does it this way) I do not see how this is an improvement here. Currently the file is read to memory, which is also what memory mapping does. How is this different?

jstucke commented 1 month ago

I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy but it's part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first

eclipsotic commented 1 month ago

Can you explain why this is an improvement? While in general I agree to you that memory mapping is a good idea (e.g. unblob does it this way) I do not see how this is an improvement here. Currently the file is read to memory, which is also what memory mapping does. How is this different?

Memory mapping does not read the entire file into memory; that's the crucial difference. mmap uses lazy loading, causing overall RAM usage to be much, much lower than reading the full file into memory. As a result, switching to mmap greatly reduces the RAM requirements when running the extractor on large (e.g. 32GB+) firmware images.

eclipsotic commented 1 month ago

I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy but it's part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first

This MR produces that behavior (i.e. only read in chunks as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it :smile:

jstucke commented 1 month ago

I think it would be even better to change avg_entropy() so that it takes e.g. a file pointer and only reads in chunks when calculating the entropy but it's part of https://github.com/fkie-cad/common_helper_unpacking_classifier/ so we would need to change it there first

This MR produces that behavior (i.e. only read in chunks as they are accessed). What you're describing is what mmap does, which is why I wanted to switch to it 😄

That does not seem to be entirely correct. I just tested all variants and mmap only uses less memory if you read only a chunk of the file, but if you read the entire file chunk by chunk, you will use the same amount of memory as with reading in the entire file up front (the only difference is that it gradually increases as the file is lazily read instead of using it all at once). Only reading in the file chunk-wise into the same buffer seems to really reduce the memory footprint.