Read file in chunks instead of reading it in memory at once?

wapsi commented 4 months ago

Pantastic seems to be a fantastic piece of software! But I benchmarked it a little bit and noticed that when it scans a very large file (10 GB or more), the Python process uses about as much RAM as the size of the file. Is that correct, or did I misanalyze it?

I think it would be more efficient if it read the file in chunks of somewhere between 128 KB and 100 MB (maybe that could even be a configurable parameter) instead of reading the whole file into RAM at once. Or, if it already does chunk-based reading, maybe it should free the memory used by a chunk as soon as that chunk has been scanned, before reading the next one. Of course, in that case the last few bytes of the previous chunk need to stay in RAM in case a PAN is split between two chunks.
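For illustration, here is a minimal Python sketch of the chunked read with carry-over described above. It is not Pantastic's code: `scan_stream`, the `CANDIDATE` pattern (a standalone run of 13 to 19 digits) and the 1 MiB default chunk size are all assumptions made for the example, and a real scanner would also need a Luhn check and separator handling.

```python
import io
import re

# Assumption for this sketch: a candidate PAN is a standalone run of 13-19 digits.
MAX_DIGITS = 19
CANDIDATE = re.compile(rb"(?<!\d)\d{13,19}(?!\d)")


def scan_stream(stream, chunk_size=1024 * 1024):
    """Yield (absolute_offset, digits) while holding about one chunk in memory."""
    tail = b""   # digits left over from the end of the previous window
    offset = 0   # absolute file position of window[0]
    while True:
        chunk = stream.read(chunk_size)
        window = tail + chunk
        if chunk:
            # A digit run touching the end of the window may continue in the
            # next chunk, so only scan up to the start of that trailing run.
            cut = len(window.rstrip(b"0123456789"))
        else:
            cut = len(window)  # end of file: nothing more can follow
        for match in CANDIDATE.finditer(window, 0, cut):
            yield offset + match.start(), match.group()
        if not chunk:
            break
        tail = window[cut:]
        offset += cut
        # A run longer than MAX_DIGITS can never match, so cap what is carried
        # over; MAX_DIGITS + 1 digits are enough to remember "too long".
        if len(tail) > MAX_DIGITS + 1:
            drop = len(tail) - (MAX_DIGITS + 1)
            tail = tail[drop:]
            offset += drop
```

Something like `list(scan_stream(open(path, "rb")))` would then scan a file while keeping roughly one chunk plus at most 20 carried-over bytes in memory, whatever the file size.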

Centurix commented 4 months ago

Thanks for trying it out. I suspect your benchmarking is correct and it uses as many resources as it can get its hands on. The case where I was using it involved a single server whose sole job was scanning a network for PANs, so resource monitoring wasn't a high priority at the time.

From memory, it uses mmap to handle file contents, so I'd say that chunk-reading the file would be a good option to reduce the memory usage without affecting performance too much. There may be some chunk edge overlap necessary for the digit grouping to work correctly, but I can't imagine that being a blocker.

Hmm, just looking at the source, it does appear to already be chunking the file reads into 1 MiB sections:

                    file_buffer = mm.read(1024**2)
                    if not file_buffer:
                        break

So it may be a case of the application not releasing memory. I'll take a look and see if there are gains to be had somewhere.
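One possible explanation, which is an assumption and not something confirmed in this thread: pages of a memory-mapped file that have been touched stay counted against the process until the kernel reclaims them or the map is closed, so reading through a 10 GB mmap in 1 MiB steps can still show a footprint that grows toward the file size. The sketch below shows one way to hint that scanned pages are no longer needed. `scan_mmap` and `handle_chunk` are hypothetical names, `mmap.madvise` requires Python 3.8+ on a platform that provides it, and the kernel is free to ignore the hint.

```python
import mmap

CHUNK = 1024 ** 2  # 1 MiB, the same read size as the snippet quoted above


def scan_mmap(path, handle_chunk):
    """Call handle_chunk(offset, data) for each 1 MiB chunk of the file.

    Note: mmap raises ValueError for an empty file, so callers may want to
    skip zero-length files first.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # madvise is not available everywhere (for example on Windows).
        can_advise = hasattr(mm, "madvise") and hasattr(mmap, "MADV_DONTNEED")
        offset = 0
        while True:
            data = mm.read(CHUNK)
            if not data:
                break
            handle_chunk(offset, data)
            if can_advise:
                # Advisory only: tell the kernel the scanned pages can be dropped.
                # offset stays a multiple of CHUNK, so it is page-aligned on
                # systems whose page size divides 1 MiB.
                mm.madvise(mmap.MADV_DONTNEED, offset, len(data))
            offset += len(data)
```

A simpler alternative would be to drop mmap and use plain `f.read(CHUNK)` on the open file, which only ever holds the current chunk in the process's own memory.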

wapsi commented 4 months ago

Ah, good point. Maybe it's about releasing the used memory then, like you suggested. I noticed that when it scans a large file, it increases the memory footprint while it's reading the file, and after it moves to the next file, it releases the used memory for the finished file.

Centurix / Pantastic, issue #5