WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
929 stars 122 forks source link

Search through an mmap()-ed file? #46

Closed kefir- closed 5 years ago

kefir- commented 7 years ago

Hi,

I'm trying to use this module to search through a file that I've mmap()-ed, following one of the examples. So roughly something like this:

f = open(filename, "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
A = ahocorasick.Automaton()
A.add_word("foo")
A.add_word("bar\x00")
A.add_word("baz")
A.make_automaton()
for item in A.iter(m):
    print(item)

I get this error:

Traceback (most recent call last):
  File "./ac.py", line 15, in func
    for item in A.iter(m):
TypeError: string required

Is this possible? I was hoping to search through large amounts of "unclean" input, including searching for binary values like "bar\x00".

WojciechMula commented 7 years ago

Hi, you have to get string from mmap object, for example using method m.read() which fetch portion of data into string. Fortunately you don't need to read the whole file, please take look at examples from section "AutomatonSearchIter class" (http://pyahocorasick.readthedocs.io/en/latest/#automatonsearchiter-class).

kefir- commented 7 years ago

Thanks for the fast reply! :-)

I should perhaps have added that I'm hoping to process a huge file in one pass without doing too much extra work. In this case I'd like to process a 500GB file and find all instances of a number of binary strings. I can't expect the data to be valid in any particular character encoding, so I'm kind of skeptical to converting the data to strings at all. Also, using read() seems to open a new can of worms regarding potential substrings across the boundary of two read()s.

WojciechMula commented 7 years ago

I don't know mmap API, maybe there is a method which can return raw bytes. The module can't deal with mmap.

But you don't need to worry about strings which could cross read boundaries, as Aho-Corasick automaton is really smart and will report all occurrences of your strings. :)

kefir- commented 7 years ago

Oh right. :) mmap is great, because you can use it to read a file as if it were a buffer, and if you have complex reading patterns (which I don't, I'm just scanning sequentially) the OS handles buffering and paging in/out as efficiently as it can. In my case the buffer is read only, but that should be fine. The OS does any reading and buffering necessary, so it's a brilliant way to scan through a large file without doing lots of read()s. So it ought to be treated just like a buffer.

So in practice, m[0] is the first byte of the file, m[0:10] is the first 10 bytes, etc. I'd think it should work just like a buffer for our purpose? Quick example:

$ cat testmmap.py 
#!/usr/bin/env python3
import mmap
import os
f = open("mmap.file", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
print("m[0] = ", repr(m[0]))
print("m[0:10] = ", repr(m[0:10]))
m.close()
f.close()

$ dd if=/dev/zero of=mmap.file bs=1k count=1
1+0 records in
1+0 records out
1024 bytes (1,0 kB, 1,0 KiB) copied, 0,000278331 s, 3,7 MB/s

$ ./testmmap.py 
m[0] =  0
m[0:10] =  b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
WojciechMula commented 6 years ago

The problem with the current implementation is the way it handles input data. We access internal structures of strings and basically the module deals with bare pointers. Using array facades, like mmap, would require some internal buffering and an another layer for the input stage.

To be honest I'm not keen on making C code more complex, especially when Python code is sufficient -- but of course any PRs are warmly welcomed. :)