codeplea / ahocorasickphp

Aho-Corasick multi-keyword string searching library in PHP.
zlib License
182 stars 16 forks source link
aho-corasick ahocorasick algorithm php search-algorithm string-search

Aho Corasick in PHP

This is a small library which implements the Aho-Corasick string search algorithm.

It's coded in pure PHP and self-contained in a single file, ahocorasick.php.

It's useful when you want to search for many keywords all at once. It's faster than simply calling strpos many times, and it's much faster than calling preg_match_all with something like /keyword1|keyword2|...|keywordn/.

I originally wrote this to use with F5Bot, since it's searching for the same set of a few thousand keywords over and over again.

Usage

It's designed to be really easy to use. You create the ahocorasick object, add your keywords, call finalize() to finish setup, and then search your text. It'll return an array of the keywords found and their position in the search text.

Create, add keywords, and finalize():

require('ahocorasick.php');

$ac = new ahocorasick();

$ac->add_needle('art');
$ac->add_needle('cart');
$ac->add_needle('ted');

$ac->finalize();

Call search() to preform the actual search. It'll return an array of matches.

$found = $ac->search('a carted mart lot one blue ted');
print_r($found);

$found will be an array with these elements:

[0] => Array
    (
        [0] => cart
        [1] => 2
    )
[1] => Array
    (
        [0] => art
        [1] => 3
    )
[2] => Array
    (
        [0] => ted
        [1] => 5
    )
[3] => Array
    (
        [0] => art
        [1] => 10
    )
[4] => Array
    (
        [0] => ted
        [1] => 27
    )

See example.php for a complete example.

Speed

A simple benchmarking program is included which compares various alternatives.

$ php benchmark.php
Loaded 3000 keywords to search on a text of 19377 characters.

Searching with strpos...
time: 0.38440799713135

Searching with preg_match...
time: 5.6817619800568

Searching with preg_match_all...
time: 5.0735609531403

Searching with aho corasick...
time: 0.054709911346436

Note: the regex solutions are actually slightly broken. They won't work if you have a keyword that is a prefix or suffix of another. But hey, who really uses regex when it's not slightly broken?

Also keep in mind that building the search tree (the add_needle() and finalize() calls) takes time. So you'll get the best speed-up if you're reusing the same keywords and calling search() many times.