WojciechMula / pyahocorasick

Python module (C extension and plain Python) implementing the Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars, 122 forks

How do I count the number of occurrences of selected strings in a corpus? #118

Closed tommycarstensen closed 2 years ago

tommycarstensen commented 4 years ago

How can I use pyahocorasick to count the number of occurrences of selected strings in a corpus? Thanks!

WojciechMula commented 4 years ago

I'm not sure what you want to achieve. Let me know if the following example matches your use case.

import ahocorasick

strings = "cat dog kitten fox".split()
corpus = [
    "the fox chases the dog",
    "a kitten is a little, cute cat",
    "dogs bark for no reason",
]

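# Build the automaton: add every search word, then finalize it.
A = ahocorasick.Automaton()
for word in strings:
    A.add_word(word, True)
A.make_automaton()

# A.iter() yields one item per match, so counting the items counts occurrences.
total = 0
for string in corpus:
    for _ in A.iter(string):
        total += 1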

print("words found in the corpus: %s" % total)

pombredanne commented 2 years ago

@tommycarstensen I am closing this for now. I hope @WojciechMula's answer was satisfactory. Please reopen or post a new issue if you have another question. Thanks!