aosingh / lexpy

Python package for lexicon; Trie and DAWG implementation.
GNU General Public License v3.0
55 stars 7 forks source link
dawg directed-acyclic-word-graph graph lexicon suffix-tree trie

Lexpy

lexpy Downloads PyPI version

Python 3.7 Python 3.8 Python 3.9 Python 3.10 Python 3.11 Python 3.12

PyPy3.7 PyPy3.8 PyPy3.9

Install

lexpy can be installed via Python Package Index (PyPI) using pip. The only installation requirement is that you need Python 3.7 or higher.

pip install lexpy

Interface

Interface Description Trie DAWG
Add a single word add('apple', count=2) add('apple', count=2)
Add multiple words add_all(['advantage', 'courage']) add_all(['advantage', 'courage'])
Check if exists? in operator in operator
Search using wildcard expression search('a?b*', with_count=True) search('a?b*, with_count=True)
Search for prefix matches search_with_prefix('bar', with_count=True) search_with_prefix('bar')
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance search_within_distance('apble', dist=1, with_count=True) search_within_distance('apble', dist=1, with_count=True)
Get the number of nodes in the automaton len(trie) len(dawg)

Examples

Trie

Build from an input list, set, or tuple of words.

from lexpy import Trie

trie = Trie()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

trie.add_all(input_words) # You can pass any sequence types or a file-like object here

print(trie.get_word_count())

>>> 48

Build from a file or file path.

In the file, words should be newline separated.


from lexpy import Trie

# Either
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or
with open('/path/to/file.txt', 'r') as infile:
     trie.add_all(infile)

Check if exists using the in operator

print('ampyx' in trie)

>>> True

Prefix search

print(trie.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']

print(trie.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

print(trie.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(trie.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(trie.search('su?t'))

>>> ['suit']

print(trie.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(trie.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Increment word count


trie.add('athie', count=1000)

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]

Directed Acyclic Word Graph (DAWG)

from lexpy import Trie, DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie) # Number of Nodes in Trie
23

dawg.reduce() # Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.

len(dawg) # Number of nodes in DAWG
21

DAWG

The APIs are exactly same as the Trie APIs

Build a DAWG

from lexpy import DAWG
dawg = DAWG()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

dawg.add_all(input_words)
dawg.reduce()

dawg.get_word_count()

>>> 48

Check if exists using the in operator

print('ampyx' in dawg)

>>> True

Prefix search

print(dawg.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']

print(dawg.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

? = 0 or 1 occurance of any character

* = 0 or more occurance of any character

print(dawg.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(dawg.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(dawg.search('su?t'))

>>> ['suit']

print(dawg.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(dawg.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(dawg.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Alphabetical order insertion

If you insert a word which is lexicographically out-of-order, ValueError will be raised.

dawg.add('athie', count=1000)

ValueError

ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>

Increment the word count


dawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)

print(dawg.search('thrill', with_count=True))

>> [('thrill', 20001)]

Special Characters

Special characters, except ? and *, are matched literally.

from lexpy import Trie
t = Trie()
t.add('a©')
t.search('a©')
>> ['a©']
t.search('a?')
>> ['a©']
t.search('?©')
>> ['a©']

Trie vs DAWG

Number of nodes comparison

Build time comparison

Future Work

These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.

Fun Facts

  1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. So for all english words, the search time is bounded by O(45).
  2. The longest technical word(not in dictionary) is the name of a protein called as titin. It has 189,819 letters and it is disputed whether it is a word.