A lexicon is a data structure that stores a set of words. Unlike a dictionary, a lexicon has no values associated with the words.
A lexicon is similar to a list or a set of words, but its internal representation is different and optimized for fast searches of words, prefixes, and wildcard patterns.
Specifically, looking up a word takes O(W) time, where W is the length of the word.
Two important lexicon data structures are the Trie and the Directed Acyclic Word Graph (DAWG).
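To see why lookups are O(W), consider a minimal trie sketch (this is an illustration, not lexpy's actual implementation): a search follows one edge per character, so it visits at most W nodes regardless of how many words are stored.

```python
class TrieNode:
    __slots__ = ('children', 'is_word')

    def __init__(self):
        self.children = {}   # one edge per character
        self.is_word = False

class MiniTrie:
    """Minimal trie: add and lookup are O(W) in the word length W."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = MiniTrie()
t.add('apple')
print('apple' in t)  # True
print('app' in t)    # False: 'app' is a stored prefix, not a stored word
```

Note that a prefix of a stored word is not itself a match unless it was added explicitly.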
lexpy can be installed from the Python Package Index (PyPI) using pip. The only requirement is Python 3.7 or higher.

```shell
pip install lexpy
```
| Interface Description | Trie | DAWG |
|---|---|---|
| Add a single word | `add('apple', count=2)` | `add('apple', count=2)` |
| Add multiple words | `add_all(['advantage', 'courage'])` | `add_all(['advantage', 'courage'])` |
| Check if a word exists | `in` operator | `in` operator |
| Search using a wildcard expression | `search('a?b*', with_count=True)` | `search('a?b*', with_count=True)` |
| Search for prefix matches | `search_with_prefix('bar', with_count=True)` | `search_with_prefix('bar')` |
| Search for similar words within a given edit distance (here, edit distance means Levenshtein distance) | `search_within_distance('apble', dist=1, with_count=True)` | `search_within_distance('apble', dist=1, with_count=True)` |
| Get the number of nodes in the automaton | `len(trie)` | `len(dawg)` |
```python
from lexpy import Trie

trie = Trie()
input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
               'tank', 'common', 'lonely', 'likeable', 'language', 'shock', 'look', 'pet', 'dime', 'small',
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']
trie.add_all(input_words)  # You can pass any sequence type or a file-like object here
print(trie.get_word_count())
>>> 50
```
Words can also be loaded from a file; in the file, words should be newline-separated.

```python
from lexpy import Trie

# Either pass the file path directly
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or pass a file-like object
with open('/path/to/file.txt', 'r') as infile:
    trie.add_all(infile)
```
Check if a word exists using the `in` operator:

```python
print('ampyx' in trie)
>>> True
```
Search for prefix matches:

```python
print(trie.search_with_prefix('ab'))
>>> ['abhor', 'abuzz']
print(trie.search_with_prefix('ab', with_count=True))
>>> [('abuzz', 2), ('abhor', 1)]
```
Search using the wildcard characters `?` and `*`:

- `?` = 0 or 1 occurrence of any character
- `*` = 0 or more occurrences of any character
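Note that `?` matches zero *or* one character, unlike shell globbing where it matches exactly one. As an illustrative sketch (not lexpy's internals), the pattern semantics can be expressed as a regex translation:

```python
import re

def wildcard_to_regex(pattern):
    """Translate the wildcard pattern to a regex:
    '?' = zero or one of any character, '*' = zero or more."""
    parts = []
    for ch in pattern:
        if ch == '?':
            parts.append('.?')
        elif ch == '*':
            parts.append('.*')
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile(''.join(parts))

words = ['suit', 'sort', 'st', 'sport']
rx = wildcard_to_regex('s??t')
print([w for w in words if rx.fullmatch(w)])  # ['suit', 'sort', 'st']
```

Because each `?` may match nothing, `'s??t'` matches `'st'` as well as four-letter words.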
```python
print(trie.search('a*o*'))
>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']
print(trie.search('a*o*', with_count=True))
>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]
print(trie.search('su?t'))
>>> ['suit']
print(trie.search('su?t', with_count=True))
>>> [('suit', 1)]
```
Search for similar words within a given edit distance (Levenshtein distance):

```python
print(trie.search_within_distance('arie', dist=2))
>>> ['athie', 'arbil', 'auric']
print(trie.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 3), ('arbil', 1), ('auric', 1)]
trie.add('athie', count=1000)
print(trie.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]
```
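For reference, a minimal Levenshtein distance implementation (an illustrative sketch, not lexpy's internals) confirms why these words fall within distance 2 of 'arie':

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

print(levenshtein('arie', 'athie'))  # 2: substitute r->t, insert h
print(levenshtein('arie', 'auric'))  # 2: insert u, substitute e->c
```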
A DAWG supports the same set of operations as a Trie, but the number of nodes in a DAWG is always less than or equal to the number of nodes in a Trie.
Both are deterministic finite state automata; a DAWG is a minimized version of the Trie DFA.
A Trie removes prefix redundancy; a DAWG removes both prefix and suffix redundancy.
In the current implementation of DAWG, words must be inserted in alphabetical order.
The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115
```python
from lexpy import Trie, DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie)  # Number of nodes in the Trie
23
dawg.reduce()  # Perform DFA minimization. Call this every time a chunk of words is added to the DAWG.
len(dawg)  # Number of nodes in the DAWG
21
```
The DAWG APIs are exactly the same as the Trie APIs.
```python
from lexpy import DAWG

dawg = DAWG()
input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
               'tank', 'common', 'lonely', 'likeable', 'language', 'shock', 'look', 'pet', 'dime', 'small',
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']
dawg.add_all(input_words)
dawg.reduce()
dawg.get_word_count()
>>> 50
```
Check if a word exists using the `in` operator:

```python
print('ampyx' in dawg)
>>> True
```
Search for prefix matches:

```python
print(dawg.search_with_prefix('ab'))
>>> ['abhor', 'abuzz']
print(dawg.search_with_prefix('ab', with_count=True))
>>> [('abuzz', 2), ('abhor', 1)]
```
Search using the wildcard characters `?` and `*`:

- `?` = 0 or 1 occurrence of any character
- `*` = 0 or more occurrences of any character
```python
print(dawg.search('a*o*'))
>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']
print(dawg.search('a*o*', with_count=True))
>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]
print(dawg.search('su?t'))
>>> ['suit']
print(dawg.search('su?t', with_count=True))
>>> [('suit', 1)]
print(dawg.search_within_distance('arie', dist=2))
>>> ['athie', 'arbil', 'auric']
print(dawg.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 3), ('arbil', 1), ('auric', 1)]
```
If you insert a word that is lexicographically out of order, a `ValueError` will be raised.

```python
dawg.add('athie', count=1000)
```

```
ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>
```

```python
dawg.add_all(['thrill']*20000)  # or dawg.add('thrill', count=20000)
print(dawg.search('thrill', with_count=True))
>>> [('thrill', 20001)]
```
Special characters, except `?` and `*`, are matched literally.

```python
from lexpy import Trie

t = Trie()
t.add('a©')

t.search('a©')
>>> ['a©']
t.search('a?')
>>> ['a©']
t.search('?©')
>>> ['a©']
```
These are some ideas I would love to work on next, in that order. Pull requests and discussions are welcome.