The dictosaurus package provides natural language processing (NLP) utilities for information retrieval systems. It includes dictionary, thesaurus and term expansion utilities.
Refer to the references to learn more about information retrieval systems.
In the pubspec.yaml of your Flutter project, add the following dependency:
dependencies:
  dictosaurus: <latest_version>
In your code file add the following imports:
// import the core interfaces, classes and mixins of the `dictosaurus` library
import 'package:dictosaurus/dictosaurus.dart';
// import the typedefs library to use types defined in the `dictosaurus` package.
import 'package:dictosaurus/type_definitions.dart';
Use of the DictoSaurus is demonstrated below.
// define a term with incorrect spelling.
final misspeltTerm = 'appel';
// define a correctly spelled term.
final term = 'swim';
// get a Dictosaurus instance from an implementation class (not shown here)
final dictoSaurus = await getDictoSaurus();
// get spelling correction suggestions
final corrections = await dictoSaurus.suggestionsFor(misspeltTerm, 5);
// expand the term
final expansions = await dictoSaurus.expandTerm(term, 5);
// get the dictionary entry for the term
final entry = await dictoSaurus.getEntry(term);
// get the synonyms for all parts of speech
final allSynonyms = entry.synonymsOf();
// get the synonyms when used as a verb
final synonyms = entry.synonymsOf(PartOfSpeech.verb);
// get the antonyms
final antonyms = entry.antonymsOf();
// get the inflections
final inflections = entry.inflectionsOf();
// get the phrases
final phrases = entry.phrasesWith();
Please refer to the API documentation.
The DictionaryEntry interface is an object model for a term or word with immutable properties (term, stem, lemma, language). The DictionaryEntry interface also enumerates variants of the term with different values for part-of-speech, definition, etymology, pronunciation, synonyms, antonyms and inflections, each with one or more example phrases.
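Three interfaces provide dictionary, thesaurus and term expansion functions:
- the Dictionary interface returns a dictionary entry for a term, or a translation of a term;
- the AutoCorrect interface returns spelling suggestions for a term or terms that start with the same characters; and
- the DictoSaurus interface implements the Dictionary and AutoCorrect interfaces.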
The DictoSaurus interface also exposes the expandTerm method that performs term-expansion, returning a list of terms in descending order of relevance (best match first). The (expanded) list of terms includes the term, its synonyms (if any) and spelling correction suggestions.
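The shape of the expanded list can be illustrated with a minimal sketch. The helper below is hypothetical and is not the package's expandTerm implementation; it assumes the term ranks first, followed by its synonyms and then the spelling suggestions, with duplicates removed.

```dart
/// Hypothetical helper (not part of the dictosaurus API): merges a term,
/// its synonyms and spelling suggestions into one ordered list.
List<String> expandTermSketch(
  String term, {
  Iterable<String> synonyms = const [],
  Iterable<String> suggestions = const [],
  int limit = 5,
}) {
  // A set literal keeps insertion order and drops duplicates, so the
  // assumed ranking (term, synonyms, suggestions) is preserved.
  final expanded = <String>{term, ...synonyms, ...suggestions};
  // Return at most [limit] terms, best match first.
  return expanded.take(limit).toList();
}
```

A real implementation would rank candidates with a relevance measure rather than by source order; the ordering here is only an assumption for illustration.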
We use an interface > implementation mixin > base-class > implementation class pattern:
- the interface is an abstract class that exposes fields and methods but contains no implementation code. The interface may expose a factory constructor that returns an implementation class instance;
- the implementation mixin implements the interface class methods, but not the input fields; and
- the base-class is an abstract class with the implementation mixin and exposes a default, unnamed generative const constructor for sub-classes. The intention is that implementation classes extend the base class, overriding the interface input fields with final properties passed in via a const generative constructor.
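The pattern above can be sketched with a small, self-contained example. The Greeter classes are hypothetical and exist only to show the roles of the four layers; they are not part of the package.

```dart
/// Interface: exposes fields and methods, contains no implementation code.
abstract class Greeter {
  /// Input field, overridden by implementation classes.
  String get name;

  /// Method implemented by [GreeterMixin].
  String greet();

  /// Factory constructor that returns an implementation class instance.
  factory Greeter(String name) = _GreeterImpl;
}

/// Implementation mixin: implements the methods, but not the input fields.
mixin GreeterMixin implements Greeter {
  @override
  String greet() => 'Hello, $name!';
}

/// Base class: abstract, applies the mixin and exposes a default,
/// unnamed generative const constructor for sub-classes.
abstract class GreeterBase with GreeterMixin {
  const GreeterBase();
}

/// Implementation class: extends the base class and overrides the input
/// field with a final property passed in via a const constructor.
class _GreeterImpl extends GreeterBase {
  @override
  final String name;

  const _GreeterImpl(this.name);
}
```

Calling `Greeter('Ada').greet()` in this sketch returns `Hello, Ada!`. Callers depend only on the interface, while new implementations need to supply nothing but the input fields.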
The class naming convention for this pattern is "Interface" > "InterfaceMixin" > "InterfaceBase".
The following definitions are used throughout the documentation:
corpus
- the collection of documents for which an index is maintained.
character filter
- filters characters from text in preparation for tokenization.
Damerau–Levenshtein distance
- a metric for measuring the edit distance between two terms by counting the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one term into the other (from Wikipedia).
dictionary (in an index)
- a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
document
- a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
document frequency (dFt)
- the number of documents in the corpus that contain a term.
edit distance
- a measure of how dissimilar two terms are by counting the minimum number of operations required to transform one string into the other (from Wikipedia).
etymology
- the study of the history of the form of words and, by extension, the origin and evolution of their semantic meaning across time (from Wikipedia).
Flesch reading ease score
- a readability measure calculated from sentence length and word length on a 100-point scale. The higher the score, the easier it is to understand the document (from Wikipedia).
Flesch-Kincaid grade level
- a readability measure relative to U.S. school grade level. It is also calculated from sentence length and word length (from Wikipedia).
IETF language tag
- a standardized code or tag that is used to identify human languages on the Internet (from Wikipedia).
index
- an inverted index used to look up document references from the corpus against a vocabulary of terms.
index-elimination
- selecting a subset of the entries in an index where the term is in the collection of terms in a search phrase.
inverse document frequency (iDft)
- a normalized measure of how rare a term is in the corpus. It is defined as log (N / dFt), where N is the total number of documents in the corpus and dFt is the document frequency of the term. The iDft of a rare term is high, whereas the iDft of a frequent term is likely to be low.
Jaccard index
- measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets (from Wikipedia).
JSON
- an acronym for "JavaScript Object Notation", a common format for persisting data, typically represented in Dart as Map<String, dynamic>.
k-gram
- a sequence of (any) k consecutive characters from a term. A k-gram can start with "$", denoting the start of the term, and end with "$", denoting the end of the term. The 3-grams for "castle" are { $ca, cas, ast, stl, tle, le$ }.
lemma or lemmatizer
- lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from Wikipedia).
Natural language processing (NLP)
- a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data (from Wikipedia).
Part-of-Speech (PoS) tagging
- the task of labelling every word in a sequence of words with a tag indicating what lexical syntactic category it assumes in the given sequence (from Wikipedia).
Phonetic transcription
- the visual representation of speech sounds (or phones) by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet (from Wikipedia).
postings
- a separate index that records which documents the vocabulary occurs in. In a positional index, the postings also record the positions of each term in the text to create a positional inverted index.
postings list
- a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text. In a zoned index, the postings lists record the positions of each term in the text of a zone.
stem or stemmer
- stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form) (from Wikipedia).
stopwords
- common words in a language that are excluded from indexing.
term
- a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
term filter
- filters unwanted terms from a collection of terms (e.g. stopwords), breaks compound terms into separate terms and / or manipulates terms by invoking a stemmer and / or lemmatizer.
term expansion
- finding terms with similar spelling (e.g. spelling correction) or synonyms for a term.
term frequency (Ft)
- the frequency of a term in an index or indexed object.
term position
- the zero-based index of a term in an ordered array of terms tokenized from the corpus.
text
- the indexable content of a document.
token
- representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) (term position) in the text or frequency of occurrence (term frequency).
token filter
- returns a subset of tokens from the tokenizer output.
tokenizer
- a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
vocabulary
- the collection of terms indexed from the corpus.
zone
- the field or zone of a document that a term occurs in, used for parametric indexes or where scoring and ranking of search results attribute a higher score to documents that contain a term in a specific zone (e.g. the title rather than the body of a document).
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don't respond immediately to issues or pull requests.
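To make three of the definitions above more concrete (k-gram, Jaccard index and Damerau–Levenshtein distance), here is a minimal sketch. The function names kGrams, jaccard and editDistance are hypothetical and are not part of the package API. The edit distance shown is the restricted "optimal string alignment" variant, which counts a transposition of two adjacent characters as one operation.

```dart
import 'dart:math' as math;

/// Returns the k-grams of [term], padded with "$" at the start and end.
Set<String> kGrams(String term, [int k = 3]) {
  final padded = '\$$term\$';
  return {
    for (var i = 0; i + k <= padded.length; i++) padded.substring(i, i + k),
  };
}

/// Jaccard index: size of the intersection divided by size of the union.
double jaccard(Set<String> a, Set<String> b) {
  final union = a.union(b).length;
  return union == 0 ? 0.0 : a.intersection(b).length / union;
}

/// Edit distance counting insertions, deletions, substitutions and
/// transpositions of adjacent characters (optimal string alignment).
int editDistance(String a, String b) {
  // d[i][j] is the distance between the first i characters of a
  // and the first j characters of b.
  final d = List.generate(
      a.length + 1, (_) => List<int>.filled(b.length + 1, 0));
  for (var i = 0; i <= a.length; i++) {
    d[i][0] = i;
  }
  for (var j = 0; j <= b.length; j++) {
    d[0][j] = j;
  }
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      final cost = a[i - 1] == b[j - 1] ? 0 : 1;
      var best = math.min(
        d[i - 1][j] + 1, // deletion
        math.min(
          d[i][j - 1] + 1, // insertion
          d[i - 1][j - 1] + cost, // substitution
        ),
      );
      // transposition of two adjacent characters
      if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
        best = math.min(best, d[i - 2][j - 2] + 1);
      }
      d[i][j] = best;
    }
  }
  return d[a.length][b.length];
}
```

With these helpers, `kGrams('castle')` yields the six 3-grams listed in the k-gram definition, and the misspelt "appel" is one transposition away from "apple" while sharing two of eight distinct 3-grams with it (a Jaccard index of 0.25).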