These tools are being developed in Julia with the goal of making them fast, generic, and easy to use from Julia's REPL. To install the TextMining package:
julia> Pkg.clone("https://github.com/SLU-TMI/TextMining.jl.git")
These tools use the bag-of-words model and the hashing trick to convert texts into feature vectors. Feature vectors live in an infinite-dimensional vector space, which is referred to as the feature space. To keep calculations efficient, any dimension in which a feature vector has value 0 is removed from that vector's hashtable. FeatureSpace is defined as an abstract type with three subtypes: FeatureVector, Cluster, and DataSet.
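The vectorization described above can be sketched in plain Julia without the package. The helper names `bag_of_words` and `hashed_bag_of_words` are hypothetical, and how TextMining.jl hashes features internally is not stated here, so the bucketed variant shows only the textbook form of the hashing trick. The sketch uses current `Dict(...)` syntax rather than the older literal syntax seen in the REPL examples.

```julia
# Hypothetical helper, not part of TextMining.jl: count word occurrences and
# store them sparsely, so words that never occur take up no space.
function bag_of_words(text::AbstractString)
    counts = Dict{String,Int}()
    for m in eachmatch(r"\w+", lowercase(text))
        word = String(m.match)
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

# Hashing-trick variant (assumed form): each word is mapped to one of
# `nbuckets` indices by hashing, so no vocabulary has to be stored.
function hashed_bag_of_words(text::AbstractString, nbuckets::Integer)
    counts = Dict{Int,Int}()
    for m in eachmatch(r"\w+", lowercase(text))
        bucket = Int(hash(String(m.match)) % UInt(nbuckets)) + 1
        counts[bucket] = get(counts, bucket, 0) + 1
    end
    return counts
end

bow = bag_of_words("the cat sat on the mat")
hashed = hashed_bag_of_words("the cat sat on the mat", 16)
```

In `bow`, "the" maps to 2 and a word such as "dog" has no entry at all, which is the sparse storage the paragraph describes. In `hashed`, distinct words may collide in the same bucket; that is the usual trade-off of the hashing trick.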
The FeatureVector type wraps a Dictionary (hashtable) and restricts its key => value mappings to Any => Number, where Any and Number are Julia types and any of their subtypes may be used.
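One way such a restriction can be expressed is with a parametric type whose value parameter is bounded by Number. The sketch below is an assumption about the general shape, not the package's actual definition; `SketchSpace`, `SketchVector`, and the field name `data` are made up for illustration, and the syntax is that of current Julia.

```julia
# Hypothetical stand-ins for FeatureSpace and FeatureVector.
abstract type SketchSpace end

struct SketchVector{K,V<:Number} <: SketchSpace
    data::Dict{K,V}   # keys may be anything, values must be numeric
end

# Empty vector with the widest permitted key and value types.
SketchVector() = SketchVector(Dict{Any,Number}())

# Forward the Dict operations used in the examples.
Base.getindex(v::SketchVector, key) = v.data[key]
Base.setindex!(v::SketchVector, value, key) = (v.data[key] = value)
Base.length(v::SketchVector) = length(v.data)

sv = SketchVector(Dict("word1" => 2, "word2" => 4))
sv["word1"] = 4
```

With this bound in place, `SketchVector(Dict("a" => "b"))` fails with a MethodError because String is not a subtype of Number.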
Constructing an empty FeatureVector:
julia> fv = FeatureVector()
A FeatureVector can also be constructed using an existing Dictionary:
julia> dict = ["word1"=>2, "word2"=>4]
julia> fv = FeatureVector(dict)
or
julia> fv = FeatureVector(["word1"=>2, "word2"=>4])
Modifying elements of a FeatureVector:
julia> fv["word1"] = 4
Accessing elements of a FeatureVector:
julia> fv["word1"]
4
Addition and subtraction between two FeatureVectors:
julia> fv1 = FeatureVector(["word1" => 2, "word2" => 4])
julia> fv2 = FeatureVector(["word1" => 4, "word2" => 2])
julia> fv1+fv2
FeatureVector{ASCIIString,Int64}(["word1"=>6,"word2"=>6])
julia> fv1-fv2
FeatureVector{ASCIIString,Int64}(["word1"=>-2,"word2"=>2])
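The arithmetic above can be pictured as a key-wise operation over two hashtables. The following is a minimal sketch on plain Dicts in current Julia syntax; `combine` is a hypothetical helper, and dropping zero results is assumed from the statement that zero-valued dimensions are removed from the hashtable.

```julia
# Element-wise combination of two sparse vectors stored as Dicts.
# A key missing from either side counts as 0, and zero results are dropped
# so the table stays sparse. Hypothetical helper, not the package's code.
function combine(op, a::Dict{K,V}, b::Dict{K,V}) where {K,V<:Number}
    result = Dict{K,V}()
    for key in union(keys(a), keys(b))
        value = op(get(a, key, zero(V)), get(b, key, zero(V)))
        if !iszero(value)
            result[key] = value
        end
    end
    return result
end

a = Dict("word1" => 2, "word2" => 4)
b = Dict("word1" => 4, "word2" => 2, "word3" => 1)
total = combine(+, a, b)        # word1 => 6, word2 => 6, word3 => 1
difference = combine(-, a, a)   # empty: every dimension cancels to 0
```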
Multiplication and division by a scalar:
julia> fv = FeatureVector(["word1" => 1, "word2" => 3])
julia> fv*3
FeatureVector{ASCIIString,Int64}(["word1"=>3,"word2"=>9])
julia> fv/3
FeatureVector{Any,Float64}(["word1"=>0.3333333333333333,"word2"=>1.0])
If a FeatureVector holds Integer values, it can be converted to Rationals by dividing with //:
julia> fv//3
FeatureVector{ASCIIString,Rational{T<:Integer}}(["word1"=>1//3,"word2"=>1//1])
but raises an error otherwise:
julia> fv = FeatureVector(["word1" => 1.0, "word2" => 3.0])
julia> fv//3
ERROR: `//` has no method matching //(::Float64, ::Int64)
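The differing result types follow from Julia's own operators rather than anything specific to the package. As a rough illustration on plain Dicts (current Julia syntax, with `scale` being a hypothetical helper):

```julia
# Hypothetical helper: apply a scalar operation to every stored value.
scale(op, v::Dict, s::Number) = Dict(k => op(x, s) for (k, x) in v)

v = Dict("word1" => 1, "word2" => 3)
tripled  = scale(*, v, 3)    # Int * Int stays Int
divided  = scale(/, v, 3)    # Int / Int gives Float64
rational = scale(//, v, 3)   # Int // Int gives Rational
```

Because `//` is only defined for Integer (and Rational) operands, calling `scale(//, Dict("word1" => 1.0), 3)` fails with a MethodError, which matches the error shown above.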
Other functions defined for FeatureVectors:
keys(fv)
values(fv)
haskey(fv, key)
isempty(fv)
length(fv)
freq_list(fv, expression = (a,b) -> a[2]>b[2])
add!(fv1, fv2)
subtract!(fv1, fv2)
multiply!(fv, value)
divide!(fv, value)
rationalize!(fv, value)
dist_cos(fv1, fv2)
dist_zero(fv1, fv2)
dist_taxicab(fv1, fv2)
dist_euclidean(fv1, fv2)
dist_infinite(fv1, fv2)
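The names of the distance functions suggest the L0, L1, L2, and L-infinity distances plus a cosine distance. Their exact definitions in the package are not given here, so the sketch below is an assumption: it works on plain Dicts in current Julia syntax, the `sketch_*` names are hypothetical, cosine distance is taken as 1 minus cosine similarity, and non-empty, non-zero inputs are assumed.

```julia
# Difference between two sparse vectors on every dimension either one uses.
# Hypothetical helpers; a key absent from a Dict is treated as 0.
differences(a::Dict, b::Dict) =
    [get(a, k, 0) - get(b, k, 0) for k in union(keys(a), keys(b))]

sketch_taxicab(a, b)   = sum(abs, differences(a, b))          # L1
sketch_euclidean(a, b) = sqrt(sum(abs2, differences(a, b)))   # L2
sketch_infinite(a, b)  = maximum(abs, differences(a, b))      # L-infinity
sketch_zero(a, b)      = count(!iszero, differences(a, b))    # number of differing dimensions

# Cosine distance, assumed here to be 1 - cosine similarity.
function sketch_cos(a::Dict, b::Dict)
    dot = sum(x * get(b, k, 0) for (k, x) in a)
    norm(v) = sqrt(sum(abs2, values(v)))
    return 1 - dot / (norm(a) * norm(b))
end

p = Dict("word1" => 3, "word2" => 4)
q = Dict("word3" => 1)
```

For `p` and `q`, which share no words, the taxicab distance is 8, the infinite distance is 4, all 3 dimensions differ, and the cosine distance is 1.0.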
The Cluster type is also a Dictionary container, but it restricts mappings to Any => FeatureVector (and subtypes). This lets users attach meaningful labels to groups of FeatureVectors for classification. A Cluster also computes the centroid of the FeatureVectors it holds.
An empty Cluster can be constructed like so:
julia> cl = Cluster()
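How the package computes a Cluster's centroid is not shown here. A common definition is the component-wise mean of the member vectors, which can be sketched on plain Dicts as follows (current Julia syntax; `centroid` and the variable names are hypothetical):

```julia
# Hypothetical helper: the centroid of a group of sparse vectors, taken as
# their component-wise mean with absent keys counting as 0.
function centroid(vectors)
    sums = Dict{Any,Float64}()
    for v in vectors, (k, x) in v
        sums[k] = get(sums, k, 0.0) + x
    end
    n = length(vectors)
    return Dict(k => s / n for (k, s) in sums)
end

# A cluster sketched as label => vector, mirroring the Any => FeatureVector mapping.
cluster = Dict(
    "doc1" => Dict("word1" => 2, "word2" => 4),
    "doc2" => Dict("word1" => 4),
)
c = centroid(values(cluster))   # word1 => 3.0, word2 => 2.0
```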
The DataSet type is also a wrapper around a Dictionary, but it restricts mappings to Any => Cluster (and subtypes).
An empty DataSet can be constructed like so:
julia> ds = DataSet()
The Distribution type is a container that enforces the axioms of probability on its contents.
An empty Distribution can be constructed like so:
julia> d = Distribution()
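What enforcing the axioms involves is not spelled out here. At a minimum it means that every probability is non-negative and that the probabilities sum to 1. The sketch below shows one way to get there from raw counts; `to_distribution` is a hypothetical helper on plain Dicts in current Julia syntax and is not the package's Distribution type.

```julia
# Hypothetical helper: turn non-negative counts into probabilities summing to 1.
function to_distribution(counts::Dict)
    isempty(counts) && throw(ArgumentError("counts must not be empty"))
    all(x -> x >= 0, values(counts)) ||
        throw(ArgumentError("counts must be non-negative"))
    total = sum(values(counts))
    total > 0 || throw(ArgumentError("at least one count must be positive"))
    return Dict(k => x / total for (k, x) in counts)
end

probs = to_distribution(Dict("word1" => 1, "word2" => 3))   # word1 => 0.25, word2 => 0.75
```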