TextMining.jl

This package is a set of tools being used by Saint Louis University to facilitate interdisciplinary research of how time passage affects language using data mining, machine learning, and natural language processing techniques.

For further information, contact project leader Lauren Kersey at kersey@slu.edu.

These tools are being developed in Julia with the goal of making them fast, generic, and easily usable in Julia's REPL. Installing the TextMining package is done like so:

Pkg.clone("https://github.com/SLU-TMI/TextMining.jl.git")

Feature Space Model
1. Feature Vector
2. Cluster
3. DataSet
Clustering
1. k Means
2. Hierarchical
Classification
1. Proximity Based Classification
2. k Nearest Neighbors
3. Probability Based Classification
4. Distribution
5. Naive Bayes
Text Processing

Feature Space Model

These tools will utilize the bag-of-words model and the hashing trick to vectorize texts into feature vectors. Feature vectors exist in an infinite dimensional vector space which is refered to as the feature space. In order to optimize calculations, dimensions where the feature vector has value 0 are removed from the feature vector's hashtable. We are defining FeatureSpace to be an abstract type which has 3 subtypes: FeatureVector, Cluster, and DataSet.

Feature Vector

The FeatureVector type is a container for a Dictionary (hashtable) that restricts key => value mappings to Any => Number mappings, where Any and Number are Julia types, or their subtypes.

Using FeatureVector:

Constructing an empty FeatureVector:

julia> fv = FeatureVector()

A FeatureVector can also be constructed using an existing Dictionary:

julia> dict = ["word1"=>2, "word2"=>4]
julia> fv = FeatureVector(dict)

julia> fv = FeatureVector(["word1"=>2, "word2"=>4])

Modifying elements of a FeatureVector:

julia> fv["word1"] = 4

Accessing elements of a FeatureVector:

julia> fv["word1"]
4

Addition and subtraction between two FeatureVectors:

julia> fv1 = FeatureVector(["word1" => 2, "word2" => 4])
julia> fv2 = FeatureVector(["word1" => 4, "word2" => 2])
julia> fv1+fv2
FeatureVector{ASCIIString,Int64}(["word1"=>6,"word2"=>6])

julia> fv1-fv2
FeatureVector{ASCIIString,Int64}(["word1"=>-2,"word2"=>2])

Multiplication and division by a scalar:

julia> fv = FeatureVector(["word1" => 1, "word2" => 3])
julia> fv*3
FeatureVector{ASCIIString,Int64}(["word1"=>3,"word2"=>9])

julia> fv/3
FeatureVector{Any,Float64}(["word1"=>0.3333333333333333,"word2"=>1.0])

If a FeatureVector contains Integer value types it can be rationalized by a divisor:

julia> fv//3
FeatureVector{ASCIIString,Rational{T<:Integer}}(["word1"=>1//3,"word2"=>1//1])

but returns an error otherwise:

julia> fv = FeatureVector(["word1" => 1.0, "word2" => 3.0])
julia> fv//3
ERROR: `//` has no method matching //(::Float64, ::Int64)

FeatureVector Functions:

keys(fv)

Returns an Iterator to the keys in fv.
```
values(fv)
```
Returns an Iterator to the values in fv.
```
haskey(fv, key)
```
Checks fv for key and returns true if found or false if not present.
```
isempty(fv)
```
Returns false if fv contains any elements, true otherwise.
```
length(fv)
```

Returns the number of elements in fv.

freq_list(fv, expression = (a,b) -> a[2]>b[2])

Returns a frequency list represented by an Array of (key,value) tuples sorted using the provided boolean expression. If an expression is not passed in, the Array will be sorted by largest value.
```
add!(fv1, fv2)
```
In place addition. Modifies fv1 by adding fv2 to it.
```
subtract!(fv1, fv2)
```
In place subtraction. Modifies fv1 by subtracting fv2 from it.
```
multiply!(fv, value)
```
In place multiplication. Scales fv by value.
```
divide!(fv, value)
```
In place division. Divides fv by value.
```
rationalize!(fv, value)
```
In place rationalization. Rationalizes fv by value.
```
dist_cos(fv1, fv2)
```
Returns 1-cosine similarity between two feature vectors. If the angle between fv1 and fv2 is 0, the function will return 0. If fv1 and fv2 are orthogonal, meaning they share no features, the function will return 1. Otherwise the function returns values between 0 and 1. Note: The zero vector is both parallel and orthogonal to every vector, as such cos_dist(fv, zero_vector) will return NaN (not a number).
```
dist_zero(fv1, fv2)
```
Derived from the L0 Norm, this function returns the number of non-zero elements that differ between fv1 and fv2.
```
dist_taxicab(fv1, fv2)
```
Derived from the L1 Norm and also know as the Manhattan distance, this function returns the sum of the absolute difference between fv1 and fv2.
```
dist_euclidean(fv1, fv2)
```
Returns the standard distance between fv1 and fv2.
```
dist_infinite(fv1, fv2)
```
Derived from the L∞ Norm and often referd to as the Chebyshev distance, this function returns the maximum absolute difference between any feature in fv1 or fv2.

Cluster

The Cluster type is also a Dictionary container. However, it restricts mappings to Any => FeatureVector types and subtypes. This allows users to meaningfully label groups of FeatureVectors for Classification. The Cluster type also computes the centroid of the set.

An empty Cluster can be constructed as so:

cl = Cluster()

DataSet

The DataSet type is also a wrapper around a Dictionary. However, it restricts mappings to Any => Cluster types and subtypes.

An empty DataSet can be constructed as so:

ds = DataSet()

Clustering

k Means

Hierarchical

Classification

Proximity Based Classification

k Nearest Neighbors

Probability Based Classification

Distribution

The Distribution type is a container which ensures the axioms of probability.

An empty Distribution can be constructed as so:

ds = Distribution()

SLU-TMI / TextMining.jl

readme