TextMining.jl

This package is a set of tools used at Saint Louis University to support interdisciplinary research into how the passage of time affects language, using data mining, machine learning, and natural language processing techniques.

For further information, contact project leader Lauren Kersey at kersey@slu.edu.

These tools are being developed in Julia with the goal of making them fast, generic, and easy to use from Julia's REPL. Install the TextMining package like so:

Pkg.clone("https://github.com/SLU-TMI/TextMining.jl.git")

Table of Contents

  1. Feature Space Model
    1. Feature Vector
    2. Cluster
    3. DataSet
  2. Clustering
    1. k Means
    2. Hierarchical
  3. Classification
    1. Proximity Based Classification
    2. k Nearest Neighbors
    3. Probability Based Classification
    4. Distribution
    5. Naive Bayes
  4. Text Processing

Feature Space Model

These tools use the bag-of-words model and the hashing trick to vectorize texts into feature vectors. Feature vectors exist in an infinite-dimensional vector space, which is referred to as the feature space. To optimize calculations, dimensions where the feature vector has value 0 are removed from the feature vector's hashtable. FeatureSpace is defined as an abstract type with 3 subtypes: FeatureVector, Cluster, and DataSet.
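The sparse bag-of-words storage described above can be sketched with a plain Dict. This is only an illustration written for current Julia (1.x), not the package's implementation; the helper names bag_of_words and value are hypothetical and do not come from TextMining.jl.

```julia
# Hypothetical helper, not part of TextMining.jl: count word occurrences
# in a text, storing only the words that actually appear.
function bag_of_words(text::AbstractString)
    counts = Dict{String,Int}()
    for word in split(lowercase(text))
        key = String(word)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

# Hypothetical helper: a dimension that was never stored reads as 0,
# so absent words take up no space in the hashtable.
value(counts::Dict{String,Int}, word::AbstractString) = get(counts, word, 0)

bow = bag_of_words("the cat saw the dog")
println(bow["the"])          # 2
println(value(bow, "bird"))  # 0
```

Only four keys are stored for this sentence, even though the feature space has a dimension for every possible word.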

Feature Vector

The FeatureVector type is a container for a Dictionary (hashtable) that restricts key => value mappings to Any => Number mappings, where Any and Number are Julia types, or their subtypes.

Using FeatureVector:

Constructing an empty FeatureVector:

julia> fv = FeatureVector()

A FeatureVector can also be constructed using an existing Dictionary:

julia> dict = ["word1"=>2, "word2"=>4]
julia> fv = FeatureVector(dict)

or

julia> fv = FeatureVector(["word1"=>2, "word2"=>4])

Modifying elements of a FeatureVector:

julia> fv["word1"] = 4

Accessing elements of a FeatureVector:

julia> fv["word1"]
4

Addition and subtraction between two FeatureVectors:

julia> fv1 = FeatureVector(["word1" => 2, "word2" => 4])
julia> fv2 = FeatureVector(["word1" => 4, "word2" => 2])
julia> fv1+fv2
FeatureVector{ASCIIString,Int64}(["word1"=>6,"word2"=>6])

julia> fv1-fv2
FeatureVector{ASCIIString,Int64}(["word1"=>-2,"word2"=>2])
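Element-wise addition and subtraction over sparse vectors can be sketched on plain dictionaries as follows. This is a minimal illustration in current Julia (1.x) syntax, not the package's code: the combine helper is hypothetical, and dropping zero-valued results is an assumption based on the Feature Space Model description above.

```julia
# Hypothetical helper, not the package's implementation: combine two sparse
# vectors key by key with `op`, treating missing keys as 0 and dropping
# any dimension whose result is 0.
function combine(op, a::Dict{String,Int}, b::Dict{String,Int})
    result = Dict{String,Int}()
    for key in union(keys(a), keys(b))
        v = op(get(a, key, 0), get(b, key, 0))
        if v != 0
            result[key] = v
        end
    end
    return result
end

a = Dict("word1" => 2, "word2" => 4)
b = Dict("word1" => 4, "word2" => 2, "word3" => 1)

sum_ab  = combine(+, a, b)   # "word3" is taken from b alone
diff_aa = combine(-, a, a)   # every dimension cancels, so nothing is stored
```

Iterating over the union of both key sets is what lets a word that appears in only one vector still contribute to the result.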

Multiplication and division by a scalar:

julia> fv = FeatureVector(["word1" => 1, "word2" => 3])
julia> fv*3
FeatureVector{ASCIIString,Int64}(["word1"=>3,"word2"=>9])

julia> fv/3
FeatureVector{Any,Float64}(["word1"=>0.3333333333333333,"word2"=>1.0])

If a FeatureVector holds Integer values, dividing it with // by an integer yields exact Rational values:

julia> fv//3
FeatureVector{ASCIIString,Rational{T<:Integer}}(["word1"=>1//3,"word2"=>1//1])

With any other value type, // raises an error:

julia> fv = FeatureVector(["word1" => 1.0, "word2" => 3.0])
julia> fv//3
ERROR: `//` has no method matching //(::Float64, ::Int64)
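The same distinction exists for plain numbers in Base Julia, which is where the error above originates. The short check below uses only Base and is written for current Julia (1.x), where the failure surfaces as a MethodError; the variable names are illustrative.

```julia
# Base Julia defines `//` for integers, producing an exact Rational.
third = 1 // 3

# There is no `//` method for a Float64 numerator, so the call throws
# a MethodError instead of returning a value.
float_failed = try
    1.0 // 3
    false
catch err
    err isa MethodError
end

println(third)         # 1//3
println(float_failed)  # true
```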

FeatureVector Functions:
keys(fv)

Cluster

The Cluster type is also a Dictionary container. However, it restricts mappings to Any => FeatureVector types and subtypes. This allows users to meaningfully label groups of FeatureVectors for Classification. The Cluster type also computes the centroid of the set.

An empty Cluster can be constructed like so:

cl = Cluster()
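As a rough illustration of what a centroid means for sparse vectors, the sketch below averages a group of plain dictionaries dimension by dimension. It is written for current Julia (1.x) and is not the package's implementation; the centroid function here is a hypothetical stand-in, and treating missing dimensions as 0 is an assumption.

```julia
# Hypothetical helper, not the package's implementation: the centroid is the
# per-dimension mean, with dimensions missing from a vector counted as 0.
function centroid(vectors::Vector{Dict{String,Float64}})
    isempty(vectors) && throw(ArgumentError("cannot take the centroid of an empty group"))
    total = Dict{String,Float64}()
    for v in vectors
        for (key, x) in v
            total[key] = get(total, key, 0.0) + x
        end
    end
    n = length(vectors)
    return Dict(key => x / n for (key, x) in total)
end

group = [
    Dict("word1" => 2.0, "word2" => 4.0),
    Dict("word1" => 4.0),
]
c = centroid(group)
println(c["word1"])  # 3.0
println(c["word2"])  # 2.0
```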

DataSet

The DataSet type is also a wrapper around a Dictionary. However, it restricts mappings to Any => Cluster types and subtypes.

An empty DataSet can be constructed like so:

ds = DataSet()

Clustering

k Means

Hierarchical


Classification

Proximity Based Classification


k Nearest Neighbors


Probability Based Classification


Distribution

The Distribution type is a container which ensures that its contents satisfy the axioms of probability.

An empty Distribution can be constructed like so:

ds = Distribution()
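To make the axioms concrete, the sketch below turns raw counts into probabilities that are non-negative and sum to 1. It uses a plain Dict in current Julia (1.x) and is not the package's Distribution type; normalize_counts is a hypothetical helper introduced only for illustration.

```julia
# Hypothetical helper, not the package's Distribution type: convert
# non-negative counts into probabilities that sum to 1.
function normalize_counts(counts::Dict{String,Int})
    any(v -> v < 0, values(counts)) && throw(ArgumentError("counts must be non-negative"))
    total = sum(values(counts))
    total > 0 || throw(ArgumentError("at least one count must be positive"))
    return Dict(key => n / total for (key, n) in counts)
end

probs = normalize_counts(Dict("word1" => 1, "word2" => 3))
println(probs["word1"])       # 0.25
println(sum(values(probs)))   # 1.0
```

Rejecting negative counts and an all-zero total up front means every value returned lies between 0 and 1.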

Naive Bayes


Text Processing

Processing XML Files