Classification

This module provides a framework to process, analyse, and categorize collections of documents.

Main Scripts

semantic-search.py - computes the semantic similarity between a search term and the sentences in a database (see the sketch after this list).

compareDataframes.py - shows the classification differences (shifts from FN to TP, etc.) between two semantic-search models.

explore_fasttext.py - changes the preprocessing steps (removing stopwords, stemming, splitting sentences in half, etc.) for specified sample sentences to show their effect on semantic similarity.

explore_corpus.py - gives an overview of the number of (unique) words in a document collection; also reports the frequency of specific terms and randomly selects context phrases.

explore_we_model.py - visualizes vector representations of words with heatmaps and shows the influence of averaging and of a principal component analysis.
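
To illustrate the idea behind semantic-search.py and explore_fasttext.py: a sentence can be represented as the average of its word vectors and ranked by cosine similarity to the query. The function names and the word-embedding interface below are illustrative assumptions, not the scripts' actual API:

# Sketch: rank sentences by cosine similarity of averaged word vectors.
# we_model is assumed to map a word to a numpy vector (e.g. Word2Vec
# or fastText embeddings); all names here are illustrative.
import numpy as np

def sentence_vector(sentence, we_model):
    # Average the vectors of all in-vocabulary words of the sentence.
    vectors = [we_model[w] for w in sentence.lower().split() if w in we_model]
    return np.mean(vectors, axis=0) if vectors else None

def rank_sentences(query, sentences, we_model):
    # Score every sentence against the query and sort best-first.
    q = sentence_vector(query, we_model)
    scored = []
    for s in sentences:
        v = sentence_vector(s, we_model)
        if q is not None and v is not None:
            cos = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
            scored.append((cos, s))
    return sorted(scored, reverse=True)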

Testing

The folder Unittests contains the tests corresponding to each module. nose provides an easy way to run all tests together.

Run the tests with:

nosetests Unittests/
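
Test modules follow the standard unittest conventions that nose discovers automatically; a minimal sketch of such a test file (the test case and its contents are illustrative, not taken from the repository):

import unittest

class PreprocessingTest(unittest.TestCase):
    def test_unique_word_count(self):
        # Counting distinct tokens in a toy sentence.
        words = "the cat sat on the mat".split()
        self.assertEqual(len(set(words)), 5)

if __name__ == '__main__':
    unittest.main()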

Install dependencies

The code builds on several modules for machine learning and natural language processing, as well as other Python libraries. To install them, make sure you have Python 2.7 and pip installed.

Upgrade pip:

pip install -U pip

Install the dependencies with:

pip install --user -r requirements.txt

Installing NLTK data

With nltk installed, download the required data packages by opening a Python shell and entering the following commands:

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
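
To verify the downloads, the resources can be exercised directly: punkt backs the tokenizers, wordnet the lemmatizer, and the averaged perceptron the part-of-speech tagger. For example:

import nltk

tokens = nltk.word_tokenize("The cats are sitting on the mat.")  # needs punkt
tags = nltk.pos_tag(tokens)  # needs averaged_perceptron_tagger
lemma = nltk.stem.WordNetLemmatizer().lemmatize("cats")  # needs wordnet
print(tags)
print(lemma)  # cat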

Installing FLASK framework

Install Flask and run the application:

pip install Flask
FLASK_APP=routes.py flask run --port 4000
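
FLASK_APP tells flask run which module contains the application object; a routes.py has roughly the following shape (the route shown is a placeholder, not the application's actual endpoint):

from flask import Flask

app = Flask(__name__)  # the object that flask run looks for

@app.route('/')
def index():
    return 'Classification service is running'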

Download pre-trained Word2Vec model

The word embeddings used for sentence classification with a convolutional neural network can either be trained on the specific collection or taken from a pre-trained model. Google provides such pre-trained word embeddings, trained on part of the Google News dataset (about 100 billion words); the model contains 300-dimensional vectors for 3 million words and phrases. The archive (1.5 GB) is available here. To use it, download the file and unpack it into the Word2Vec folder in the main directory.
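
One common way to load the unpacked archive is via gensim; gensim and the exact file name below are assumptions, not something this README prescribes:

from gensim.models import KeyedVectors

# Loading the binary archive keeps the 3 million 300-dimensional vectors
# in memory, so expect several GB of RAM usage.
model = KeyedVectors.load_word2vec_format(
    'Word2Vec/GoogleNews-vectors-negative300.bin', binary=True)
print(model['document'].shape)  # (300,)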

FastText also provides pre-trained word embeddings for different languages, which can be found here. Because fastText is trained on character n-grams, it can provide vectors for words that were not included in the training data. To load this model efficiently, a daemon process built on Pyro is used. Start the daemon process with:

python lda/WordEmbedding.py

Then run the respective script, e.g.:

python onlineLearning.py
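
The point of the daemon is to load the large fastText model once and let several scripts query it over Pyro instead of each reloading it. Conceptually, such a server looks roughly like this (class and method names are illustrative, not those of lda/WordEmbedding.py):

import Pyro4

@Pyro4.expose
class EmbeddingServer(object):
    def __init__(self, model):
        self.model = model  # mapping word -> numpy vector, loaded once

    def vector(self, word):
        # Lists serialize cleanly over Pyro; numpy arrays do not by default.
        return self.model[word].tolist()

if __name__ == '__main__':
    model = {}  # placeholder: load the fastText vectors here
    daemon = Pyro4.Daemon()
    uri = daemon.register(EmbeddingServer(model))
    print(uri)  # clients connect with Pyro4.Proxy(uri)
    daemon.requestLoop()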

Detailed list of dependencies