An unofficial implementation of Yoon Kim's "Convolutional Neural Networks for Sentence Classification" in Chainer.
Abstract (from the Cornell University Library): We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
The dataset comes from the Cornell movie review data.
# data location
data/
|_pos/
| |_cv000_01.txt
| |_cv000_02.txt
| :
|_neg/
  |_cv000_01.txt
  |_cv000_02.txt
  :
Alternatively, download the IMDB data like this:
cd data
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -O imdb.tar.gz
tar -xf imdb.tar.gz
Load the data:
import data_builder
data = data_builder.load_imdb_data()
data.get_info()
Data Info imdb
------------------------------
Vocab: 18744
Sentences: 10662
------------------------------
x_train: (5331, 1, 53)
x_test: (5331, 1, 53)
y_train: (5331,)
y_test: (5331,)
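The shapes above suggest that each sentence is stored as one row of 53 word indices, with an extra axis of size 1 (presumably a channel axis). The sketch below shows one way such arrays can be built. It is not the repository's data_builder code: build_vocab, encode, shape, and the use of index 0 for padding are assumptions made for illustration.

```python
# Hypothetical helpers for illustration; not part of data_builder.

def build_vocab(sentences):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentences, vocab):
    """Turn sentences into rows of ids, right-padded to the longest sentence.

    Each row is wrapped in a list so the result has the nested layout
    (n_sentences, 1, max_len), mirroring the shapes printed by get_info().
    """
    rows = [[vocab[token] for token in s.split()] for s in sentences]
    max_len = max(len(row) for row in rows)
    return [[row + [0] * (max_len - len(row))] for row in rows]


def shape(nested):
    """Return the (n, 1, max_len) shape of the nested list from encode()."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


sentences = ["a fine film", "dull", "not a fine film at all"]
vocab = build_vocab(sentences)
x = encode(sentences, vocab)
print(shape(x))  # (3, 1, 6)
```

Padding every sentence to a common length is what lets a whole batch be passed through the convolution as a single array.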
Then train:
import cnnsc
clf = cnnsc.sample_train(data, model_type="CNN_rand")
Results:
epoch elapsed_time main/loss validation/main/loss main/accuracy validation/main/accuracy
1 129.698 0.701306 0.692502 0.518555 0.497461
2 256.704 0.659536 0.689695 0.648438 0.550195
3 382.597 0.614391 0.688409 0.708333 0.532812
4 513.592 0.542008 0.687492 0.810547 0.566992
5 638.055 0.427749 0.695113 0.917969 0.516992
The other models need word2vec embeddings. Call embed() before starting training:
data.embed()
clf = cnnsc.sample_train(data, model_type="CNN_static")
# or
clf = cnnsc.sample_train(data, model_type="CNN_non_static")
# or
clf = cnnsc.sample_train(data, model_type="CNN_multi_ch")
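For orientation, the four model_type values correspond to the four variants in Kim's paper, which differ in how the word-embedding layer is initialised and whether it is updated during training. The summary below is a sketch based on the paper, not on this repository's code; EMBEDDING_SETUP and needs_word2vec are hypothetical names.

```python
# Hypothetical summary table; not taken from cnnsc.
# Each entry: (uses pre-trained word2vec vectors,
#              one boolean per embedding channel: is it updated in training?)
EMBEDDING_SETUP = {
    "CNN_rand":       (False, [True]),         # random init, learned from scratch
    "CNN_static":     (True,  [False]),        # word2vec, kept frozen
    "CNN_non_static": (True,  [True]),         # word2vec, fine-tuned
    "CNN_multi_ch":   (True,  [False, True]),  # one frozen and one fine-tuned channel
}


def needs_word2vec(model_type):
    """Return True if the variant starts from pre-trained word vectors."""
    return EMBEDDING_SETUP[model_type][0]


for name in EMBEDDING_SETUP:
    print(name, needs_word2vec(name))
```

Only CNN_rand starts from random vectors, which is why it is the one variant that can be trained without the embedding step.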
To use other data:
import cnnsc, data_builder
data = data_builder.Data("DATANAME", "LIST-OF-FILEPATH", "LABELS").load()
dataset = data.get_chainer_dataset()
clf = cnnsc.train(dataset=dataset)
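The file list and labels can be collected from a directory laid out like data/ above. The sketch below is one way to do that with the standard library. collect_files is a hypothetical helper, and the assumption that Data expects parallel lists of paths and integer labels (0 for neg, 1 for pos) is not confirmed by the source, so check data_builder before relying on it.

```python
from pathlib import Path


def collect_files(root, classes=("neg", "pos")):
    """Return (paths, labels) for every .txt file under root/<class>/.

    A label is the index of the class name in `classes`, so with the
    default order neg -> 0 and pos -> 1.
    """
    paths, labels = [], []
    for label, name in enumerate(classes):
        # sorted() keeps the order stable across runs and platforms.
        for path in sorted(Path(root, name).glob("*.txt")):
            paths.append(str(path))
            labels.append(label)
    return paths, labels


if __name__ == "__main__":
    paths, labels = collect_files("data")
    print(len(paths), "files,", sum(labels), "positive")
```

The two returned lists could then be passed in place of "LIST-OF-FILEPATH" and "LABELS", assuming that is the format Data accepts.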