castorini / MP-CNN-Torch

Multi-Perspective Convolutional Neural Networks for modeling textual similarity (He et al., EMNLP 2015)

Multi-Perspective Convolutional Neural Networks for Modeling Textual Similarity

NOTE: This repo contains code for the original Torch implementation from the EMNLP 2015 paper. The code is no longer maintained and has been superseded by a PyTorch reimplementation in Castor. This repo exists solely for archival purposes.

This repo contains the Torch implementation of multi-perspective convolutional neural networks for modeling textual similarity, as described in He et al. (EMNLP 2015).

This model does not require external resources such as WordNet or parsers, does not use sparse features, and achieves good accuracy on standard public datasets.

Installation and Dependencies

Running

The tool outputs Pearson correlation scores and writes the predicted similarity score for each sentence pair in the test data to the predictions directory.
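For readers who want to check the reported correlation against the written predictions, the following is a minimal sketch of Pearson's r in plain Python. The function name `pearson` and the two input lists are illustrative assumptions; this is not the tool's own evaluation code.

```python
import math

def pearson(predicted, gold):
    """Pearson correlation between two equal-length lists of scores."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    # Covariance and variances (un-normalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_g)
```

A value close to 1 means the predicted scores rise and fall with the gold scores; the absolute scale of the predictions does not matter.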

Adapting to a New Dataset

To run the model on your own dataset, first build the dataset in the format below and put it under the data folder:

Then build the vocabulary for your dataset, which writes vocab-cased.txt into your data folder:

$ python build_vocab.py
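As a rough illustration of what a vocabulary-building step typically does, here is a minimal sketch. It assumes whitespace-tokenized sentences, one per line, and one token per line in the output file; the helper names `collect_vocab` and `write_vocab` are hypothetical and are not taken from build_vocab.py, whose actual behavior may differ.

```python
def collect_vocab(lines):
    """Return the sorted set of whitespace-separated tokens (case preserved)."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return sorted(vocab)

def write_vocab(tokens, out_path):
    """Write one token per line (assumed layout of a vocabulary file)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")

if __name__ == "__main__":
    # Hypothetical example sentences; replace with lines read from your data files.
    sentences = ["A man is playing a guitar", "A woman is singing"]
    print(collect_vocab(sentences))
```

Case is preserved here because the output file is named vocab-cased.txt; whether the original script applies any further normalization is not stated in this README.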

Finally, change the training and model code slightly to process your dataset:

Then you should be able to run your training code.

Trained Model

We also provide a model that is already trained on the STS dataset, which is easier if you just want to use the model without re-training it.

The trained model download link is HERE. The model file size is 500MB. To use the trained model, use the code below:

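-- Load the trained model (stored in ASCII format) and switch to evaluation mode.
modelTrained = torch.load("download_local_location/modelSTS.trained.th", 'ascii')
modelTrained.convModel:evaluate()
modelTrained.softMaxC:evaluate()

-- One row per token; fill each tensor with the word embeddings of its sentence.
local linputs = torch.zeros(left_sentence_length, emd_dimension)
-- TODO: assign the left sentence's embedding values to linputs
local rinputs = torch.zeros(right_sentence_length, emd_dimension)
-- TODO: assign the right sentence's embedding values to rinputs

local part2 = modelTrained.convModel:forward({linputs, rinputs})
local output = modelTrained.softMaxC:forward(part2)
-- output:exp() is weighted by the scores 0..5 to get the expected score.
local val = torch.range(0, 5, 1):dot(output:exp())
return val/5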

The output variable val/5 is a similarity score in the range [0, 1]. The inputs linputs and rinputs are Torch tensors, and you need to fill in the word embedding values for both.
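The last two lines of the snippet above turn the model output into a single score. The sketch below reproduces that arithmetic in plain Python, assuming the model output is a vector of six log-probabilities over the integer scores 0 to 5 (inferred from the `output:exp()` and `torch.range(0, 5, 1)` calls). The function name `similarity_from_log_probs` is illustrative.

```python
import math

def similarity_from_log_probs(log_probs, max_score=5):
    """Expected score under the predicted distribution, rescaled to [0, 1]."""
    if len(log_probs) != max_score + 1:
        raise ValueError("expected one log-probability per score 0..max_score")
    # Convert log-probabilities back to probabilities.
    probs = [math.exp(lp) for lp in log_probs]
    # Expected score: sum over k of k * P(score = k).
    expected = sum(k * p for k, p in enumerate(probs))
    return expected / max_score

if __name__ == "__main__":
    uniform = [math.log(1.0 / 6.0)] * 6
    print(similarity_from_log_probs(uniform))
```

Because the result is an expectation over the six classes, it varies continuously between 0 and 1 rather than jumping between discrete score levels.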

Example Deployment Script with Our Trained Model

We provide one example file for deployment, testDeployTrainedModel.lua, so that you can use our model directly. Run:

$ th testDeployTrainedModel.lua

This deployment file uses the trained model (assuming you have downloaded it from the link above) and generates scores for all test sentences of the SICK dataset. Please note that the trained model was not trained on SICK data.

Acknowledgement

We thank Kai Sheng Tai for providing the preprocessing code. We also thank the public data providers and Torch developers.