castorini / castor

PyTorch deep learning models for text processing
http://castor.ai/
Apache License 2.0

Using pre-trained models on your own data #134

Closed: alicranck closed this issue 6 years ago

alicranck commented 6 years ago

Hi,

Is it possible to use the models with my own data (specifically MP-CNN)? I couldn't find anything in the documentation.

Thanks!

tuzhucheng commented 6 years ago

Hi @alicranck, thanks for your interest in our project!

Yes, you can, if you pre-process the data into the same input format.

Please take a look here as an example. The associated Python class to process this is at https://github.com/castorini/Castor/blob/master/datasets/sick.py.

MP-CNN takes pairs of texts as input. They are stored in a.toks and b.toks in the example above. You need an id and label for each pair, which are stored in id.txt and sim.txt respectively.
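As a rough sketch of that layout, the snippet below writes a list of text pairs into the four files named above. It assumes the files are line-aligned (line N of each file describes the same pair) and that tokens are separated by single spaces; check both against the SICK example before relying on it. The function name `write_pair_dataset` and the naive whitespace tokenization are illustrative only and are not part of Castor.

```python
from pathlib import Path


def write_pair_dataset(pairs, out_dir):
    """Write (id, text_a, text_b, score) tuples as four line-aligned files.

    Assumption: one pair per line, tokens separated by single spaces.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "id.txt", "w", encoding="utf-8") as f_id, \
         open(out / "a.toks", "w", encoding="utf-8") as f_a, \
         open(out / "b.toks", "w", encoding="utf-8") as f_b, \
         open(out / "sim.txt", "w", encoding="utf-8") as f_sim:
        for pair_id, text_a, text_b, score in pairs:
            f_id.write(f"{pair_id}\n")
            # Naive whitespace tokenization; swap in a real tokenizer if needed.
            f_a.write(" ".join(text_a.split()) + "\n")
            f_b.write(" ".join(text_b.split()) + "\n")
            f_sim.write(f"{score}\n")
    return out
```

Calling `write_pair_dataset([(1, "A man is running", "Someone runs", 4.5)], "my_data/train")` would produce one line in each of the four files under `my_data/train` (a hypothetical directory name).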

Victor0118 commented 6 years ago

You need to build a dataset reader and processor for your own dataset using torchtext. You can follow this file as an example: https://github.com/castorini/Castor/blob/master/datasets/trecqa.py.

Then add that to https://github.com/castorini/Castor/blob/master/common/dataset.py.
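To make the reader's job concrete, here is a minimal plain-Python sketch of the loading step. It is not the torchtext-based class used in trecqa.py; it only illustrates what any reader for this layout has to do: read the four parallel files (id.txt, a.toks, b.toks, sim.txt), check that they line up, and build one example per pair. The names `PairExample` and `read_pair_dataset` are hypothetical, and treating the label as a float is an assumption based on similarity-score data.

```python
from pathlib import Path
from typing import List, NamedTuple


class PairExample(NamedTuple):
    pair_id: str
    tokens_a: List[str]
    tokens_b: List[str]
    label: float  # assumption: labels are numeric similarity scores


def read_pair_dataset(data_dir) -> List[PairExample]:
    """Load line-aligned id.txt, a.toks, b.toks and sim.txt from data_dir."""
    d = Path(data_dir)
    names = ("id.txt", "a.toks", "b.toks", "sim.txt")
    columns = [(d / n).read_text(encoding="utf-8").splitlines() for n in names]

    # Every file must have the same number of lines, one per pair.
    lengths = {n: len(c) for n, c in zip(names, columns)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Files are not line-aligned: {lengths}")

    ids, a_lines, b_lines, sims = columns
    return [
        PairExample(i, a.split(), b.split(), float(s))
        for i, a, b, s in zip(ids, a_lines, b_lines, sims)
    ]
```

A torchtext-based reader would additionally define fields, build a vocabulary, and produce batches; the alignment check above is useful to run on your files before wiring them into one.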

We will write documentation on adding a new dataset soon. Thanks!

alicranck commented 6 years ago

Generating a.toks and b.toks files like those in the SICK dataset did the job. Thanks!