castorini / castor

PyTorch deep learning models for text processing
http://castor.ai/
Apache License 2.0

other dataset trained on vdpwi #138

Open xyx-x opened 6 years ago

xyx-x commented 6 years ago

I saw that dataset loading is fixed to four datasets (sick, msrvid, trecqa, wikiqa). I would like to know how to train vdpwi on other datasets and, moreover, how to organize the dataset properly. I tried copying my dataset into the 'sick' folder and my embeddings into the 'GloVe' folder, but the model trained with a loss of 0. Can you give me the correct instructions?

tuzhucheng commented 6 years ago

Hi, please refer to the instructions in #134. That issue is for another model, MP-CNN, but it should work the same way. If you have more questions let us know!

xyx-x commented 6 years ago

I stored the dataset as a.toks and b.toks because I previously used the Lua version of vdpwi. With this PyTorch version the code runs, but when I train the model the loss I get is 0. That is the problem.

xyx-x commented 6 years ago

I found the cause of the problem. My sim.txt contains only two values, 0 and 5 (the first 5000 lines are 5 and the remaining lines are 0), and the training loss is 0. When I changed sim.txt so that it contains the two values 0 and 4.5, the loss was no longer 0. How can I train correctly with my sim.txt? Can you give me some instructions?

tuzhucheng commented 6 years ago

It seems that your labels are binary. How about preprocessing your data by converting "5" to "1" so that you only have 0s and 1s? Then you can take a look at https://github.com/castorini/Castor/blob/master/datasets/trecqa.py and set NUM_CLASSES to 2.
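
The label conversion suggested above could be done with a small script like the one below. This is only a minimal sketch, assuming sim.txt holds one numeric label per line as described in this thread. The helper names `binarize_labels` and `rewrite_sim_file` are hypothetical and not part of Castor, and setting NUM_CLASSES to 2 remains a separate step in the dataset code.

```python
from pathlib import Path


def binarize_labels(lines, positive=5.0):
    """Map each label to "1" if it equals `positive`, or "0" if it is 0.

    Hypothetical helper (not part of Castor). Blank lines are skipped,
    and any other value raises ValueError so unexpected labels are not
    silently rewritten.
    """
    out = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        value = float(text)
        if value == positive:
            out.append("1")
        elif value == 0.0:
            out.append("0")
        else:
            raise ValueError(f"unexpected label: {text!r}")
    return out


def rewrite_sim_file(src, dst, positive=5.0):
    """Read labels from `src` and write the binarized labels to `dst`."""
    labels = binarize_labels(Path(src).read_text().splitlines(), positive)
    Path(dst).write_text("\n".join(labels) + "\n")
    return labels
```

For example, `rewrite_sim_file("sim.txt", "sim.binary.txt")` would write the converted labels to a new file (the output file name here is just an illustration), leaving the original sim.txt untouched so it can be compared or restored.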