facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License
3.94k stars 531 forks

initModel with fasttext #163

Closed roskoN closed 5 years ago

roskoN commented 6 years ago

Hi,

First of all, thank you for making this great project open to the public. I have already achieved some good results with it in my information retrieval experiments.

Now I am trying to fine-tune and improve the model by reusing pretrained embeddings from fastText. In #94 it was mentioned that one could just take the fastText vectors, change the extension to .tsv, and specify the dimension with the -dim argument (300 in my case). I did so, but StarSpace overrides the dimension to 301, which of course then produces an error about excess fields. I also had to remove the first line, which specifies the word count and dimension size, otherwise the dimension gets overridden to 1. Am I doing something wrong? Here is the output I get:

./starspace train -trainFile ../../PycharmProjects/paragraph-selection/train_data -model my_model -fileFormat 'labelDoc' -thread 10 -epoch 3 -initModel ../../Downloads/wiki-fix.en.tsv -dim 300
Arguments: 
lr: 0.01
dim: 300
epoch: 3
maxTrainTime: 8640000
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: cosine
maxNegSamples: 10
negSearchLimit: 50
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: labelDoc
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
Start to load a trained embedding model in tsv format.
Setting dim from Tsv file to: 301
Loading dict from model file : ../../Downloads/wiki-fix.en.tsv
Number of words in dictionary:  2519371
Number of labels in dictionary: 0
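For reference, the conversion described above can be sketched in a few lines of Python (the function name and file paths are hypothetical, not part of StarSpace): it skips the fastText "<count> <dim>" header line, rewrites each row with tab separators, and drops the trailing whitespace that trips up StarSpace's dimension detection:

```python
def vec_to_tsv(vec_path, tsv_path):
    """Convert a fastText text-format .vec file to a StarSpace-style .tsv.

    fastText's first line is "<vocab_size> <dim>"; StarSpace expects no header.
    """
    with open(vec_path, encoding="utf-8") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        _, dim = src.readline().split()   # skip the "<count> <dim>" header
        for line in src:
            fields = line.split()         # split() discards trailing whitespace
            if len(fields) != int(dim) + 1:
                raise ValueError(f"bad row for {fields[0]!r}")
            dst.write("\t".join(fields) + "\n")
```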

EDIT1: I looked a bit closer into the code of StarSpace::initFromTsv and saw there is a call to boost::split that splits the string on spaces or tabs. It returns a vector whose last element is empty, and the element count minus one is used as dim. Because of that one "additional" element, the code thinks there are 301 elements instead of 300. I assume this is due to the trailing whitespace at the end of each line. I could send a PR fixing this later.
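The off-by-one is easy to reproduce outside of C++; here is a small Python sketch mimicking boost::split's behavior of keeping empty trailing tokens:

```python
# fastText writes a trailing space before the newline on every row.
line = "word 0.1 0.2 0.3 \n"

# Splitting on a single delimiter (like boost::split on ' '/'\t')
# keeps the empty token after the last space:
pieces = line.rstrip("\n").split(" ")
print(len(pieces) - 1)   # 4 -- the dimension is over-counted by one

# Python's whitespace split drops empty tokens and gives the true dim:
print(len(line.split()) - 1)   # 3
```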

Cheers

ledw commented 6 years ago

@roskoN thanks for reporting. Yes please send a PR if that's the case.

ledw commented 6 years ago

@roskoN Hi, do you still observe the off-by-one issue from importing fastText model? If so, would you mind sending a PR to fix? Thanks.

ledw commented 5 years ago

Closing the issue as there has been no recent update.

LAV42 commented 5 years ago

This still happens with models I train myself with fastText: there is an extra space at the end of each line. It is easily fixed.

Remove the first line:

sed -i '1d' model.tsv

Remove the trailing space at the end of each line (the pattern below strips trailing spaces without eating a real character from lines that have none):

sed -i 's/ *$//' model.tsv
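After cleaning, it is worth sanity-checking the result; a quick awk one-liner (assuming 300-dimensional vectors, i.e. word + 300 values = 301 tab-separated fields per line):

```shell
# Flag any line that does not have exactly 301 tab-separated fields.
awk -F'\t' 'NF != 301 { print "line " NR ": " NF " fields"; bad = 1 }
            END { exit bad }' model.tsv && echo "model.tsv looks clean"
```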

LAV42 commented 5 years ago

Side note: one must make sure that the labels are on the last lines of the embedding file; otherwise, somehow, pretty much any word gets used for prediction.