Closed: roskoN closed this issue 5 years ago
@roskoN thanks for reporting. Yes please send a PR if that's the case.
@roskoN Hi, do you still observe the off-by-one issue from importing fastText model? If so, would you mind sending a PR to fix? Thanks.
Closing issue as no recent update.
It still happens with models I train myself with fastText: there is an extra space at the end of each line. This is easily fixed.

Remove the first line (the one with the vocabulary size and dimension):

```
sed -i '1d' model.tsv
```

Remove the extra space at the end of each line:

```
sed -i 's/.$//' model.tsv
```
Side note: make sure the label rows are the last lines of the embedding file; otherwise, somehow, pretty much any word gets used for prediction.
Hi,
First of all, thank you for making this great project open to the public. I have already achieved some good results with it in my information retrieval experiments.
Now I am trying to fine-tune and improve the model by reusing pretrained embeddings from fastText. In #94, it was mentioned that one could just take the fastText vectors, change the extension to .tsv, and specify the dimensions with the `-dim` argument set to 300. So I did, but the dimension gets overridden by StarSpace to 301, and of course an error about excess fields follows. I additionally had to remove the first line, which specifies the line count and the dimension size; otherwise the dimension gets overridden to 1. Am I doing something wrong? Here is the output I get:

EDIT 1: I looked a bit closer into the code of `StarSpace::initFromTsv` and saw there is a call to `boost::split` that splits the string along spaces or tabs. It returns a vector whose last element is empty, and `vec.count - 1` is used as the dimension. Because of that one "additional" element in the list, the code thinks there are not 300 elements but 301. I assume this is due to the newline character at the end of each line. I could later send a PR fixing this.

Cheers