I have tried doc2vec (from gensim, based on word2vec), with which I can extract a fixed-length vector for variable-length paragraphs. Can I do the same with fastText?
Thank you!

Yes, you can.
Could you show me how to do that in fastText? Thx.
Roughly (correct me if I'm wrong): for doc2vec, the paragraph vector is obtained by adding an identifier token to each document; the vector representation of a particular document is then the vector of that identifier token. I have not tried this with fastText yet, but I guess that if you added an identifier token to your document, you would get a vector for the doc. But from reading the paper, fastText uses subword/character-level features as well, so I'm not sure how good the resulting vectors would be. Let me know if you have any good/bad results.
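Not a fastText feature, but here is a minimal sketch of that identifier-token idea using the fasttext pip package. The DOC_<n> tags and file name are made up for illustration; setting minn=0/maxn=0 keeps the tag from being decomposed into character n-grams, at the cost of disabling subwords for all other words too:

# Sketch of the "identifier token" trick, assuming the `fasttext` pip package.
# Each document gets a unique hypothetical DOC_<n> tag token; after training,
# that token's vector serves as a crude paragraph vector.
import fasttext

docs = ["the cat sat on the mat", "dogs chase cats"]
with open("tagged.txt", "w") as f:
    for i, doc in enumerate(docs):
        f.write("DOC_%d %s\n" % (i, doc))

# minn=0 / maxn=0 turns off character n-grams entirely, so the tag is not
# built from subwords; note this also disables subwords for real words.
model = fasttext.train_unsupervised(
    "tagged.txt", model="skipgram", minCount=1, minn=0, maxn=0)
doc_vec = model.get_word_vector("DOC_0")  # crude paragraph vector for doc 0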
I don't see a PV-Doc2Vec feature in fastText, currently. When he was at Google, Mikolov once offered a quick-and-minimal patch to word2vec.c that enabled PV by treating the first token in a sentence as the special paragraph-vector, added to all sliding contexts. A similarly minimal bolt-on seems possible here, but if the feature were considered important, more capabilities might be desirable: allowing multiple such tags per text, or storing them separately from words. (Doc2Vec in gensim does both of these.) And as @amirothman implies, you'd likely want to exempt such symbolic tags from fastText's subword composition.
Depending on the task you want to perform, you may simply average the word embeddings of all words in the paragraph. (If doing so, you should normalize the word vectors first, so that they all have a norm equal to one.) A small sketch of this is below.
According to Kenter et al. 2016, this approach "has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks.
However, according to Le and Mikolov, this method performs poorly for sentiment analysis tasks and/or long texts, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm".
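For what it's worth, a small numpy sketch of that baseline. The file path and helper names are illustrative; the point is just to divide each word vector by its L2 norm and then average:

import numpy as np

def load_vectors(path):
    # Load a word -> vector dict from a .vec-style text file
    # (header line, then one word per line followed by its components).
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "n_words dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

def paragraph_vector(words, vecs):
    # Average of unit-normalized word vectors (the strong baseline above).
    unit = [vecs[w] / np.linalg.norm(vecs[w]) for w in words if w in vecs]
    return np.mean(unit, axis=0)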
I like it...
According to Bag of Tricks for Efficient Text Classification:
Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations.
So, I guess averaging the vectors won't be too bad of an idea in this case.
@amirothman: the embeddings learned in Bag of Tricks for Efficient Text Classification were specifically trained for a classification task; that's why you can average them (another similar paper: https://arxiv.org/abs/1606.04640). On the other hand, as @EdouardGrave said on https://github.com/facebookresearch/fastText/issues/46: "Averaging the vectors from a pre-trained skip-gram model to obtain vectors for larger chunks of text does not work well for classification tasks. This has been observed by multiple authors from the NLP community."
So what is the best way to get a vector representation of a paragraph for sentiment classification?
@mbaniasad Reiterating @a455bcd9: if the word vectors are produced with fastText's supervised classification training, then averaging the word vectors is apparently reasonable. If the word vectors are produced via some other method (e.g., skipgram), then this is not a good idea (and you could instead look at something like the Doc2Vec implementation in gensim).
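For reference, a minimal gensim Doc2Vec sketch (assuming gensim 4.x; the corpus and tags are made up for illustration):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=["doc_0"]),
    TaggedDocument(words=["dogs", "chase", "cats"], tags=["doc_1"]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

trained_vec = model.dv["doc_0"]                     # vector of a training tag
new_vec = model.infer_vector(["a", "new", "text"])  # vector for unseen text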
Hi,
Since commit d652288bad3d5e30a350c67599df4c92dc471960, there is an option in fasttext to output the vector of a paragraph for a model trained for text classification.
You can use it as:
$ ./fasttext print-vectors model.bin < text.txt
print-vectors checks the kind of model (supervised classification vs. word representation) and outputs either one vector per line or one vector per word accordingly.
In short, how does it calculate the sentence-level vector?
I'm curious as to why sum(vec(words)) != vec(sentence).
Don't want to make any assumptions, but looking at lines 367-385 in the source code for fasttext.cc, it appears as if they're simply averaging the normalized word vecs in the sentence.
Let me share my summary; please correct me if anything is wrong:
fasttext supervised -input your_training_text -output your_model
fasttext print-sentence-vectors your_model.bin < your_test_sentence.txt
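The same two steps via the fasttext pip package, in case that's more convenient (assuming a training file in the usual __label__ format):

import fasttext

# train.txt: one example per line, e.g. "__label__positive great movie"
model = fasttext.train_supervised("train.txt")

# For a supervised model this is the (plain) average of the word vectors.
vec = model.get_sentence_vector("this was a great movie")
print(len(vec))  # embedding dimension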
I might be slow, but I still don't get how this is supposed to work. I'm interested in forming document vectors for arbitrary blocks of text. To me, it makes sense that I should train word vectors the unsupervised way and combine those vectors. But reading the discussion above, it seems only vectors trained for classification are meant to be combined. The problem is: what should my labels be? One label for all my text?
I am able to print the sentence vector with a skipgram model. It prints a vector of the same length as the word vectors for this model. Is there any insight into whether this sentence vector is actually calculated correctly and, if so, how it is generated?
Does the gensim implementation of fastText have the ability to produce sentence vectors?
@a455bcd9 when you say:
If doing so, you should normalize the word vectors first, so that they all have a norm equal to one.
I guess you mean normalizing element-wise (across each dimension for all words), right? As @wjgan7 and @piotr-bojanowski pointed out, fastText doesn't seem to bother with such a normalization step; it simply averages the word vectors element-wise.
Am I missing anything?
@rajivgrover009 : no, at the moment it does not.
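Not built in, but you can emulate one by hand on top of gensim's FastText wrapper. A sketch, assuming gensim 4.x (the corpus here is made up):

import numpy as np
from gensim.models import FastText

corpus = [["the", "cat", "sat"], ["dogs", "chase", "cats"]]
model = FastText(corpus, vector_size=50, min_count=1, epochs=10)

def avg_sentence_vector(words):
    # Average of unit-normalized word vectors; OOV words still get
    # vectors through gensim's character n-gram handling.
    vecs = [model.wv[w] for w in words]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    return np.mean(vecs, axis=0)

print(avg_sentence_vector(["cats", "sat"]).shape)  # (50,)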
@mino98 just looking at getSentenceVector in fastText/src/fasttext.cc, it seems that each word vector is divided by its norm (computed by Vector::norm() in fastText/src/vector.cc) before being averaged. Note that getSentenceVector also averages in the vector of the EOS token, </s>.
Doc2Vec works well but breaks when a word is not in the vocabulary, whereas fastText does not, because it uses an n-gram approach. In gensim's Doc2Vec you can tag a document, or label one particular sentence. How can tagging be done in fastText? I see the text classification tutorial (https://fasttext.cc/docs/en/supervised-tutorial.html), but how can we achieve tagging in fastText? I hope my question aligns with this GitHub issue, which asks how Doc2Vec-like functionality can be achieved using fastText.
Is it clear to anyone how sentence vectors are created by fastText? They are surely not the plain average of word vectors. I tried all the possibilities, i.e. normalizing the word vectors and then averaging, or simply averaging, but the result doesn't match the sentence vector produced by fastText.
Any insight?
Sorry, I missed @YannDubs' point... true, they are using the </s> vector in the averaging. Based on some experiments I was doing with fastText (supervised, i.e. for sentence classification), here are some basic points that are simple but may confuse someone:
I think this answers it:
When trained for classification, the sentence embedding is just the average of the word embeddings. When trained unsupervised, the sentence embedding is calculated by dividing each word embedding by its (L2) norm and then averaging these scaled embeddings.
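A rough Python sketch of both rules, assuming the fasttext pip package. This only approximates the real getSentenceVector (which for supervised models also folds in word n-grams, and whose exact handling of </s> depends on the version); the EOS token is included per @YannDubs' observation above:

import numpy as np
import fasttext

def sentence_vector(model, sentence, supervised):
    # Approximate fastText's sentence vector from its word vectors.
    words = sentence.split() + ["</s>"]  # the EOS vector is averaged in too
    vecs = [model.get_word_vector(w) for w in words]
    if not supervised:
        # Unsupervised models: average of L2-normalized word vectors.
        vecs = [v / np.linalg.norm(v) for v in vecs if np.linalg.norm(v) > 0]
    # Supervised models: plain (unnormalized) average.
    return np.mean(vecs, axis=0)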