facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

How can we get the vector of a paragraph? #26

Closed xchangcheng closed 8 years ago

xchangcheng commented 8 years ago

I have tried doc2vec (from gensim, based on word2vec), with which I can extract a fixed-length vector for variable-length paragraphs. Can I do the same with fastText?

Thank you!

Developerayo commented 8 years ago

Yes, you can.

xchangcheng commented 8 years ago

Could you show me how to do that in fastText? Thx.

amirothman commented 8 years ago

Roughly, correct me if I'm wrong: for doc2vec, the paragraph vector is obtained by having an identifier token within a document. The vector representation of a particular document would be the vector of that identifier token. I have not tried this with fastText yet, but I guess if you added an identifier token to your document, you would get a vector for the doc. From reading the paper, though, fastText uses subword/character-level features as well, so I'm not sure how good the resulting vectors would be. Let me know if you have any good/bad results.
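
For reference, the identifier-token (tag) mechanism described above is what gensim's Doc2Vec exposes directly. A minimal sketch, assuming gensim 4 (the documents, tag names, and hyperparameters below are made up for illustration; older gensim versions use model.docvecs instead of model.dv):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # each document carries an identifier tag; the trained vector of that
    # tag serves as the paragraph vector
    docs = [
        TaggedDocument(words=["a", "small", "example", "paragraph"], tags=["DOC_0"]),
        TaggedDocument(words=["another", "short", "document"], tags=["DOC_1"]),
    ]
    model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)
    print(model.dv["DOC_0"])  # paragraph vector of the first document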

gojomo commented 8 years ago

I don't see a PV-Doc2Vec feature in fastText currently. When he was at Google, Mikolov once offered a quick-and-minimal patch to word2vec.c that enabled PV by treating the first token in a sentence as the special paragraph vector, added to all sliding contexts. A similarly minimal bolt-on seems possible here, but if the feature were considered important, more capabilities might be desirable, such as allowing multiple such tags per text or storing them separately from words. (Doc2Vec in gensim does both of these.) And as @amirothman implies, you'd likely want to exempt such symbolic tags from fastText's subword composition.

a455bcd9 commented 8 years ago

Depending on the task you want to perform, you may simply average the word embeddings of all words in the paragraph. (If you do so, you should normalize the word vectors first, so that they all have a norm equal to one.)

According to Kenter et al. 2016, this approach "has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks.

However, according to Le and Mikolov, this method performs poorly for sentiment analysis tasks and/or long texts, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm".
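
As a concrete illustration of this normalize-then-average baseline, here is a minimal sketch using the official fastText Python bindings (the model file name and sample text are assumptions for illustration, and whitespace tokenization is a simplification):

    import numpy as np
    import fasttext

    model = fasttext.load_model("model.bin")  # hypothetical trained model

    def average_paragraph_vector(paragraph):
        # normalize each word vector to unit norm, then average
        vecs = []
        for word in paragraph.split():
            v = model.get_word_vector(word)
            norm = np.linalg.norm(v)
            if norm > 0:  # skip all-zero vectors
                vecs.append(v / norm)
        if not vecs:
            return np.zeros(model.get_dimension())
        return np.mean(vecs, axis=0)

    print(average_paragraph_vector("this is a short example paragraph"))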

shriabhi78 commented 8 years ago

I like it...

amirothman commented 8 years ago

According to Bag of Tricks for Efficient Text Classification

Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations.

So I guess averaging the vectors won't be too bad an idea in this case.

a455bcd9 commented 8 years ago

@amirothman: the embeddings learned in Bag of Tricks for Efficient Text Classification were specifically trained for a classification task; that's why you can average them (another similar paper: https://arxiv.org/abs/1606.04640). On the other hand, as @EdouardGrave said on https://github.com/facebookresearch/fastText/issues/46: "Averaging the vectors from a pre-trained skip-gram model to obtain vectors for larger chunks of text does not work well for classification tasks. This has been observed by multiple authors from the NLP community."

mbaniasad commented 8 years ago

So what is the best way to get a vector representation of a paragraph for sentiment classification?

bkj commented 8 years ago

@mbaniasad Reiterating @a455bcd9: if the word vectors are produced with fastText, then averaging the word vectors is apparently reasonable. If the word vectors are produced via some other method (e.g. skip-gram), then this is not a good idea (and you could look at something like the Doc2Vec implementation in gensim instead).

piotr-bojanowski commented 8 years ago

Hi,

Since commit d652288bad3d5e30a350c67599df4c92dc471960, there is an option in fastText to output the vector of a paragraph for a model trained for text classification. You can use it as:

$ ./fasttext print-vectors model.bin < text.txt

print-vectors checks the kind of model (supervised classification / word representation) and outputs either one vector per line or one vector per word.
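
For those using the Python bindings rather than the CLI, a rough equivalent of the command above might look like the following sketch (the model path is hypothetical, and get_sentence_vector is the bindings' wrapper around the same per-line logic):

    import sys
    import fasttext

    model = fasttext.load_model("model.bin")  # hypothetical path
    for line in sys.stdin:
        # one vector per line of input text, as print-vectors does
        # for a classification model
        vec = model.get_sentence_vector(line.rstrip("\n"))
        print(" ".join("%.5f" % x for x in vec))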

gandalfsaxe commented 7 years ago

In short, how does it calculate the sentence-level vector?

I'm curious as to why sum(vec(words)) != vec(sentence).

wjgan7 commented 7 years ago

Don't want to make any assumptions, but looking at lines 367-385 of the source code in fasttext.cc, it appears that they're simply averaging the normalized word vectors in the sentence.

daisukelab commented 7 years ago

Let me share my summary; I welcome your comments on anything wrong:

  1. Make sure to put the label as the first item on each line of the training text dataset.
  2. Train a model specifically for classification. ex) fasttext supervised -input your_training_text -output your_model
  3. Get the sentence-level vector with the model (see the sketch below). ex) fasttext print-sentence-vectors your_model.bin < your_test_sentence.txt
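
A minimal sketch of these three steps with the official Python bindings (the training file name, its contents, and the test sentence are hypothetical):

    import fasttext

    # training file: one example per line with the label first, e.g.
    #   __label__positive I loved this movie
    model = fasttext.train_supervised(input="your_training_text.txt")

    # sentence-level vector for a new sentence
    vec = model.get_sentence_vector("this movie was great")
    print(vec.shape)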

EmilStenstrom commented 7 years ago

I might be slow, but I still don't get how this is supposed to work. I'm interested in forming document vectors for arbitrary blocks of text. To me, it makes sense that I should train word vectors the unsupervised way and combine those vectors. But reading the discussion above, it seems only the vectors trained for classification are meant to be combined. The problem is, what should my labels be? One label for all my text?

murhafh commented 7 years ago

I am able to print the sentence vector with a skipgram model. It prints a vector of the same length as the word vectors for this model. Are there any insights into whether this sentence vector is actually calculated correctly and, if so, how it is generated?

rajivgrover009 commented 6 years ago

Does the gensim implementation of fastText have the ability to produce sentence vectors?

mino98 commented 6 years ago

@a455bcd9 when you say:

If doing so, you should normalize the word vectors first, so that they all have a norm equal to one.

I guess you mean normalize element-wise (across each dimension for all words), right? As @wjgan7 and @piotr-bojanowski pointed out, fastText doesn't seem to bother with such a normalization step; it simply averages (element-wise) the word vectors.

Am I missing anything?


@rajivgrover009: no, at the moment it does not.

YannDubs commented 6 years ago

@mino98 Just looking at getSentenceVector in fastText/src/fasttext.cc, it seems that for a supervised model it simply averages the word vectors, while for an unsupervised model it divides each word vector by its L2 norm before averaging (skipping zero-norm vectors).

Note that getSentenceVector also averages in the vector of the EOS token: </s>.

bhushanbrb commented 6 years ago

Doc2Vec works well but breaks when a word is not in the vocabulary, whereas fastText does not, because it uses an n-gram approach. In gensim Doc2Vec you can tag the document, or label one particular sentence. How can tagging be done in fastText? I see the text classification tutorial at https://fasttext.cc/docs/en/supervised-tutorial.html. How can we achieve tagging in fastText? I hope my question aligns with this GitHub issue, which asks how Doc2Vec-like functionality can be achieved with fastText.

DhananjayKimothi commented 5 years ago

Is it clear to anyone how sentence vectors are created by fastText? These are surely not the average of the word vectors. I tried all possibilities, i.e. normalizing the word vectors and then averaging, or simply averaging, but the result doesn't match the sentence vector produced by fastText.

Any insight?

DhananjayKimothi commented 5 years ago

Sorry, I missed @YannDubs's point: true, they are using the </s> vector in the average. Based on some experiments I was doing with fastText (supervised, i.e. for sentence classification), here are some basic points that are simple but may confuse someone:

  1. Word vectors are learned in the process of training fastText, the same as in any other word2vec-based model. (Run with different numbers of epochs, test the vector for any word, and there you go.)
  2. For a new sentence, the word vectors are averaged; note @YannDubs's point that the </s> vector is averaged in along with the words.
  3. If a word that is not already in the vocabulary appears in a new sentence, it is simply ignored. If you check the vector for any such word you will get all zeros; since it is not considered at all, it does not change the average value. (A quick check follows below.)
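
A quick check of point 3, sketched with the Python bindings (the model file and the made-up token are assumptions; the all-zeros behavior assumes a supervised model trained without subword n-grams, i.e. the default minn=0, maxn=0):

    import numpy as np
    import fasttext

    model = fasttext.load_model("your_model.bin")  # hypothetical supervised model
    oov = model.get_word_vector("definitelynotinthevocabulary")
    print(np.all(oov == 0))  # expected: True when subword n-grams are disabled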

Jakobhenningjensen commented 1 year ago

I think this answers it:

When trained for classification, the sentence embedding is just the average of the word embeddings. When trained unsupervised, the sentence embedding is calculated by dividing each word embedding by its (L2) norm and then averaging these scaled embeddings.
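
As a sanity check of that description for the unsupervised case, here is a sketch that re-implements the normalized average and compares it with the bindings' own get_sentence_vector (the model file and test sentence are assumptions; exact agreement can depend on the fastText version and its EOS/tokenization handling):

    import numpy as np
    import fasttext

    model = fasttext.load_model("unsup_model.bin")  # hypothetical skipgram model

    def sentence_vector(text):
        total = np.zeros(model.get_dimension())
        count = 0
        for word in text.split():
            v = model.get_word_vector(word)
            norm = np.linalg.norm(v)
            if norm > 0:            # zero-norm vectors are skipped
                total += v / norm   # divide each embedding by its L2 norm
                count += 1
        return total / count if count else total

    text = "the quick brown fox"
    print(np.allclose(sentence_vector(text), model.get_sentence_vector(text), atol=1e-5))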