bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

difficulty in understanding starspace_embedding() behavior #6

Open fzhang612 opened 5 years ago

fzhang612 commented 5 years ago

I am trying to replicate the return value of the starspace_embedding() function. Here is what I have found so far.

When a model is trained with ngrams = 1, starspace_embedding(model, 'word1 word2') equals as.matrix(model)['word1', ] + as.matrix(model)['word2', ], normalized accordingly. However, this no longer holds when the model is trained with ngrams > 1.
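
For reference, a minimal sketch of the check being described (assuming model is a ruimtehol model trained with ngrams = 1 and that both words occur in the dictionary; the exact normalisation is clarified further down in the thread):

## hypothetical check: compare the document embedding with the manual sum
emb <- as.matrix(model)                        # dictionary of word embeddings
manual <- emb["word1", ] + emb["word2", ]      # sum of the two word vectors
doc <- starspace_embedding(model, "word1 word2")
## 'doc' should match 'manual' up to a normalisation factor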

thanks in advance

jwijffels commented 5 years ago

If you want embeddings of ngrams and your model is trained with ngram > 1, you should probably use starspace_embedding(model, 'word1 word2', type = "ngram"). The embeddings are also governed by the parameter p, which can be passed on to the starspace function and which defaults to 0.5. From the Starspace docs:
-p normalization parameter: we normalize the sum of embeddings by dividing by Size^p; when p = 1, it's equivalent to taking the average of embeddings; when p = 0, it's equivalent to taking the sum of embeddings. [0.5]
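
To illustrate that normalization parameter, here is a small self-contained sketch (using a made-up embedding matrix rather than a trained model): the sum of the word vectors is divided by Size^p, where Size is the number of embeddings being combined.

## toy example of the Size^p normalisation described above
emb <- rbind(word1 = c(0.1, 0.2, 0.3),
             word2 = c(0.4, 0.5, 0.6))
size <- nrow(emb)
colSums(emb) / size^0    # p = 0  : plain sum
colSums(emb) / size^0.5  # p = 0.5: the default
colSums(emb) / size^1    # p = 1  : average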

fzhang612 commented 5 years ago

Thanks for the response. However, I am now confused about the difference between starspace_embedding(model, 'word1 word2', type = 'document') and starspace_embedding(model, 'word1 word2', type = 'ngram'). If the latter is the embedding of the bigram word1_word2 in a model trained with ngrams = 2, what does the former represent and how is it calculated? Thanks.

jwijffels commented 5 years ago

Did you check by dividing your embedding summation by Size^p as I indicated? Size is in your case 2, as you have 2 words, and p is by default 0.5. That is what you get if you specify type = "document". If you specify type = "ngram", starspace uses the hashing trick from fastText to find out in which bucket the ngram lies and then retrieves the embedding of that bucket. You can inspect the C++ code for that.

fzhang612 commented 5 years ago

Yes, I did divide the embedding summation by Size^p. Let me rephrase my question more clearly.

If the model is trained with similarity = 'dot', p = 0.5, ngrams = 1, then the following holds: starspace_embedding(model, 'word_1 word_2', type = 'document') = (as.matrix(model)['word_1', ] + as.matrix(model)['word_2', ]) / sqrt(2)

However, if the model is trained with ngrams = 2, keeping all other parameters the same, the above equation no longer holds.

What am I missing about the difference between an ngrams = 1 model and an ngrams = 2 model?

Thanks

jwijffels commented 5 years ago

In short: for the unigram words, the embeddings are retrieved directly from the dictionary; for the bigram (if the model was trained with ngrams > 1), Starspace retrieves the embedding of the corresponding hashed bucket. That is, Starspace only stores embeddings of single words, not of bigrams; bigrams or ngrams are mapped to hashed combinations of the words that make up the ngram.
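
A quick way to see this from R (a sketch, assuming model was trained with ngrams = 2 as in the examples below): the embedding matrix returned by as.matrix() only has rows for single words and labels, while the bigram embedding has to be requested with type = "ngram", which triggers the hashed-bucket lookup inside the C++ code.

## sketch: the dictionary holds no row for the bigram itself
emb <- as.matrix(model)
head(rownames(emb))                           # single words and labels only
c("federale", "politie") %in% rownames(emb)   # the unigrams are in the dictionary
## the bigram embedding comes from a hashed bucket, not from the dictionary
starspace_embedding(model, "federale politie", type = "ngram")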

jwijffels commented 5 years ago

Nevertheless, although these hashed combinations cannot be reproduced from R without touching some C++ code, I tried the following experiments. In the last example I hoped that taking the ngram embedding of the bigram and adding it to the embeddings of the unigrams would give the same result as the document embedding, but apparently that is not what happens. Maybe it would be good to ask this to the Starspace authors themselves.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
## basic cleaning: separate digits glued to sentence-ending dots, split on
## non-word characters, drop empty tokens and paste the tokens back together
dekamer$text <- gsub("\\.([[:digit:]]+)\\.", ". \\1.", x = dekamer$question)
dekamer$text <- strsplit(dekamer$text, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

## experiment 1: similarity = "dot", ngram = 1, p = 0.5
## expectation: document embedding = sum of the word embeddings divided by Size^p = 2^0.5
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 1, p = 0.5,
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5

## experiment 2: similarity = "cosine", ngram = 1
## expectation: document embedding = sum of the word embeddings scaled to unit length
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "cosine",
                        early_stopping = 0.8, ngram = 1, 
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding

## experiment 3: similarity = "dot", ngram = 2, p = 0 (plain sum, no normalisation)
## expectation: document embedding = unigram embeddings + hashed bigram embedding
## this does not work as expected - it really makes sense to ask the Starspace authors
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 2, p = 0,
                        dim = 10, minCount = 5)
emb_doc <- starspace_embedding(model, "federale politie", type = "document")
emb_ngram <- starspace_embedding(model, "federale politie", type = "ngram")
embedding_dictionary <- as.matrix(model)
emb_doc
manual <- rbind(embedding_dictionary[c("federale", "politie"), ], 
                emb_ngram)
colSums(manual)
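
One extra diagnostic that could be run at this point (just a guess at what the document embedding might be doing, not a confirmed behaviour): compare emb_doc against the sum of the unigram embeddings alone, to see whether the document embedding perhaps ignores the hashed bigram bucket altogether.

## additional check (hypothetical): document embedding versus unigrams only
colSums(embedding_dictionary[c("federale", "politie"), ])
emb_doc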