bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

difficulty in understanding starspace_embedding() behavior #6

Open fzhang612 opened 5 years ago

fzhang612 commented 5 years ago

I am trying to replicate the return value of the starspace_embedding() function. Here is what I have found so far.

When a model is trained with ngrams = 1, starspace_embedding(model, 'word1 word2') equals as.matrix(model)['word1', ] + as.matrix(model)['word2', ], normalized accordingly. However, this no longer holds when the model is trained with ngrams > 1.
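
For reference, a minimal sketch of the check being described (assuming model is a ruimtehol model trained with ngrams = 1 and that both words occur in the dictionary; the exact normalisation is clarified further down in the thread):

## hypothetical check: compare the document embedding with the manual sum
emb <- as.matrix(model)                        # dictionary of word embeddings
manual <- emb["word1", ] + emb["word2", ]      # sum of the two word vectors
doc <- starspace_embedding(model, "word1 word2")
## 'doc' should match 'manual' up to a normalisation factor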

thanks in advance

jwijffels commented 5 years ago

If you want embeddings of ngrams and your model is trained with ngram > 1, you should probably use starspace_embedding(model, 'word1 word2', type = "ngram"). The embeddings are also governed by the parameter p, which can be passed on to the starspace function and which defaults to 0.5. From the Starspace docs:
-p normalization parameter: we normalize the sum of embeddings by dividing by Size^p; when p = 1, it's equivalent to taking the average of embeddings; when p = 0, it's equivalent to taking the sum of embeddings. [0.5]
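
To illustrate that normalization parameter, here is a small self-contained sketch (using a made-up embedding matrix rather than a trained model): the sum of the word vectors is divided by Size^p, where Size is the number of embeddings being combined.

## toy example of the Size^p normalisation described above
emb <- rbind(word1 = c(0.1, 0.2, 0.3),
             word2 = c(0.4, 0.5, 0.6))
size <- nrow(emb)
colSums(emb) / size^0    # p = 0  : plain sum
colSums(emb) / size^0.5  # p = 0.5: the default
colSums(emb) / size^1    # p = 1  : average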

fzhang612 commented 5 years ago

Thanks for the response. However, I am now confused about the difference between starspace_embedding(model, 'word1 word2', type = 'document') and starspace_embedding(model, 'word1 word2', type = 'ngram'). If the latter is the embedding of the bigram word1_word2 in a model trained with ngrams = 2, what does the former represent and how is it calculated? Thanks.

jwijffels commented 5 years ago

Did you check by dividing your embedding summation by Size^p as I indicated? Size is in your case 2, as you have 2 words, and p is by default 0.5. That is what you get if you specify type = "document". If you specify type = "ngram", starspace uses the hashing trick from fastText to find out in which bucket the ngram lies and then retrieves the embedding of that bucket. You can inspect the C++ code for that.

fzhang612 commented 5 years ago

Yes, I did divide the embedding summation by Size^p. Let me rephrase my question more clearly.

If the model is trained with similarity = 'dot', p = 0.5, ngrams = 1, then the following holds: starspace_embedding(model, 'word_1 word_2', type = 'document') = (as.matrix(model)['word_1', ] + as.matrix(model)['word_2', ]) / sqrt(2)

However, if the model is trained with ngrams = 2, keeping all other parameters the same, the above equation no longer holds.

What am I missing about the difference between an ngrams = 1 model and an ngrams = 2 model?

Thanks

jwijffels commented 5 years ago

In short: for the unigram words, the embeddings are retrieved directly from the dictionary; for the bigram (if the model was trained with ngrams > 1), Starspace retrieves the embedding of the corresponding hashed bucket. That is, Starspace only stores embeddings of single words, not of bigrams; bigrams or ngrams are mapped to hashed combinations of the words that make up the ngram.
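
A quick way to see this from R (a sketch, assuming model was trained with ngrams = 2 as in the examples below): the embedding matrix returned by as.matrix() only has rows for single words and labels, while the bigram embedding has to be requested with type = "ngram", which triggers the hashed-bucket lookup inside the C++ code.

## sketch: the dictionary holds no row for the bigram itself
emb <- as.matrix(model)
head(rownames(emb))                           # single words and labels only
c("federale", "politie") %in% rownames(emb)   # the unigrams are in the dictionary
## the bigram embedding comes from a hashed bucket, not from the dictionary
starspace_embedding(model, "federale politie", type = "ngram")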

jwijffels commented 5 years ago

Nevertheless, although these hashed combinations cannot be reproduced from R without touching some C++ code, I tried the following experiments. In the last example I hoped that taking the ngram embedding of the bigram and adding it to the embeddings of the unigrams would give the same result as the document embedding, but apparently that is not what happens. Maybe it would be good to ask this to the Starspace authors themselves.

library(ruimtehol)
data(dekamer, package = "ruimtehol")
## basic cleaning: separate digits glued to sentence-ending dots, split on
## non-word characters, drop empty tokens and paste the tokens back together
dekamer$text <- gsub("\\.([[:digit:]]+)\\.", ". \\1.", x = dekamer$question)
dekamer$text <- strsplit(dekamer$text, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

## experiment 1: similarity = "dot", ngram = 1, p = 0.5
## expectation: document embedding = sum of the word embeddings divided by Size^p = 2^0.5
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 1, p = 0.5,
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5

## experiment 2: similarity = "cosine", ngram = 1
## expectation: document embedding = sum of the word embeddings scaled to unit length
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "cosine",
                        early_stopping = 0.8, ngram = 1, 
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding

## experiment 3: similarity = "dot", ngram = 2, p = 0 (plain sum, no normalisation)
## expectation: document embedding = unigram embeddings + hashed bigram embedding
## this does not work as expected - it really makes sense to ask the Starspace authors
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 2, p = 0,
                        dim = 10, minCount = 5)
emb_doc <- starspace_embedding(model, "federale politie", type = "document")
emb_ngram <- starspace_embedding(model, "federale politie", type = "ngram")
embedding_dictionary <- as.matrix(model)
emb_doc
manual <- rbind(embedding_dictionary[c("federale", "politie"), ], 
                emb_ngram)
colSums(manual)
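
One extra diagnostic that could be run at this point (just a guess at what the document embedding might be doing, not a confirmed behaviour): compare emb_doc against the sum of the unigram embeddings alone, to see whether the document embedding perhaps ignores the hashed bigram bucket altogether.

## additional check (hypothetical): document embedding versus unigrams only
colSums(embedding_dictionary[c("federale", "politie"), ])
emb_doc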