bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

Word embeddings #37

Closed guivivi closed 3 years ago

guivivi commented 3 years ago

Hi Jan, just a quick question; maybe it's too basic, but I'd like to be sure of the answer. It is about word embeddings and the typical example of london = paris - france + uk + england.

With your package, would the following be the right approach (assuming that x contains the data)?

set.seed(123)
model <- embed_wordspace(x, early_stopping = 0.9, dim = 15, ws = 7, epoch = 10, minCount = 1, ngrams = 1)
plot(model)
word_vectors <- as.matrix(model)

mostsimilar <- embedding_similarity(word_vectors,
                                    word_vectors["paris", ] + word_vectors["france", ] + word_vectors["uk", ] + word_vectors["england", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)

jwijffels commented 3 years ago

yes, does it work? do you get london on your corpus? is london in your corpus?
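
You can check that last part directly on the row names of your embedding matrix, e.g. (a quick sketch, assuming word_vectors is the matrix from your snippet):

## check whether the words of the analogy survived tokenisation and minCount
c("london", "paris", "france", "uk", "england") %in% rownames(word_vectors)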

jwijffels commented 3 years ago

It's the same logic as shown in the slides at https://www.bnosac.be/index.php/blog/100-word2vec-in-r, where it is demonstrated for wv["king", ] - wv["man", ] + wv["woman", ], which gives queen.
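
With the embeddings from your snippet, that example would look roughly as follows (just a sketch; it assumes king, man and woman made it into the vocabulary of your model):

word_vectors <- as.matrix(model)
## analogy: king - man + woman should land near queen
target      <- word_vectors["king", ] - word_vectors["man", ] + word_vectors["woman", ]
mostsimilar <- embedding_similarity(word_vectors, target)
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)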

guivivi commented 3 years ago

Hi Jan, thank you for the answer and the slides; this clarifies my main doubt. Regarding actual results, before using my own data I was trying to learn with the data from https://quanteda.io/articles/pkgdown/replication/text2vec.html, but it looks like the data is too large.

library(quanteda)
library(ruimtehol)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
wiki_toks <- tokens(wiki_corp)
wiki_toks_char <- as.character(wiki_toks)
set.seed(123)
model <- embed_wordspace(wiki_toks_char, early_stopping = 0.9, dim = 15, ws = 7, epoch = 5,
                         minCount = 5, ngrams = 1, maxTrainTime = 2)
plot(model)
word_vectors <- as.matrix(model)

input_word <- word_vectors["paris", ] + word_vectors["france", ] + word_vectors["uk", ] + word_vectors["england", ]
mostsimilar <- embedding_similarity(word_vectors, input_word)
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)

 preying communitarian     allegheny expropriation  orthographic      funneled    netherland    plancherel 
0.8496942     0.8458253     0.8426695     0.8258512     0.8235179     0.8120142     0.8074407     0.8054920 
squeezing          cogs 
0.7989219     0.7974576 
jwijffels commented 3 years ago

It looks like that data contains only 1 document. What you basically did now is split it into individual words and pass those to the Starspace model, so each word comes in without any context. I really wonder if it can even train anything that way. Maybe you could just collapse the text into one string, use early_stopping = 1 as you only have 1 document, and make it train longer than the 2 seconds you specified.

e.g. like this (I haven't looked at it in detail, but you get the point)

library(quanteda)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
wiki_toks <- tokens(wiki_corp)
wiki_toks_char <- as.character(wiki_toks)
wiki_toks_char <- paste(wiki_toks_char, collapse = " ")

library(ruimtehol)
set.seed(123)
model <- embed_wordspace(wiki_toks_char, early_stopping = 1, dim = 50, ws = 7, epoch = 20,
                         minCount = 5, ngrams = 1, thread = 10)
plot(model)
word_vectors <- as.matrix(model)
> input_word <- word_vectors["paris", ] - word_vectors["france", ] + word_vectors["germany",  ]
> mostsimilar <- embedding_similarity(word_vectors, input_word)
> head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
     paris     berlin     munich     vienna    germany     moscow  magdeburg     erfurt    bourget ingolstadt 
 0.8610147  0.7344087  0.7224967  0.7036711  0.6924272  0.6498813  0.6333962  0.6323813  0.6121131  0.6058791 
> input_word <- word_vectors["paris",  ] - word_vectors["france", ] + word_vectors["uk", ]
> mostsimilar <- embedding_similarity(word_vectors, input_word)
> head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
       uk    london     paris   toronto     opens    munich        us     tokyo    sydney  ramstein 
0.8251192 0.6786995 0.6572176 0.5852645 0.5624896 0.5376666 0.5343892 0.5289714 0.5259282 0.5213474 
jwijffels commented 3 years ago

If you are just interested in word vectors, you can also use the word2vec R package:

library(word2vec)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
txt   <- quanteda::texts(wiki_corp)
txt   <- txt_clean_word2vec(txt)
## word2vec accepts either a character vector of length > 1 or a path to a file;
## since there seems to be only one document in that data, write it to a file and pass the file path
writeLines(txt, "traindata.txt")
set.seed(987654321)
model <- word2vec("traindata.txt", dim = 50, min_count = 5, type = "cbow", window = 7, iter = 10, threads = 1)
predict(model, c("paris", "france", "london"), type = "nearest", top_n = 5)
$paris
   term1       term2 similarity rank
1  paris universelle  0.8686339    1
2  paris montpellier  0.8484883    2
3  paris       basel  0.8481658    3
4  paris       troja  0.8445289    4
5  paris      venice  0.8419283    5
6  paris      vienna  0.8377650    6
7  paris     leipzig  0.8351578    7
8  paris      leuven  0.8329087    8
9  paris      bruges  0.8318095    9
10 paris        lyon  0.8313862   10

$france
    term1       term2 similarity rank
1  france       spain  0.9140415    1
2  france     belgium  0.9121987    2
3  france netherlands  0.8901936    3
4  france    portugal  0.8820095    4
5  france       italy  0.8812329    5
6  france     austria  0.8533272    6
7  france     germany  0.8522602    7
8  france  luxembourg  0.8416032    8
9  france     hungary  0.8414242    9
10 france   huguenots  0.8411627   10

$london
    term1           term2 similarity rank
1  london         croydon  0.9011471    1
2  london       edinburgh  0.8949940    2
3  london         glasgow  0.8678643    3
4  london       guildhall  0.8607314    4
5  london      birmingham  0.8571822    5
6  london           leeds  0.8547540    6
7  london        highgate  0.8505970    7
8  london buckinghamshire  0.8490746    8
9  london          dublin  0.8462750    9
10 london     aberystwyth  0.8400701   10
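
The analogy itself can be done in the same way with the word2vec model, e.g. along these lines (a sketch only; it assumes paris, france and uk are in the vocabulary and that predict() also accepts an embedding vector when type = "nearest"):

## get the embeddings of the words involved and build the analogy vector
wv <- predict(model, newdata = c("paris", "france", "uk"), type = "embedding")
input_word <- wv["paris", ] - wv["france", ] + wv["uk", ]
## look up the words closest to that vector
predict(model, newdata = input_word, type = "nearest", top_n = 10)

If that doesn't suit you, you can always take the full embedding matrix with as.matrix(model) and compute the similarities yourself, as in the ruimtehol example above.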
guivivi commented 3 years ago

Many thanks for all the insights, Jan.