Yes. Does it work? Do you get london on your corpus? Is london in your corpus?
It's the same logic as shown in the slides at https://www.bnosac.be/index.php/blog/100-word2vec-in-r, where it is shown for wv["king", ] - wv["man", ] + wv["woman", ], which gives queen.
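In code that analogy boils down to something like this (a minimal sketch, assuming wv is the embedding matrix you get with as.matrix() from a trained embed_wordspace model and that king/man/woman made it into the vocabulary):
library(ruimtehol)
## wv: embedding matrix with one row per word, e.g. wv <- as.matrix(model)
analogy <- wv["king", ] - wv["man", ] + wv["woman", ]
## rank all words by similarity to the combined vector; queen should rank near the top
similar <- embedding_similarity(wv, analogy)
head(sort(similar[, 1], decreasing = TRUE), 5)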
Hi Jan, thank you for the answer and the slides, that clarifies my main doubt. Regarding actual results, before using my own data I was trying to learn with the data from https://quanteda.io/articles/pkgdown/replication/text2vec.html, but it looks like the data is too large.
library(quanteda)
library(ruimtehol)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
wiki_toks <- tokens(wiki_corp)
wiki_toks_char <- as.character(wiki_toks)
set.seed(123)
model <- embed_wordspace(wiki_toks_char, early_stopping = 0.9, dim = 15, ws = 7, epoch = 5, minCount = 5, ngrams = 1, maxTrainTime = 2)
plot(model)
word_vectors <- as.matrix(model)
input_word <- word_vectors["paris", ] + word_vectors["france", ] + word_vectors["uk", ] + word_vectors["england", ]
mostsimilar <- embedding_similarity(word_vectors, input_word)
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
preying communitarian allegheny expropriation orthographic funneled netherland plancherel
0.8496942 0.8458253 0.8426695 0.8258512 0.8235179 0.8120142 0.8074407 0.8054920
squeezing cogs
0.7989219 0.7974576
It looks like what you have in that data is one document. What you basically did is split it into words and pass those on to the Starspace model as if each word occurs without any context. I really wonder if it can even train something then. Maybe you could just collapse the text, use early_stopping = 1 as you only have one document, and let it train longer than your specified 2 seconds to get something out.
E.g. like this (I haven't looked at it in detail, but you get the point):
library(quanteda)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
wiki_toks <- tokens(wiki_corp)
wiki_toks_char <- as.character(wiki_toks)
## collapse all tokens back into one big text so that words keep their context
wiki_toks_char <- paste(wiki_toks_char, collapse = " ")
library(ruimtehol)
set.seed(123)
## early_stopping = 1: use all data for training as there is only one document
model <- embed_wordspace(wiki_toks_char, early_stopping = 1, dim = 50, ws = 7, epoch = 20,
                         minCount = 5, ngrams = 1, thread = 10)
plot(model)
word_vectors <- as.matrix(model)
> input_word <- word_vectors["paris", ] - word_vectors["france", ] + word_vectors["germany", ]
> mostsimilar <- embedding_similarity(word_vectors, input_word)
> head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
paris berlin munich vienna germany moscow magdeburg erfurt bourget ingolstadt
0.8610147 0.7344087 0.7224967 0.7036711 0.6924272 0.6498813 0.6333962 0.6323813 0.6121131 0.6058791
> input_word <- word_vectors["paris", ] - word_vectors["france", ] + word_vectors["uk", ]
> mostsimilar <- embedding_similarity(word_vectors, input_word)
> head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
uk london paris toronto opens munich us tokyo sydney ramstein
0.8251192 0.6786995 0.6572176 0.5852645 0.5624896 0.5376666 0.5343892 0.5289714 0.5259282 0.5213474
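As a side note, the rows of word_vectors are named by word, so a quick check like the one below (a minimal sketch) shows whether words such as london actually made it past minCount before you do the analogy arithmetic:
## check which of the query words are in the embedding vocabulary
c("paris", "france", "uk", "london") %in% rownames(word_vectors)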
If you are just interested in word vectors, you can also use the word2vec R package:
library(word2vec)
wiki_corp <- quanteda.corpora::download(url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1")
txt <- quanteda::texts(wiki_corp)
txt <- txt_clean_word2vec(txt)
## word2vec accepts either a character vector of length > 1 or a path to a file;
## as there seems to be only one document in that data, write it to a file and pass the path
writeLines(txt, "traindata.txt")
set.seed(9876543210)
model <- word2vec("traindata.txt", dim = 50, min_count = 5, type = "cbow", window = 7, iter = 10, threads = 1)
predict(model, c("paris", "france", "london"), type = "nearest", top_n = 5)
$paris
term1 term2 similarity rank
1 paris universelle 0.8686339 1
2 paris montpellier 0.8484883 2
3 paris basel 0.8481658 3
4 paris troja 0.8445289 4
5 paris venice 0.8419283 5
6 paris vienna 0.8377650 6
7 paris leipzig 0.8351578 7
8 paris leuven 0.8329087 8
9 paris bruges 0.8318095 9
10 paris lyon 0.8313862 10
$france
term1 term2 similarity rank
1 france spain 0.9140415 1
2 france belgium 0.9121987 2
3 france netherlands 0.8901936 3
4 france portugal 0.8820095 4
5 france italy 0.8812329 5
6 france austria 0.8533272 6
7 france germany 0.8522602 7
8 france luxembourg 0.8416032 8
9 france hungary 0.8414242 9
10 france huguenots 0.8411627 10
$london
term1 term2 similarity rank
1 london croydon 0.9011471 1
2 london edinburgh 0.8949940 2
3 london glasgow 0.8678643 3
4 london guildhall 0.8607314 4
5 london birmingham 0.8571822 5
6 london leeds 0.8547540 6
7 london highgate 0.8505970 7
8 london buckinghamshire 0.8490746 8
9 london dublin 0.8462750 9
10 london aberystwyth 0.8400701 10
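If you want the paris - france + uk arithmetic with this word2vec model as well, one way (a minimal sketch in base R, assuming the model object from above and that the three words survived min_count) is to take the embedding matrix and rank all words by cosine similarity yourself:
## analogy arithmetic on the word2vec embeddings
emb <- as.matrix(model)                    ## words in rows, dimensions in columns
target <- emb["paris", ] - emb["france", ] + emb["uk", ]
## cosine similarity of every word to the combined vector
cosine <- as.vector(emb %*% target) / (sqrt(rowSums(emb^2)) * sqrt(sum(target^2)))
names(cosine) <- rownames(emb)
head(sort(cosine, decreasing = TRUE), 10)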
Many thanks for all the insights, Jan.
Hi Jan, just a quick question that is maybe too basic, but I'd like to be sure of the answer. It is about word embeddings, examining the typical example of london = paris - france + uk + england.
With your package, would the right approach be the following (assuming that x contains the data)?
set.seed(123)
model <- embed_wordspace(x, early_stopping = 0.9, dim = 15, ws = 7, epoch = 10, minCount = 1, ngrams = 1)
plot(model)
word_vectors <- as.matrix(model)
mostsimilar <- embedding_similarity(word_vectors, word_vectors["paris", ] + word_vectors["france", ] + word_vectors["uk", ] + word_vectors["england", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)