jenniewilliams closed this issue 3 years ago
You create new embeddings. As for the train-test split, it depends on your downstream task and on how you validate the embeddings on fresh data. In practice most people don't bother and train on all of the data.
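If you do want a lightweight check instead of a formal split, one option is to eyeball the nearest neighbours of a few seed words. A minimal sketch, assuming word_vectors has been built as in the code further down this thread (sim2 is a real text2vec function; the seed word "ethics" is purely illustrative):

# cosine similarity of one (illustrative) seed word against the whole vocabulary
seed <- word_vectors["ethics", , drop = FALSE]
sims <- sim2(x = word_vectors, y = seed, method = "cosine", norm = "l2")

# ten nearest neighbours; plausible neighbours suggest sane embeddings
head(sort(sims[, 1], decreasing = TRUE), 10)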
I have a (v x v) TCM which I put into a TfIdf model using TfIdf {text2vec} (v = vocabulary size). I then put the resulting matrix into a GloVe model and aggregate embeddings for each document. Are the resulting vectors tf-idf weighted GloVe embeddings with a context-window weighting? I am finding it hard to document what I have done, and I am trying to up-weight 'important' vocabulary before feeding the documents into a clustering algorithm later (a sketch of that aggregation step follows the code below).
I understand what TfIdf {text2vec} does to a (d x v) DTM (d = number of documents), but I am confused about what it means when applied to a (v x v) TCM.
When I run this code in R, am I fitting my data to pretrained embeddings, or am I learning my own embeddings? If the latter, do I need to hold out a train/test split of my corpus, or can I put the whole lot in, as the example suggests?
Many thanks, Jennie
library(text2vec)

it <- itoken(corpdf$text, preprocessor = tolower, tokenizer = word_tokenizer)

# create a vocabulary
vocab <- create_vocabulary(it)

# remove words that occur in only 1 document
vocabt <- prune_vocabulary(vocab, doc_count_min = 2L)
vectorizer <- vocab_vectorizer(vocabt)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# set up a new GloVe model (note: dim masks base::dim here)
dim <- 100
glove_model <- GloVe$new(rank = dim, x_max = 100, learning_rate = 0.05, alpha = 0.75)

# fit the model and get the word vectors
ethos_main <- glove_model$fit_transform(tcm, n_iter = 50L)

# get the context vectors
ethos_context <- glove_model$components
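On the pretrained question: GloVe$new() plus fit_transform() learns fresh embeddings from your own TCM; nothing pretrained is involved. The text2vec GloVe vignette adds one more step after this, summing the main and context vectors into the final word vectors:

# sum main and context vectors into the final word vectors,
# as the text2vec GloVe vignette suggests
word_vectors <- ethos_main + t(ethos_context)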
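On the tf-idf question: GloVe is normally fit on the raw, unweighted TCM, and the tf-idf weighting enters only when the word vectors are aggregated into document vectors. A minimal sketch of that aggregation, assuming the objects above (it, vectorizer, word_vectors); the names dtm_tfidf and doc_embeddings and the final row normalisation are my own choices, not part of the original code:

# build a (d x v) DTM over the same pruned vocabulary
dtm <- create_dtm(it, vectorizer)

# tf-idf weight the DTM so frequent-but-uninformative words count less
tfidf <- TfIdf$new()
dtm_tfidf <- tfidf$fit_transform(dtm)

# each document vector = tf-idf weighted sum of its word vectors
doc_embeddings <- as.matrix(dtm_tfidf %*% word_vectors)

# optional: L2-normalise rows before clustering (guarding against empty documents)
norms <- sqrt(rowSums(doc_embeddings^2))
doc_embeddings <- doc_embeddings / pmax(norms, 1e-12)

The rows of doc_embeddings can then go straight into a clusterer, e.g. kmeans(doc_embeddings, centers = k).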