dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Verification of what the R implementation is doing #330

Closed jenniewilliams closed 3 years ago

jenniewilliams commented 3 years ago

When I run this code in R, am I fitting my data to pretrained embeddings, or am I learning my own embeddings? If the latter, do I need to hold out a test/train split of my corpus, or can I put the whole lot in, as the example suggests?

Many thanks, Jennie

```r
library(text2vec)

it <- itoken(corpdf$text, preprocessor = tolower, tokenizer = word_tokenizer)

# create a vocabulary
vocab <- create_vocabulary(it)

# remove words that occur in only 1 document
vocabt <- prune_vocabulary(vocab, doc_count_min = 2L)
vectorizer <- vocab_vectorizer(vocabt)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# set up a new GloVe model
dim <- 100
glove_model <- GloVe$new(rank = dim, x_max = 100, learning_rate = 0.05, alpha = 0.75)

# fit the model and get the word vectors
ethos_main <- glove_model$fit_transform(tcm, n_iter = 50L)

# get the context vectors
ethos_context <- glove_model$components
```
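
For completeness: the text2vec documentation suggests taking the sum (or average) of the main and context matrices as the final word vectors. A minimal sketch of that step plus a cosine-similarity lookup with `sim2` follows; the query word `"ethics"` is purely illustrative.

```r
# combine main and context vectors (per the GloVe paper / text2vec docs);
# $components is rank x vocab, so transpose it first
word_vectors <- ethos_main + t(ethos_context)

# find the nearest neighbours of an illustrative query word
query <- word_vectors["ethics", , drop = FALSE]
cos_sim <- sim2(x = word_vectors, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)
```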

dselivanov commented 3 years ago

You create new embeddings. As for a train-test split, it depends on your downstream task and on whether you need to validate the embeddings on fresh data. Usually nobody bothers and just throws all the data in.
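
If validation on fresh data were needed, a minimal sketch of a holdout split might look like the following; `corpdf` is carried over from the question, and the 80/20 ratio is an arbitrary assumption.

```r
# a minimal sketch, assuming corpdf from the question: hold out 20% of
# documents and fit the vocabulary, TCM and GloVe on the rest
set.seed(42)
train_idx <- sample(nrow(corpdf), size = floor(0.8 * nrow(corpdf)))
it_train <- itoken(corpdf$text[train_idx],
                   preprocessor = tolower, tokenizer = word_tokenizer)
it_test  <- itoken(corpdf$text[-train_idx],
                   preprocessor = tolower, tokenizer = word_tokenizer)
# ...then repeat create_vocabulary / create_tcm / fit_transform on it_train
# and evaluate the resulting embeddings on the held-out documents
```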

jenniewilliams commented 3 years ago

I have a (v x v) TCM which I put into a TfIdf model (TfIdf {text2vec}). I then put the resulting matrix into a GloVe model and aggregate the embeddings for each document (v = vocabulary size). Are the resulting vectors tf-idf weighted GloVe embeddings with a context-window weighting? I am finding it hard to document what I have done; the goal is to up-weight 'important' vocabulary before feeding the vectors into a clustering algorithm later.

I understand what TfIdf {text2vec} does to a (d x v) DTM (d = number of documents), but I am confused about what it means when applied to a TCM.
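
For reference, a minimal sketch of the more common construction this question seems to be reaching for: fit GloVe on the raw TCM, fit TfIdf on the (d x v) DTM, and use the tf-idf weights to average the word vectors per document. Object names are carried over from the code above; the row normalization is an assumption, not the library's prescribed method.

```r
# assumes it, vectorizer, ethos_main and ethos_context from above
dtm <- create_dtm(it, vectorizer)          # (d x v) document-term matrix
tfidf <- TfIdf$new()
dtm_tfidf <- tfidf$fit_transform(dtm)      # (d x v) tf-idf weights

# word vectors as the sum of main and context matrices (v x rank)
word_vectors <- ethos_main + t(ethos_context)

# each document vector is the tf-idf weighted average of its word vectors;
# rows of word_vectors are aligned to the dtm columns by term name
doc_emb <- as.matrix(dtm_tfidf %*% word_vectors[colnames(dtm_tfidf), ])
doc_emb <- doc_emb / pmax(Matrix::rowSums(dtm_tfidf), 1e-12)
```

The rows of `doc_emb` could then go straight into a clustering algorithm such as `kmeans`.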