dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Question: TF-IDF model, proofread code if possible #299

Closed Liskutin closed 5 years ago

Liskutin commented 5 years ago

Hello, I am sorry if this is the wrong place to post a question.

I have two questions:

A) I am trying to compute text similarity using TF-IDF cosine similarity. I somehow managed it in the following code by reading the documentation.

However, I have a question. I have 17 speakers, each of whom speaks 20 times. I created a common vector space (vectorizer), projected DTM-speakerA and DTM-speakerB onto it (I did not do all 17 at the same time), and computed psim2 similarity. For the TF-IDF transformation, I created a DTM from all 17 speakers (in the common vector space), ran fit_transform with the TF-IDF model on it, and then applied $transform to the individual DTM-speakerA.

My question is whether my reasoning about fit_transform and $transform is correct, or whether the resulting weights will be wrong.
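
The fit-then-transform idea can be sanity-checked outside text2vec. The base-R sketch below uses made-up counts, and `fit_idf` and `apply_tfidf` are hypothetical helpers, not text2vec functions. The IDF here is the plain log(N / df), which may differ from the formula `TfIdf` uses internally. Under those assumptions, weights learned once from the full matrix and reused on a subset give the same rows as the full transform, while re-fitting on the subset alone gives different weights.

```r
# Toy document-term counts: 4 documents, 3 terms (made-up data).
dtm_all <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 3, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("peace", "trade", "war"))
)

# "Fit": learn one IDF weight per term from the whole corpus.
# Simplified formula (assumption), not necessarily what TfIdf computes.
fit_idf <- function(dtm) log(nrow(dtm) / colSums(dtm > 0))

# "Transform": L1-normalise each row, then scale each column by its IDF.
apply_tfidf <- function(dtm, idf) {
  tf <- dtm / rowSums(dtm)
  sweep(tf, 2, idf, `*`)
}

idf_all   <- fit_idf(dtm_all)
tfidf_all <- apply_tfidf(dtm_all, idf_all)

# Transforming a subset with the corpus-wide IDF reproduces the same rows.
subset_rows  <- c("doc1", "doc3")
dtm_subset   <- dtm_all[subset_rows, , drop = FALSE]
tfidf_subset <- apply_tfidf(dtm_subset, idf_all)
same_as_full <- isTRUE(all.equal(tfidf_subset,
                                 tfidf_all[subset_rows, , drop = FALSE]))

# Re-fitting on the subset alone yields different IDF weights.
idf_subset    <- fit_idf(dtm_subset)
refit_differs <- !isTRUE(all.equal(idf_subset, idf_all))

print(same_as_full)
print(refit_differs)
```

In this sketch, fitting once on the whole corpus and only transforming the per-speaker matrices keeps every document in the same weighted space.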

# Sorting my dataset, not important
corpus_raw$text_clean <- prep_fun(corpus_raw$stems)
corpus_raw <- corpus_raw %>% 
  arrange(country, year) %>% 
  select(year, country, text_clean) %>% filter(country %in% country_ID)

# Creating COMMON VECTOR SPACE FOR DOCUMENTS
# First step, create tokenizer & vocabulary from all documents !!!
it_all <- itoken(corpus_raw$text_clean,
                 tokenizer = tok_fun)
vocab = create_vocabulary(it_all, ngram = c(1L, 2L)) %>%
  prune_vocabulary(doc_proportion_max = 0.98)
# Second step, create Document Term Matrix (DTM) from the vocabulary
vectorizer = vocab_vectorizer(vocab)
dtm_all = create_dtm(it_all, vectorizer)

# TF-IDF Normalization - not all speakers are equal, not all words are equal!
# Define TF-IDF model - model is used to 'translate' other parts of text later on.
tfidf_model = TfIdf$new()
# Fit model to DTM_ALL data & Transform the data with fitted model
dtm_all_tfidf = fit_transform(dtm_all, tfidf_model)

# Alright, time to find text similarity with tf-idf
# I have already built the vocabulary and the tf-idf model from the whole corpus.
# What I do now is select the wanted countries from the corpus to find the distance between them
country_A <- corpus_raw %>% filter(country %in% 'CHN') %>%
  select(year, text_clean) %>% rename_at(vars("text_clean"), ~ "text_CHN")
country_B <- corpus_raw %>% filter(country %in% 'LVA') %>%
  select(year, text_clean, country) %>% rename_at(vars("text_clean"), ~ "text_LVA")
corpus_AB <- left_join(country_A, country_B, by = "year") %>% 
  drop_na()
# Now, I am going to define (itoken) the sets of documents to measure distance on
it_country_A <- itoken(corpus_AB$text_CHN,
                       tokenizer = tok_fun)
it_country_B <- itoken(corpus_AB$text_LVA,
                       tokenizer = tok_fun)
# Then, I am going to create document-term matrices for chosen country A and country B in the same VECTOR SPACE!!!!
dtm_country_A <- create_dtm(it_country_A, vectorizer)
dtm_country_B <- create_dtm(it_country_B, vectorizer)  
# Lastly, I am going to weight the DTMs of these countries with our fitted TF-IDF model.
dtm_country_A_tfidf <- tfidf_model$transform(dtm_country_A)
dtm_country_B_tfidf <- tfidf_model$transform(dtm_country_B)

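# Finally, let's call psim2 to find the speech similarity in a given year
# norm = 'l2' rescales rows to unit L2 length; 'none' is only a true cosine
# if the rows are already L2-normalised
similarity <- psim2(dtm_country_A_tfidf, dtm_country_B_tfidf, method = 'cosine', norm = 'l2')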

# Doing finishing touches - adding years and country to a data frame from corpus_AB
similarity_df <- cbind(Year = corpus_AB$year, Country = corpus_AB$country, data.frame(similarity))
names(similarity_df) <- make.names(names(similarity_df))
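
On the `norm` argument in the `psim2` call: a cosine similarity is a dot product of rows scaled to unit L2 length, so skipping the scaling only yields a cosine when the rows are already L2-normalised. If the TF-IDF rows are L1-normalised instead (worth checking `?TfIdf` for the default), the unscaled dot product is a different number. A base-R sketch with made-up vectors follows; `cosine_rows` and `l1_rows` are hypothetical helpers, not part of text2vec.

```r
# Two toy weight matrices with matching rows (made-up numbers).
a <- matrix(c(3, 4, 0,
              1, 0, 1), nrow = 2, byrow = TRUE)
b <- matrix(c(6, 8, 0,
              0, 2, 0), nrow = 2, byrow = TRUE)

# Row-wise (pairwise) cosine: dot product divided by both L2 norms.
cosine_rows <- function(x, y) {
  rowSums(x * y) / (sqrt(rowSums(x^2)) * sqrt(rowSums(y^2)))
}

# Scale every row to sum to 1 in absolute value (L1 normalisation).
l1_rows <- function(m) m / rowSums(abs(m))

# True cosine: row 1 of a and b point the same way, row 2 is orthogonal.
cos_true <- cosine_rows(a, b)

# Plain dot product of L1-scaled rows, i.e. no L2 scaling applied.
dot_l1 <- rowSums(l1_rows(a) * l1_rows(b))

# Cosine is scale-invariant, so L1-scaling first does not change it.
cos_after_l1 <- cosine_rows(l1_rows(a), l1_rows(b))

print(cos_true)
print(dot_l1)
```

Here the first pair of rows is parallel, so its cosine is 1, yet the unscaled dot product of the L1-normalised rows is 25/49.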

B) Speech similarity with word embeddings/RWMD: I uploaded a trimmed-down version of the code to Kaggle (that sped up my work with Kaggle's markup system; for heavy work I prefer my own IDE), so everyone can see the logic behind it and, hopefully, point out any mistakes.

If you are willing to take the time to go through it, both datasets (the original and the one with text preprocessing) can be downloaded from Kaggle (click on the data category to find my uploaded dataset).

LINK: https://www.kaggle.com/smooge/speech-similarity

dselivanov commented 5 years ago

Sorry, I don't have time to validate your code. If there is an issue with text2vec, you are welcome to open an issue here; if not, there are resources like Stack Overflow where you can ask for help.