bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

How to use textrank with tfidf #35

Closed: fahadshery closed this issue 5 years ago

fahadshery commented 5 years ago

Hi,

I have been following your tutorial here and am interested in finding out how to use your [textrank](https://CRAN.R-project.org/package=textrank) package once I am done building LDA models, as you mention in Use Case II.

Do you have any example code? That would be ideal.

Best,

jwijffels commented 5 years ago

The steps are:

  1. Extract keywords with textrank. Keywords are just sequences of words.
  2. Use txt_recode_ngram to recode those sequences of words to the textrank keywords (which consist of several words) and add the result as a column to the udpipe data.frame output. See ?txt_recode_ngram
  3. Use document_term_frequencies to calculate frequencies of the compound keywords you added. See ?document_term_frequencies
  4. Use document_term_matrix on the result of 3.
  5. Build an LDA model on 4.
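
A minimal, untested sketch of those steps (assuming `x` is the data.frame with the udpipe annotations of your text; object names and parameter values are just for illustration):

```r
library(udpipe)
library(textrank)
library(topicmodels)

## 1. Keywords: textrank on the lemmas, keeping only nouns/adjectives
keyw <- textrank_keywords(x$lemma, relevant = x$upos %in% c("NOUN", "ADJ"), ngram_max = 3)
keyw <- subset(keyw$keywords, ngram > 1 & freq > 1)

## 2. Recode sequences of words to the multi-word keywords
x$term <- txt_recode_ngram(x$lemma, compound = keyw$keyword, ngram = keyw$ngram)
x <- subset(x, !is.na(term))

## 3. + 4. Document/term frequencies and document/term matrix
dtf <- document_term_frequencies(x, document = "doc_id", term = "term")
dtm <- document_term_matrix(dtf)

## 5. Topic model on the recoded terms
m <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 42))
```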
fahadshery commented 5 years ago

Sorry, I already know how to build LDA models using textrank keywords. I want to understand how the textrank R package can help find the most relevant sentences in the documents of a certain (LDA) topic. In essence, I want to summarise topics using the most relevant sentences in each topic.

jwijffels commented 5 years ago

Ok, I get it. That's indeed an interesting way of summarising text, which I also prefer over using the LDA betas. If you want to extract the most relevant sentences for each LDA topic, you basically need to take the sentences of the documents assigned to each topic and run textrank on those sentences.
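
Something along these lines (a rough, untested sketch; it assumes `m` is the fitted LDA model, `dtm` the document/term matrix it was trained on with one document per doc/paragraph/sentence identifier, and `x` the udpipe annotation data.frame holding that identifier in a column `topic_level_id`; all names are illustrative):

```r
library(topicmodels)
library(textrank)

## Most likely topic per document (here a document is one doc/paragraph/sentence unit)
doc_topics <- data.frame(topic_level_id = as.integer(names(topics(m))),
                         topic = as.integer(topics(m)))

## Sentences of the documents assigned to topic 1
sent <- merge(unique(x[, c("topic_level_id", "sentence")]), doc_topics)
sent <- subset(sent, topic == 1)

## Relevant words (nouns/adjectives) of those sentences
words <- subset(x, upos %in% c("NOUN", "ADJ") & topic_level_id %in% sent$topic_level_id)

## Rank the sentences with textrank; the top ones are the summary of the topic
tr <- textrank_sentences(data = sent[, c("topic_level_id", "sentence")],
                         terminology = words[, c("topic_level_id", "lemma")])
summary(tr, n = 5, keep.sentence.order = TRUE)
```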

fahadshery commented 5 years ago

Thank you so much. The data is available in the textSummary package. This is what I did; could you check it out and give me feedback please?

```r
library(udpipe)
library(textrank)
library(dplyr)

## Extract keywords with textrank
keyw_rank <- textrank_keywords(verbatim_tokens$token, 
                               relevant = verbatim_tokens$upos %in% c("NOUN", "ADJ"),
                               ngram_max = 3)

## To find simple noun phrases, first convert the parts-of-speech tags to one-letter tags which can be used to identify phrases with regular expressions
verbatim_tokens$phrase_tag <- as_phrasemachine(verbatim_tokens$upos, type = "upos")

## phrase_tag can now be used with a regular expression to extract simple noun phrases:
keyw_nounphrases <- keywords_phrases(verbatim_tokens$phrase_tag, term = verbatim_tokens$token,
                                     pattern = "(A|N)*N(P+D*(A|N)*N)*", is_regex = TRUE, 
                                     detailed = FALSE)

## the above returns all phrases, so filter to ngram > 1 to keep only compound words
keyw_nounphrases <- keyw_nounphrases %>% filter(ngram > 1)

## Now recode terms to keywords 
verbatim_tokens$term <- verbatim_tokens$token
verbatim_tokens$term <- txt_recode_ngram(verbatim_tokens$term, 
                                         compound = keyw_rank$keywords$keyword, 
                                         ngram = keyw_rank$keywords$ngram)
verbatim_tokens$term <- txt_recode_ngram(verbatim_tokens$term, 
                                         compound = keyw_nounphrases$keyword, 
                                         ngram = keyw_nounphrases$ngram)

## Keep the keywords or plain nouns, set everything else to NA
verbatim_tokens$term <- ifelse(verbatim_tokens$upos %in% "NOUN", 
                               verbatim_tokens$term,
                               ifelse(verbatim_tokens$term %in% c(keyw_rank$keywords$keyword, keyw_nounphrases$keyword), 
                                      verbatim_tokens$term, NA))
## create a doc/paragraph/sentence level id (topic_level_id) to build the term frequencies on
verbatim_tokens$topic_level_id <- unique_identifier(verbatim_tokens, fields = c("doc_id", "paragraph_id", "sentence_id"))

## Build the document/term matrix
dtm <- document_term_frequencies(verbatim_tokens, document = "topic_level_id", term = "term")
dtm <- document_term_matrix(x = dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
```

Now building the LDA model:

```r
library(topicmodels)
m <- LDA(dtm, k = 5, method = "Gibbs", 
         control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
```

We can get topic predictions using `predict` on the same dtm and then extract the sentences belonging to a topic:

```r
topic1 <- predict(m, newdata = dtm, type = "topics") %>% 
  filter(topic == 1)

topic1 <- topic1 %>% 
  inner_join(verbatim_tokens) %>% 
  distinct(doc_id, sentence_id, sentence, .keep_all = TRUE) %>% 
  select(doc_id, paragraph_id, sentence_id, sentence, upos, lemma)
```

Now we can run textrank on these sentences to create a summary of this topic:

```r
## Create a sentence identifier which textrank will use
topic1$textrank_id <- unique_identifier(topic1, fields = c("doc_id", "paragraph_id", "sentence_id"))

## Terminology: the relevant words (nouns/adjectives) of each sentence
terminology <- topic1 %>% 
  filter(upos %in% c("NOUN", "ADJ")) %>% 
  select(textrank_id, lemma) %>% 
  distinct(textrank_id, lemma)

## Limit the number of candidate sentence pairs with the minhash algorithm
library(textreuse)

minhash <- minhash_generator(n = 1000, seed = 123456789) # "n" must be a multiple of "bands"

candidates <- textrank_candidates_lsh(x = terminology$lemma, 
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, 
                                      bands = 500)

tr <- textrank_sentences(data = unique(topic1[, c("textrank_id", "sentence")]), 
                         terminology = terminology,
                         textrank_candidates = candidates)

## Show the top 20 most important sentences, in sentence order
summary(tr, n = 20, keep.sentence.order = TRUE)
```
jwijffels commented 5 years ago

I think you are confusing the purpose of the issue tracker on GitHub. It is primarily for issue reporting and feature discussions. If you want other people to review your code, the platform to use is Stack Overflow.

fahadshery commented 5 years ago

thanks