bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Predicting topics seems inconsistent #38

Closed rdatasculptor closed 5 years ago

rdatasculptor commented 5 years ago

Thanks again for this brilliant package!

I ran into something peculiar. Maybe you know why this happens. When I try to predict topics of new documents using a trained LDA model, I don't get the same predictions every time I run the prediction script.

Following your example I tried this script:


library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")

## annotate the French reviews
ud_model <- udpipe_download_model(language = "french")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)

## build a document/term matrix on the nouns, one document per sentence
x$topic_level_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))
dtf <- subset(x, upos %in% c("NOUN"))
dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 5)
dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("appartement", "appart", "eter"))
dtm_clean <- dtm_remove_tfidf(dtm_clean, top = 50)
library(topicmodels)
training <- dtm_clean[1:400,]
newdata <- dtm_clean[401:403,]
m <- LDA(training, k = 4, method = "Gibbs", 
         control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
scoreslist <- list()
for (i in 1:10){
  scores <- predict(m, newdata = newdata, type = "topics", 
                    labels = c("labela", "labelb", "labelc", "xyz"))
  scoreslist[[i]] <- scores
}
scoreslist

Most of the probabilities are the same over and over again, but you will also notice that sometimes the probabilities differ. Isn't that peculiar? Wouldn't you expect exactly the same outcome every time you run the script?

When I follow the solution in this SO topic, using the posterior function directly instead of the udpipe predict function, the outcome does not seem to change when I run it several times:

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100,]
test <- AssociatedPress[149:150,]

train.lda <- LDA(train, 5)
scoreslist <- list()
for (i in 1:10){
  test.topics <- posterior(train.lda, test)
  scoreslist[[i]] <- test.topics[[2]]
}
scoreslist

Is there a difference in the way udpipe makes a document term matrix that causes this problem? Are you familiar with this problem, and do you know how to solve it? Much appreciated!

jwijffels commented 5 years ago

Can you use set.seed(123456789) before you call the LDA function and the predict function, and rerun? predict.LDA just uses posterior from the topicmodels package. If you run the code below, you'll see that the number of unique rows is the same as the number of rows of newdata. It's Gibbs sampling playing with your mind.

scoreslist <- list()
for (i in 1:10){
  set.seed(123456789)   # fix the RNG state before each prediction
  scores <- predict(m, newdata = newdata, type = "topics", 
                    labels = c("labela", "labelb", "labelc", "xyz"))
  scoreslist[[i]] <- scores
}
scoreslist
nrow(unique(data.table::rbindlist(scoreslist)))  # equals nrow(newdata): all runs identical
nrow(newdata)

Compare this to the example below and you'll see the same thing happening. The posterior with Gibbs sampling is different from the posterior with VEM.

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100,]
test <- AssociatedPress[149:150,]

train.lda <- LDA(train, 5, method = "Gibbs", 
                 control = list(nstart = 5, iter = 2000, best = TRUE, seed = 1:5))
scoreslist <- list()
for (i in 1:10){
  test.topics <- posterior(train.lda, test)
  scoreslist[[i]] <- as.data.frame(test.topics[[2]])
}
nrow(unique(data.table::rbindlist(scoreslist)))
nrow(test)

You were comparing an LDA model fitted with Gibbs sampling to an LDA model fitted with VEM: LDA(train, 5) in your second example uses the default VEM method, whose posterior is deterministic, while your udpipe example used Gibbs sampling.
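
To see the contrast in isolation, here's a minimal sketch on the same AssociatedPress split. That posterior() after a Gibbs fit resamples on every call is exactly the behaviour discussed above; treat the exact control settings as an illustration.

library(topicmodels)
data(AssociatedPress)
train <- AssociatedPress[1:100,]
test <- AssociatedPress[149:150,]

## VEM (the default): the posterior is deterministic, repeated calls agree
m_vem <- LDA(train, k = 5)
identical(posterior(m_vem, test)$topics,
          posterior(m_vem, test)$topics)    # TRUE

## Gibbs: the posterior is obtained by sampling, so repeated calls
## generally differ unless you call set.seed() before each one
m_gibbs <- LDA(train, k = 5, method = "Gibbs",
               control = list(burnin = 500, iter = 1000, seed = 1))
identical(posterior(m_gibbs, test)$topics,
          posterior(m_gibbs, test)$topics)  # typically FALSE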

rdatasculptor commented 5 years ago

Mystery solved! Thank you so much for your time and patience.

Kind regards, Jelle

jwijffels commented 5 years ago

@rdatasculptor On another note, I see that you are using LDA. Have you also tried BTM (https://github.com/bnosac/BTM)? It does topic modelling for short texts. A sketch of what that could look like is shown below.
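
A minimal sketch on the nouns from your script above. The input layout for BTM() — a data.frame with the document id in the first column and the tokens in the second — is based on the BTM README, so double-check the arguments there.

library(BTM)
## one row per token, document id first, as in the udpipe annotation above
btm_data <- subset(x, upos %in% "NOUN")
btm_data <- btm_data[, c("doc_id", "lemma")]
set.seed(123456789)                       # BTM also uses Gibbs sampling
btm_model <- BTM(btm_data, k = 4, beta = 0.01, iter = 1000)
predict(btm_model, newdata = btm_data)    # document/topic probabilities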

rdatasculptor commented 5 years ago

I was not aware of this package yet. I will definitely try it out. Thanks for mentioning it!

rdatasculptor commented 5 years ago

I wanted to give BTM a try, but wasn't able to install it from CRAN. Installation with the help of devtools didn't help either. On my current machine I am (unfortunately) not allowed to install additional tools. Do you know another way?

Following https://github.com/bnosac/BTM/issues/2, I tried install.packages("BTM", repos = "http://www.datatailor.be/rcube", type = "source"). The error I got was: package ‘BTM’ is not available (for R version 3.5.1).

jwijffels commented 5 years ago

The package has been submitted to CRAN but is still waiting in the CRAN queue. I've also put the R package on our local CRAN repo, so you can install it as follows: install.packages("BTM", repos = "http://www.datatailor.be/rcube", type = "source"). But for this you need to have Rtools if you are on Windows. Otherwise, wait 1-2 weeks until CRAN approves the package.

rdatasculptor commented 5 years ago

That version causes this error: Warning in install.packages : package ‘BTM’ is not available (for R version 3.5.1)

Is it best to wait for the CRAN version?

jwijffels commented 5 years ago

Ah, I see. I must have erroneously removed it from our CRAN repo yesterday. Normally it will be on CRAN within 1 week, if CRAN is still processing submissions in this period of the year.

rdatasculptor commented 5 years ago

Will you put it back in your CRAN repo? Otherwise I will try to wait patiently :+1:

jwijffels commented 5 years ago

FYI: the BTM package is on CRAN as of today (https://CRAN.R-project.org/package=BTM), so a plain install.packages("BTM") should now work.

rdatasculptor commented 5 years ago

Thanks! I will definitely give it a try as soon as possible.