Hi, Tom. Indeed textmineR supports this! You can pass your stopword list to the stopword_vec argument in CreateDtm or CreateTcm. If, for some reason, you don't want any stopwords removed, you can pass an empty vector, i.e., stopword_vec = c().
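For the Urdu case, a minimal sketch might look like the following. (mydocuments and urdu_stopwords are placeholder names for illustration, not part of textmineR; only stopword_vec and doc_vec are real CreateDtm arguments.)

library(textmineR)

# urdu_stopwords: the NLTK-derived list from the question below
urdu_stopwords <- c("ai", "ayi", "hy", "hai")  # ...rest of the list

# pass the custom list to stopword_vec; append English stopwords
# if you also want those removed
dtm <- CreateDtm(doc_vec = mydocuments,
                 stopword_vec = c(urdu_stopwords, stopwords::stopwords("en")))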
On Mon, Mar 15, 2021 at 2:07 PM theiman112860 wrote:
Hi Tommy,
Is it possible to use a custom stop word list? I am doing some analysis on Romanized Urdu and found a custom stop word list in the Python-based NLTK package. If so, how would I do it? I can do it using the tm package like so:
# obtained a list of Urdu stopwords from the python NLTK package
stopwords <- c('ai', 'ayi', 'hy', 'hai', 'main', 'ki', 'tha', 'koi', 'ko', 'sy', 'woh ', 'bhi',
               'aur', 'wo', 'yeh', 'rha', 'hota', 'ho', 'ga', 'ka', 'le', 'lye ', 'kr', 'kar',
               'lye', 'liye', 'hotay', 'waisay', 'gya', 'gaya', 'kch', 'ab', 'thy', 'thay',
               'houn', 'hain', 'han', 'to', 'is', 'hi', 'jo', 'kya', 'thi', 'se', 'pe', 'phr',
               'wala', 'waisay', 'us', 'na', 'ny', 'hun', 'rha', 'raha', 'ja', 'rahay', 'abi',
               'uski', 'ne', 'haan', 'acha', 'nai', 'sent', 'photo', 'you', 'kafi', 'gai',
               'rhy', 'kuch', 'jata', 'aye', 'ya', 'dono', 'hoa', 'aese', 'de', 'wohi', 'jati',
               'jb', 'krta', 'lg', 'rahi', 'hui', 'karna', 'krna', 'gi', 'hova', 'yehi',
               'jana', 'jye', 'chal', 'mil', 'tu', 'hum', 'par', 'hay', 'kis', 'sb', 'gy',
               'dain', 'krny', 'tou')
stopwords <- c(stopwords, stopwords())
doc1 <- tm_map(doc1, removeWords, stopwords)
Any thoughts? Your textmineR package is my go-to topic modeling package for all of the government work I have done. Thank you!
Sincerely,
tom
Hi Tommy,
Thank you!! That worked perfectly!! Dumb question for you: how do I assign the topics to all of the documents in the original dataset? How are you doing?
Thank you again!!
Sincerely,
tom
You've got a few options.

Option 1: the theta object in your model contains topics for each document used in training

dtm <- CreateDtm(mydocuments)

model <- FitLdaModel(dtm = dtm,
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05)

View(model$theta) # document topic distributions here
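If "assigning topics" means one label per document, here is a short sketch (not from the thread; it simply takes the most probable topic in each row of theta):

# theta is a documents-by-topics matrix whose rows sum to 1;
# which.max picks the most probable topic for each training document
top_topic <- colnames(model$theta)[apply(model$theta, 1, which.max)]
assignments <- data.frame(document = rownames(model$theta),
                          top_topic = top_topic)
head(assignments)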
Option 2: Use the predict method

The below options work for any documents, whether you used them for training or not.

2a: Prediction with Gibbs sampling

Gibbs sampling is what is used to fit an LDA model. You can use the same process to predict topics for new (or old) documents, while keeping word-topic distributions fixed.
dtm2 <- CreateDtm(newdocs)
# model is as above from FitLdaModel()
preds <- predict(model, dtm2, method = "gibbs", iterations = 200)
View(preds)
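As an aside, the gibbs method should also accept a burnin argument like FitLdaModel does; treat this as an assumption and confirm with ?predict.lda_topic_model:

# assumption: only post-burnin samples are averaged, as in FitLdaModel
preds <- predict(model, dtm2, method = "gibbs", iterations = 200, burnin = 180)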
2b: Predict using the dot product method

You might want to use the dot product method if you have a lot of documents. It's faster than Gibbs, but also creates noisier predictions.
dtm2 <- CreateDtm(newdocs)
# model is as above from FitLdaModel()
preds <- predict(model, dtm2, method = "dot")
View(preds)
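With either method, preds has the same documents-by-topics shape as theta, so the same which.max trick from Option 1 labels the new documents (a sketch, assuming the row and column names carry over from the DTM and model):

new_topics <- colnames(preds)[apply(preds, 1, which.max)]
names(new_topics) <- rownames(preds)
head(new_topics)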
Hi Tommy,
Thank you!!
Sincerely,
tom
Awesome. Glad to help.