TommyJones / textmineR

An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly formatted input and give similarly formatted output. Also includes additional functionality for analysis and diagnostics of topic models.

custom stop word list #82

Closed · theiman112860 closed this issue 3 years ago

theiman112860 commented 3 years ago

Hi Tommy,

Is it possible to use a custom stop word list? I am doing some analysis on Romanized Urdu and found a custom stop word list from the Python-based NLTK package. If so, how would I do it? I can do it using the tm package like so:

# obtained a list of Urdu stopwords from the Python NLTK package
stopwords <- c('ai', 'ayi', 'hy', 'hai', 'main', 'ki', 'tha', 'koi', 'ko', 'sy',
               'woh', 'bhi', 'aur', 'wo', 'yeh', 'rha', 'hota', 'ho', 'ga', 'ka',
               'le', 'lye', 'kr', 'kar', 'liye', 'hotay', 'waisay', 'gya', 'gaya',
               'kch', 'ab', 'thy', 'thay', 'houn', 'hain', 'han', 'to', 'is', 'hi',
               'jo', 'kya', 'thi', 'se', 'pe', 'phr', 'wala', 'us', 'na', 'ny',
               'hun', 'raha', 'ja', 'rahay', 'abi', 'uski', 'ne', 'haan', 'acha',
               'nai', 'sent', 'photo', 'you', 'kafi', 'gai', 'rhy', 'kuch', 'jata',
               'aye', 'ya', 'dono', 'hoa', 'aese', 'de', 'wohi', 'jati', 'jb',
               'krta', 'lg', 'rahi', 'hui', 'karna', 'krna', 'gi', 'hova', 'yehi',
               'jana', 'jye', 'chal', 'mil', 'tu', 'hum', 'par', 'hay', 'kis',
               'sb', 'gy', 'dain', 'krny', 'tou')

stopwords <- c(stopwords, stopwords()) # append tm's default English stopwords
doc1 <- tm_map(doc1, removeWords, stopwords)

Any thoughts? Your textmineR package is my go-to topic modeling package for all of the government work I have done. Thank you!

Sincerely,

tom

TommyJones commented 3 years ago

Hi, Tom. Indeed textmineR supports this! You can pass your stopword list to the stopword_vec argument in CreateDtm or CreateTcm. If, for some reason, you don't want any stopwords removed, you can pass an empty vector, i.e., stopword_vec = c().
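For example, a minimal sketch (assuming my_docs is a named character vector of your documents; the name is a placeholder, not from this thread):

library(textmineR)

# pass the custom Urdu list to stopword_vec; supplying your own vector
# replaces CreateDtm's default English stopword list
dtm <- CreateDtm(doc_vec = my_docs,
                 stopword_vec = stopwords, # the custom vector defined above
                 lower = TRUE)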


theiman112860 commented 3 years ago

Hi Tommy,

Thank you!! That worked perfectly!! Dumb question for you: how do I assign the topics to all of the documents in the original dataset? How are you doing? Thank you again!!

Sincerely,

tom


TommyJones commented 3 years ago

You've got a few options.

Option 1: the theta object in your model contains topic distributions for each document used in training

dtm <- CreateDtm(mydocuments)

model <- FitLdaModel(dtm = dtm, 
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05)

View(model$theta) # document topic distributions here
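
If you want a single topic label per document, rather than the full distribution, one simple approach (a base-R sketch, not a built-in textmineR feature) is to take the most probable topic from each row of theta:

# label each document with its most probable topic (argmax of each row)
doc_topics <- apply(model$theta, 1, which.max)
head(doc_topics)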

Option 2: Use the predict method

The options below work for any documents, whether or not you used them in training.

2a: Prediction with Gibbs sampling

Gibbs sampling is the same process used to fit an LDA model. You can use it to predict topics for new (or old) documents while keeping the word-topic distributions fixed.

dtm2 <- CreateDtm(newdocs)

# model is as above from FitLdaModel()

preds <- predict(model, dtm2, method = "gibbs", iterations = 200)

View(preds)

2b: Predict using the dot product method

You might want to use the dot product method if you have a lot of documents. It's faster than Gibbs, but also creates noisier predictions.

dtm2 <- CreateDtm(newdocs)

# model is as above from FitLdaModel()

preds <- predict(model, dtm2, method = "dot")

View(preds)
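
If the goal is one table covering every document in the original dataset, you can simply predict on the full DTM, or (a sketch, assuming model$theta and preds have the same k columns and distinct document names) stack the in-sample and out-of-sample results:

# combine in-sample theta with out-of-sample predictions into one matrix
all_theta <- rbind(model$theta, preds)
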
theiman112860 commented 3 years ago

Hi Tommy,

Thank you!!

Sincerely,

tom


TommyJones commented 3 years ago

Awesome. Glad to help.