bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0
209 stars, 33 forks

Rake not capturing most frequent keywords #66

Closed rboga closed 3 years ago

rboga commented 4 years ago

Hi, I have a million questions asked by consumers which I am analyzing. Close to 25% of them have to do with the phrase "item number". But when I run RAKE on them, that phrase does not appear in the list at all. Here is the code I am using. I can't figure out what is going wrong. Can you help?

anotated_df <- data.frame(udpipe_annotate(udmodel_english, data$question))

Using RAKE

stats <- keywords_rake(x = anotated_df, term = "lemma", group = "doc_id", relevant = anotated_df$upos %in% c("NOUN", "ADJ"))
head(subset(stats, freq > 3), 26)

rdatasculptor commented 4 years ago

I am not sure, but could this problem be related to the fact that these words are both nouns?

jwijffels commented 4 years ago

@rboga can you provide a reproducible example?

rboga commented 4 years ago

Hi

It's corporate data and I am unable to share it. Can you help me identify where the bug might be?

When I check my base data, 82868 questions out of 2 million contain the keyword "item number".

What puzzles me is that the output looks like the one below. The freq is so low. Something seems intrinsically wrong.

[image: image.png]

Code in R:

Base data file: "dat"

anotated <- udpipe_annotate(udmodel_english, dat$question)
anotated_df <- data.frame(anotated)

Sample annotated data: "anotated_df"

[image: image.png]

length(dat$question[grep("item number", dat$question)])

82868 questions have keyword item number in it

stats <- keywords_rake(x = anotated_df, term = "lemma", group = "id", relevant = anotated_df$upos %in% c("NOUN", "ADJ"))
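A note on the count above: it comes from a case-sensitive substring match on the raw question text, while the RAKE call works on lemmas of tokens tagged NOUN or ADJ, so the two numbers need not agree. The base R sketch below shows how the raw-text count shifts with case and plural forms. The `dat` data frame and its four questions are invented for illustration and are not the data from this thread.

```r
# Invented sample questions; the real data in this thread was not shared.
dat <- data.frame(
  question = c(
    "What is the item number for this chair?",
    "Item number missing on my receipt",
    "Where can I find item numbers online?",
    "Is this table in stock?"
  ),
  stringsAsFactors = FALSE
)

# Case-sensitive substring match; gives the same count as
# length(dat$question[grep("item number", dat$question)]).
n_exact <- sum(grepl("item number", dat$question, fixed = TRUE))

# Case-insensitive match also catches "Item number" at the start of a sentence.
n_nocase <- sum(grepl("item number", dat$question, ignore.case = TRUE))

# Whole-phrase match that excludes the plural "item numbers".
n_whole <- sum(grepl("\\bitem number\\b", dat$question,
                     ignore.case = TRUE, perl = TRUE))

print(c(exact = n_exact, nocase = n_nocase, whole = n_whole))
```

Comparing these counts with the `freq` column returned by `keywords_rake` on the same documents could help show whether the gap comes from the matching rules or from the tagging.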


jwijffels commented 4 years ago

@rboga If you want help, you need to provide a reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). If you want paid support for your data analysis, you can request it at http://bnosac.be/index.php/contact/get-in-touch. If you want free advice on how to do your analysis, the platform to ask such things is https://stackoverflow.com. If you want to report real bugs, you can do so here, but your chances of getting help are better if you provide a reproducible example showing exactly what you think is wrong.
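For readers in a similar situation who cannot share their data, a reproducible example can be built from a few invented sentences. The sketch below is one possible shape for it, not the maintainer's prescribed template. It hand-builds a small data frame with the `doc_id`, `lemma` and `upos` columns that `data.frame(udpipe_annotate(...))` produces, so no model download is needed; every row is made up. The `keywords_rake` call mirrors the one from the question and only runs if the udpipe package is installed.

```r
# Hand-built stand-in for data.frame(udpipe_annotate(...)).
# All tokens and tags below are invented for illustration.
anotated_df <- data.frame(
  doc_id = rep(c("q1", "q2", "q3"), times = c(5, 4, 5)),
  lemma  = c("what", "be", "the", "item", "number",
             "item", "number", "be", "wrong",
             "where", "be", "the", "item", "number"),
  upos   = c("PRON", "AUX", "DET", "NOUN", "NOUN",
             "NOUN", "NOUN", "AUX", "ADJ",
             "ADV", "AUX", "DET", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Same relevance filter as in the question: keep nouns and adjectives only.
is_relevant <- anotated_df$upos %in% c("NOUN", "ADJ")

# Run RAKE only when udpipe is available, so the script does not fail otherwise.
if (requireNamespace("udpipe", quietly = TRUE)) {
  stats <- udpipe::keywords_rake(
    x = anotated_df,
    term = "lemma",
    group = "doc_id",
    relevant = is_relevant
  )
  print(stats)
}
```

Posting a script like this together with the output actually observed and the output expected gives the maintainer something concrete to run and compare.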

jwijffels commented 3 years ago

I'm closing this issue as no reproducible example was provided.