bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

keywords_phrases output #41

Closed Djgamelin closed 5 years ago

Djgamelin commented 5 years ago

This isn't an issue, rather just a question! I've been exploring the package and love it, and perhaps this is an oversight by me, but is there a way to output the doc_id when using keywords_phrases?

For example, the code below returns phrases from 'x', but ideally I'd like to associate each phrase with the doc_id it was found in. I'm not an R expert, so apologies if I'm overlooking something.

s <- udpipe_annotate(udmodel_english, 'text data frame')
x <- data.frame(s)
Phrases <- keywords_phrases(x = x$phrase_tag, term = tolower(x$token), sep = " ", pattern = "(A|N)*N(P+D*(A|N)*N)*", is_regex = TRUE, detailed = TRUE)

jwijffels commented 5 years ago

You get the start and end positions of the keywords, so you can use them to index x. Or you could use the approach indicated here: https://github.com/bnosac/udpipe/issues/36#issuecomment-441965564
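As a rough illustration of the indexing idea, here is a minimal sketch. It assumes the brussels_reviews_anno example data shipped with udpipe (French subset) instead of the annotation from the question, and a simple noun-phrase pattern. The same_doc check is an addition for this sketch: the phrase detection runs over the whole token vector, so a match could in principle straddle two documents.

```r
library(udpipe)

# Assumption: use the example annotation bundled with udpipe, not the asker's data
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "fr")

# Recode the POS tags to the letter codes (A, N, P, D, ...) used in phrase patterns
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")

# detailed = TRUE returns one row per match, with start/end positions in x
phrases <- keywords_phrases(x = x$phrase_tag, term = tolower(x$token),
                            pattern = "(A|N)*N(P+D*(A|N)*N)*",
                            is_regex = TRUE, detailed = TRUE)

# start and end are row positions in x, so they can be used to look up doc_id
phrases$doc_id <- x$doc_id[phrases$start]

# Keep only phrases whose first and last token belong to the same document
same_doc <- phrases$doc_id == x$doc_id[phrases$end]
phrases <- phrases[same_doc, ]

head(phrases)
```

Running keywords_phrases separately per doc_id, as in the linked comment, avoids the cross-document check altogether.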

Djgamelin commented 5 years ago

Thanks! That was helpful and seems to do the trick (data.table solution)... any idea if something similar can work for keywords_rake?

jwijffels commented 5 years ago

Use txt_recode_ngram in that case

Djgamelin commented 5 years ago

Ah sorry, is there no way to output doc_id when using keywords_rake? This would make the process much simpler - I'm looking to provide users with the ability to select a keyword based on an ngram value and rake score, and then view the corresponding documents.

jwijffels commented 5 years ago

As said, use txt_recode_ngram together with the output of keywords_rake:

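library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
# RAKE keywords computed per document, using only nouns and adjectives
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id", 
                          relevant = x$xpos %in% c("NN", "JJ"), sep = "-")
head(keywords)
# Recode consecutive lemmas that form a keyword into a single multi-word term
x$term <- txt_recode_ngram(x$lemma, compound = keywords$keyword, ngram = keywords$ngram, sep = "-")
# Blank out every term that is not one of the RAKE keywords
x$term <- ifelse(!x$term %in% keywords$keyword, NA, x$term)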
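Once x$term holds the keywords, getting from a keyword to its documents is a matter of collecting the (doc_id, term) pairs. The following is only a sketch of one way to do that in base R. The names hits, lookup, selected and docs_for_keyword are hypothetical, and the ngram and rake thresholds are arbitrary example values.

```r
library(udpipe)

# Same setup as above, repeated so the sketch runs on its own
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"), sep = "-")
x$term <- txt_recode_ngram(x$lemma, compound = keywords$keyword,
                           ngram = keywords$ngram, sep = "-")
x$term <- ifelse(!x$term %in% keywords$keyword, NA, x$term)

# One row per (doc_id, keyword) pair that actually occurs in the data
hits <- unique(x[!is.na(x$term), c("doc_id", "term")])

# Attach the ngram length and RAKE score of each keyword
lookup <- merge(hits, keywords[, c("keyword", "ngram", "rake")],
                by.x = "term", by.y = "keyword")

# Hypothetical user selection: two-word keywords with a RAKE score of at least 1
selected <- subset(lookup, ngram == 2 & rake >= 1)

# Hypothetical helper: all documents in which a given keyword occurs
docs_for_keyword <- function(kw) unique(lookup$doc_id[lookup$term == kw])
```

The lookup table keeps the keyword-to-document mapping separate from the scores in keywords, so filtering on ngram or rake does not require re-running the annotation.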

Djgamelin commented 5 years ago

Perfect - exactly what I needed. Thank you for your time and patience! I greatly appreciate your help!