randomgambit closed 6 years ago
keywords_phrases looks for a sequence of words whose part-of-speech tags match the following pattern: "(A|N)*N(P+D*(A|N)*N)*". The pattern is a regular expression over one-letter tags. It matches text which optionally begins with one or several adjectives or nouns, followed by a noun. That noun can then be followed, zero or more times, by one or several prepositions, optional determiner(s), optionally one or several adjectives or nouns, and another noun.
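To make the pattern more concrete, here is a minimal base R sketch that checks a few hand-written tag sequences against it. The candidate strings are made up for illustration, and the anchors `^` and `$` are added only to test whole strings; this is not how keywords_phrases works internally, it just shows which tag sequences the pattern accepts.

```r
# One letter per word: A = adjective, N = noun, P = preposition, D = determiner
pattern <- "(A|N)*N(P+D*(A|N)*N)*"

# Hypothetical tag sequences, written by hand for this illustration
candidates <- c("N", "AN", "NN", "NPN", "ANPDAN", "A", "PN", "DN")

# Anchor the pattern so that the whole sequence has to match
is_noun_phrase <- grepl(paste0("^", pattern, "$"), candidates)

print(data.frame(tags = candidates, matches = is_noun_phrase))
# The first five sequences match; "A", "PN" and "DN" do not,
# because a match has to start with an adjective or noun and end in a noun.
```

So "ANPDAN" (for example adjective, noun, preposition, determiner, adjective, noun) is accepted as one phrase, while a lone adjective or anything starting with a preposition or determiner is not.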
Extracting these noun phrases is done in two steps: first recode the Parts of Speech tags to one-letter categories (such as A for adjective, N for noun, P for preposition and D for determiner) using the function as_phrasemachine, and next apply keywords_phrases to the recoded tags.
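library(udpipe)
# Annotated reviews shipped with udpipe; keep only the French ones
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
# Recode the POS tags to the one-letter categories used in the pattern
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")
# Find sequences of at most 4 tokens whose recoded tags match the pattern
nounphrases <- keywords_phrases(x$phrase_tag, term = x$token,
                                pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE,
                                ngram_max = 4,
                                detailed = TRUE)
head(nounphrases, 10)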
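If it helps to see how a match on the tags maps back to the words, the following base R sketch does it by hand on a toy sentence. The tokens and their tags are invented for illustration (they are not udpipe output), and the sketch only extracts the first, longest match, whereas keywords_phrases reports phrases across the whole data, limited by ngram_max.

```r
# Hypothetical toy sentence with hand-assigned one-letter tags (O = other)
tokens <- c("the", "quick", "fox", "of", "the", "old", "forest", "sleeps")
tags   <- c("D",   "A",     "N",   "P",  "D",   "A",   "N",      "O")

# Collapse the tags into one string: character i corresponds to token i
tag_string <- paste(tags, collapse = "")

# Locate the first match of the noun phrase pattern in the tag string
m <- regexpr("(A|N)*N(P+D*(A|N)*N)*", tag_string)
first <- as.integer(m)
last  <- first + attr(m, "match.length") - 1L

# Use the matched positions to pull out the corresponding words
phrase <- paste(tokens[first:last], collapse = " ")
print(phrase)
```

Because each tag is a single character, positions in the tag string line up with positions in the token vector, which is what makes this kind of lookup possible.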
thanks buddy!
Hi @jwijffels,
I was looking at your improved documentation and it looks really great.
Just a quick question if you have 2 min. In the keywords_phrases() function you are using the regex "(A|N)*N(P+D*(A|N)*N)*" without going into too much detail. Can you explain what it means exactly?