ambiguities in the documentation?

randomgambit commented 6 years ago

Hi @jwijffels ,

I was looking at your improved documentation and it looks really great.

Just a quick question if you have 2 min. In the keywords_phrases() function you are using the regex "(A|N)*N(P+D*(A|N)*N)*" without going into too much detail.

Can you explain what that means exactly?

jwijffels commented 6 years ago

keywords_phrases looks for a sequence of words which have the following pattern: "(A|N)*N(P+D*(A|N)*N)*". The pattern is a regular expression over the recoded tags which means: find all text which optionally begins with one or several adjectives or nouns, followed by a noun. That noun can optionally be followed, zero or more times, by one or more prepositions, optional determiner(s), optionally one or several adjectives or nouns, and then a noun.

Extracting these noun phrases is accomplished by recoding Parts of Speech tags to one of the following categories using the function as_phrasemachine:

A: adjective
C: coordinating conjunction
D: determiner
M: modifier of verb
N: noun or proper noun
P: preposition
O: other elements

And next applying keywords_phrases:

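library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
# Keep only the French reviews
x <- subset(brussels_reviews_anno, language %in% "fr")
# Recode the Parts of Speech tags to the one-letter categories listed above
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")
# Note: this pattern starts with (A|N)+ instead of (A|N)*, so at least one
# adjective or noun has to precede the first noun
nounphrases <- keywords_phrases(x$phrase_tag, term = x$token,
                                pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE,
                                ngram_max = 4,
                                detailed = TRUE)
head(nounphrases, 10)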
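
To get a feel for what the pattern matches, here is a small sketch in base R only (no udpipe needed). The sentence and its tags are made up by hand for illustration, and the code just runs the regular expression over the collapsed tag string. It shows the first, longest match only and does not try to reproduce how keywords_phrases enumerates candidate phrases internally.

```r
# Hypothetical toy sentence, tagged by hand with the one-letter categories
tokens <- c("the", "quick", "service", "of", "the", "hotel", "was", "great")
tags   <- c("D",   "A",     "N",       "P",  "D",   "N",     "O",   "A")

# One character per token, so character positions map back to token positions
tag_string <- paste(tags, collapse = "")   # "DANPDNOA"
pattern    <- "(A|N)*N(P+D*(A|N)*N)*"

# First match of the pattern in the tag string
m     <- regexpr(pattern, tag_string, perl = TRUE)
start <- as.integer(m)
end   <- start + attr(m, "match.length") - 1L

matched_tags   <- substr(tag_string, start, end)         # "ANPDN"
matched_phrase <- paste(tokens[start:end], collapse = " ")
print(matched_phrase)                                     # "quick service of the hotel"

# Difference between (A|N)* and (A|N)+ on a phrase made of a single noun
single_noun_star <- grepl("^(A|N)*N(P+D*(A|N)*N)*$", "N")  # TRUE
single_noun_plus <- grepl("^(A|N)+N(P+D*(A|N)*N)*$", "N")  # FALSE
```

The last two lines show why the choice between * and + matters: with (A|N)* a lone noun counts as a phrase, while with (A|N)+ at least two words are needed.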

randomgambit commented 6 years ago

thanks buddy!

bnosac / udpipe

ambiguities in the documentation? #21