bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

ambiguities in the documentation? #21

Closed randomgambit closed 6 years ago

randomgambit commented 6 years ago

Hi @jwijffels ,

I was looking at your improved documentation and it looks really great.

Just a quick question if you have 2 min. In the keywords_phrases() function you are using the regex "(A|N)*N(P+D*(A|N)*N)*" without going into too much detail.

Can you explain what exactly that means?

jwijffels commented 6 years ago

keywords_phrases looks for a sequence of words whose tags match the following pattern: "(A|N)*N(P+D*(A|N)*N)*". The pattern is a regular expression which means: find all text which optionally begins with one or several adjectives or nouns, followed by a noun. That noun can optionally be followed, any number of times, by a group made up of one or more prepositions, optional determiner(s), optionally one or several adjectives or nouns, and then a noun.
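
To make the pattern concrete, here is a minimal base R sketch that applies it to a made-up sentence. The tokens and their one-letter tags are invented for illustration, and this is not how keywords_phrases is implemented internally: it only shows the single longest stretch the regular expression matches, whereas keywords_phrases is also given an ngram_max limit and may report shorter sub-sequences.

```r
# Toy data (hypothetical): one letter per token, in the style of the
# recoded tags (A = adjective, N = noun, P = preposition, D = determiner,
# O = other).
tokens <- c("the", "big", "red", "house", "of", "the", "old", "mayor", "is", "nice")
tags   <- c("D",   "A",   "A",   "N",     "P",  "D",   "A",   "N",     "O",  "A")

pattern <- "(A|N)*N(P+D*(A|N)*N)*"

# Collapse the tags into one string so that character positions
# line up with token positions.
tag_string <- paste(tags, collapse = "")

# Locate every match of the pattern in the tag string.
m      <- gregexpr(pattern, tag_string)[[1]]
starts <- as.integer(m)
lens   <- attr(m, "match.length")

# Map each matched character range back to the tokens it covers.
phrases <- mapply(
  function(s, l) paste(tokens[s:(s + l - 1)], collapse = " "),
  starts, lens
)
print(phrases)
```

The leading determiner "the" is not part of the match because the pattern has to start with an adjective or a noun, while the determiner inside "of the old mayor" is allowed by the D* part of the optional group.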

Extracting these noun phrases is accomplished by first recoding the Parts of Speech tags to single-letter categories (such as A for adjective, N for noun, P for preposition and D for determiner) using the function as_phrasemachine,

and next applying keywords_phrases:

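library(udpipe)
# Annotated example data shipped with the package; keep the French reviews
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
# Recode the xpos tags to the one-letter categories used in the pattern
x$phrase_tag <- as_phrasemachine(x$xpos, type = "penn-treebank")
# Note: this pattern starts with (A|N)+ rather than (A|N)*, so it requires
# at least one adjective or noun in front of the head noun
nounphrases <- keywords_phrases(x$phrase_tag, term = x$token, 
                                pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE, 
                                ngram_max = 4,   # phrases of at most 4 words
                                detailed = TRUE) # report where each phrase was found
head(nounphrases, 10)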
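
The recoding step can be pictured with a small base R sketch. The function toy_recode and its mapping table are hypothetical and deliberately simplified; they are not the table used by as_phrasemachine, which covers more tags and categories. The sketch only illustrates the idea of turning each Parts of Speech tag into a single letter so that a sentence becomes a string a regular expression can scan.

```r
# Hypothetical, simplified recoding of universal POS tags to one-letter
# categories. This is NOT the mapping used by as_phrasemachine().
toy_recode <- function(upos) {
  map <- c(ADJ = "A", NOUN = "N", PROPN = "N", ADP = "P", DET = "D")
  out <- unname(map[upos])   # look up each tag; unknown tags give NA
  out[is.na(out)] <- "O"     # everything else becomes "other"
  out
}

upos <- c("DET", "ADJ", "NOUN", "ADP", "PROPN", "VERB")
recoded <- toy_recode(upos)
print(recoded)

# Collapsed into one string, ready for pattern matching
print(paste(recoded, collapse = ""))
```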

randomgambit commented 6 years ago

thanks buddy!