bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

signal words #29

Closed pgrandinetti closed 5 years ago

pgrandinetti commented 5 years ago

Is there a built-in way to locate signal words in udpipe? The problem is that they can consist of multiple tokens; see https://lincs.ed.gov/readingprofiles/PF_Signal_Words.htm

jwijffels commented 5 years ago

Not directly, but you can use the upos/xpos tags and morphological features to extract the words of interest, and use keywords_phrases to find multi-word expressions matching any tag sequence you like. Once you have the compound expressions consisting of multiple tokens, you can use txt_recode_ngram to add them to the data.frame. See the example in the help of txt_recode_ngram: ?txt_recode_ngram
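As a rough illustration of the last step, here is a minimal sketch of recoding multi-token signal words with txt_recode_ngram. The token vector and the signal-word list are made up for the example (assumptions, not something udpipe ships); in practice the tokens would come from the token column of the udpipe_annotate() output, and the list of compounds could come from keywords_phrases or from your own dictionary.

```r
library(udpipe)

# Hypothetical tokens; normally the `token` column of
# as.data.frame(udpipe_annotate(...)).
tokens <- c("As", "a", "result", ",", "sales", "fell", ".",
            "In", "addition", ",", "costs", "rose", ".")

# Illustrative signal-word list (an assumption, not part of udpipe).
signal_words <- c("as a result", "in addition")

# txt_recode_ngram needs the number of tokens in each compound.
ngram <- lengths(strsplit(signal_words, " "))

# Lowercase the tokens so they match the lowercase signal words, then
# collapse each matching token sequence into a single compound term.
recoded <- txt_recode_ngram(tolower(tokens),
                            compound = signal_words,
                            ngram = ngram,
                            sep = " ")

data.frame(token = tokens, signal = recoded)
```

Rows where the signal column holds one of the compounds mark where a signal word starts, so you can filter on that column to locate them in the text.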

jwijffels commented 5 years ago

FYI: see the links provided in #31 for documentation of all the upos/xpos tags, morphological features, and dependency relations, which you can use to construct whichever combination of features you like.