bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Ignore underscore when annotating #7

Closed alanault closed 6 years ago

alanault commented 6 years ago

Hi there,

Thanks for the package - it's great!

I'm using the package to annotate upos. However, in pre-processing I replace specific terms with placeholder tokens. They're marked with an underscore so we know they're not the original word, e.g. I love Nike > i love brand

However, when I run the annotation function, it processes the underscore as a symbol rather than treating the placeholder as a noun. Is there a way to make it ignore the underscores? I've read through the documentation but couldn't find anything. Many thanks, Alan

jwijffels commented 6 years ago

The solution to this is to do your preprocessing after the annotation.

alanault commented 6 years ago

Ah - that's a shame. My data is already pre-processed, so adding features after POS tagging isn't possible.

jwijffels commented 6 years ago

Why don't you do the following on your annotated data set:

x$upos <- ifelse(x$token %in% c("your", "list", "of", "brands"), "NOUN", x$upos)
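To make the one-liner above concrete, here is a minimal sketch on a toy data frame. The data frame is a hand-built stand-in for the annotation output (only the `token` and `upos` columns are used), and both the `placeholders` vector and the tags in it are made up for illustration; they are not taken from a real model run.

```r
# Hand-built stand-in for an annotated data set; a real one would come from
# as.data.frame() on the udpipe annotation result and have more columns.
x <- data.frame(
  doc_id = "doc1",
  token  = c("i", "love", "brand"),
  upos   = c("PRON", "VERB", "SYM"),  # illustrative tags, not real model output
  stringsAsFactors = FALSE
)

# Hypothetical list of placeholder tokens inserted during pre-processing.
placeholders <- c("brand", "product")

# Overwrite the tag only where the token is a known placeholder;
# all other rows keep the tag the model assigned.
x$upos <- ifelse(x$token %in% placeholders, "NOUN", x$upos)
```

Because `%in%` matches whole tokens exactly, the placeholder list has to contain the tokens as they appear after tokenization, not as they were written in the raw text.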

alanault commented 6 years ago

That's a nice (and simple) approach!

Interestingly, the list of generalised tokens (like brand) is actually quite small, so it's not too much of an issue on a large corpus.

The token is also split in two, the brand and a "_", and the latter is labelled as punctuation. Probably worth clearing these out at the same time.

Thanks for your help and time!
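For the clean-up step mentioned above, one possible sketch is to drop the rows whose token is a lone underscore. The data frame below is again a hand-built stand-in with made-up tags, and it assumes the stray token is exactly "_".

```r
# Hand-built stand-in for annotated output in which the placeholder was
# split into "brand" and a separate "_" token (tags are illustrative).
x <- data.frame(
  doc_id = "doc1",
  token  = c("i", "love", "brand", "_", "."),
  upos   = c("PRON", "VERB", "NOUN", "PUNCT", "PUNCT"),
  stringsAsFactors = FALSE
)

# Keep every row except those whose token is a lone underscore.
x_clean <- x[x$token != "_", , drop = FALSE]
rownames(x_clean) <- NULL
```

Note that the full annotation output also carries `token_id` and `head_token_id` columns, so removing rows this way is only safe if the dependency information is not used afterwards.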