bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Bylee as a lemma #32

Closed reisner closed 5 years ago

reisner commented 5 years ago

I see the word bylee coming up as a lemma for some pieces of text. Example:

> library(udpipe)
> udmodel = udpipe_download_model(language = "english")
> x <- udpipe(x = "bylaw 1234 - a bylaw is a thing", object = udmodel)
> x[1,]
  doc_id paragraph_id sentence_id                        sentence start end
1   doc1            1           1 bylaw 1234 - a bylaw is a thing     1   5
  term_id token_id token lemma upos xpos       feats head_token_id dep_rel deps
1       1        1 bylaw bylee NOUN   NN Number=Sing             8   nsubj <NA>
  misc
1 <NA>
jwijffels commented 5 years ago

To understand why this happens, you need to understand how UDPipe models are built. They are trained on data from Universal Dependencies (http://universaldependencies.org). By specifying english, you downloaded a model built on version 2.0 of the EWT treebank, which is the dataset described here: https://github.com/UniversalDependencies/UD_English-EWT. As documented there, it was collected from weblogs, newsgroups, emails, reviews, and Yahoo! Answers; the provenance of every treebank is listed in the same way. I really advise you to have a look at universaldependencies.org to check whether the treebank used to build the model was collected on data similar to yours.

Based on this data a model is constructed as follows (copied from the UDPipe paper):

A guesser produces (lemma rule, UPOS) pairs, where the lemma rule generates a lemma from a word by stripping some prefix and suffix and prepending and appending new prefix and suffix. To generate correct lemma rules, the guesser generates the results not only according to the last four characters of a word, but also using word prefix. Disambiguation is performed by an averaged perceptron tagger to see which of the rules should be used.
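The idea of a lemma rule can be illustrated with a toy sketch in base R. The function and the rule representation below are hypothetical simplifications for illustration, not the actual UDPipe internals:

```r
# Toy illustration of a lemma rule: strip a suffix and append a new one.
# Hypothetical simplification -- not the real UDPipe rule format.
apply_lemma_rule <- function(word, strip_suffix, append_suffix) {
  stem <- sub(paste0(strip_suffix, "$"), "", word)
  paste0(stem, append_suffix)
}

# A rule that could be learned from e.g. "flies" -> "fly":
apply_lemma_rule("flies", "ies", "y")    # "fly"

# The same mechanism applied with a mis-chosen rule to an unseen word
# illustrates how "bylaw" can come out as "bylee":
apply_lemma_rule("bylaws", "aws", "ee")  # "bylee"
```

The point is that the model picks among such rules statistically (via the averaged perceptron), so an unseen word can trigger a rule that produces a nonsense lemma.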

That's right, it's a statistical machine learning model, not a lookup table; lookup tables for lemmatisation are pretty bad performers. So nothing is perfect. You can even see the accuracy statistics for each part of the model (tokenisation / sentence demarcation / POS tagging / lemmatisation / dependency parsing); they are provided in a link in the README of this R package. For your case, lemmatisation on English EWT, you'll see 96-97% correctness on the terms of the test data of that Universal Dependencies dataset. The word bylaw, or the sentence containing it, might simply not occur in that training data. Either way, the model just applies the perceptron-disambiguated lemma rules, which in your case gives bylee. So if you want to improve lemmatisation beyond 96-97% correctness, either do your own extra post-processing step, or invest time in constructing a training dataset similar to those of Universal Dependencies, build a udpipe model on it, and use that model.
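One way to do such a post-processing step is to merge a small lookup table of known corrections over the annotated data frame. A minimal sketch in base R; the `token` and `lemma` column names match the udpipe output, but the corrections table and the stand-in data frame are hypothetical:

```r
# Hypothetical lookup table of known lemma corrections
corrections <- data.frame(token       = c("bylaw", "bylaws"),
                          lemma_fixed = c("bylaw", "bylaw"),
                          stringsAsFactors = FALSE)

# Stand-in for the data.frame returned by udpipe(), which has
# 'token' and 'lemma' columns among others
x <- data.frame(token = c("bylaw", "is"),
                lemma = c("bylee", "be"),
                stringsAsFactors = FALSE)

# Override the model's lemma wherever the token has a known correction
x <- merge(x, corrections, by = "token", all.x = TRUE, sort = FALSE)
x$lemma <- ifelse(!is.na(x$lemma_fixed), x$lemma_fixed, x$lemma)
x$lemma_fixed <- NULL
```

Note that `merge()` may reorder rows; if the original token order matters, keep `term_id` around and re-sort on it afterwards.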

reisner commented 5 years ago

@jwijffels Thanks for the info. I wasn't sure if there was room for adding lookup tables as a post-processing step over the learned model. I'll close the issue.