Closed matgrioni closed 7 years ago
Thanks for the report. I cannot figure out why this happened. It would be great if you could provide the whole sentence in which âManli appears, and tell me which language/model you are working with.
I am using the Latin model. Also, after grepping and looking into the issue more, it seems that many tokens receive a null POS, even seemingly normal tokens such as ". I've attached a file that I am running this on, which has multiple instances of null POS when run through the tagger. This file has been converted to ISO-8859-1 encoding for legacy reasons, but the same problem exists when it is in UTF-8, although I'm not sure whether the same tokens are affected.
This is an example with ": the quote that is tagged as null is the one before cedant. I've included the surrounding sentence in case the model uses context (I'm not sure).
nimis magna poena te consule constituta est sive malo poetae sive libero. 'scripsisti enim: "cedant arma togae."' quid tum? 'haec res tibi fluctus illos excitavit.'
Note that RDRPOSTagger requires a tokenized/word-segmented input corpus. You have to perform tokenization before POS tagging. You then also need to add the entry '' PUNCT to the file la_ittb-upos.DICT (here '' is two single quotation marks). This entry is important for RDRPOSTagger to work properly, but somehow the Universal Dependencies training data for Latin contains neither '' nor ".
After tokenizing and adding the entry '' PUNCT (you can also add entries for other missing punctuation marks if you want!), RDRPOSTagger will definitely work, e.g. on the sentence ' scripsisti enim : " cedant arma togae . " '
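Adding a missing lexicon entry can also be scripted. The following is only a minimal sketch: it assumes the .DICT file is UTF-8 text with one whitespace-separated "token TAG" pair per line (which is what the entry '' PUNCT suggests), and add_dict_entry is a hypothetical helper, not part of RDRPOSTagger.

```python
from pathlib import Path


def add_dict_entry(dict_path, token, tag, encoding="utf-8"):
    """Append "token tag" to a lexicon file unless the token is already listed.

    Assumption: one whitespace-separated "token TAG" pair per line.
    Returns True if a line was appended, False if the token was present.
    """
    path = Path(dict_path)
    text = path.read_text(encoding=encoding) if path.exists() else ""

    # Look for an existing entry whose first field is the token.
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == token:
            return False

    # Start the new entry on its own line even if the file lacks a final newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a", encoding=encoding) as handle:
        handle.write(f"{prefix}{token} {tag}\n")
    return True
```

For the case discussed here, the call would be add_dict_entry("la_ittb-upos.DICT", "''", "PUNCT"). Running it twice is harmless, because the second call finds the entry and leaves the file untouched.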
Sorry, I was unclear. The file I attached is the raw file. I tokenize it before it is sent to the POS tagger, adding a space to separate any punctuation from the next character. I assume the fix will work the same regardless.
Thanks!
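The whitespace-based tokenization mentioned in this exchange could look roughly like the sketch below. It is not RDRPOSTagger's own tokenizer (the tagger expects already-tokenized input). The function separate_punctuation is a hypothetical name, and treating every character that is neither a word character nor whitespace as a separate token is an assumption that may be too coarse for real corpora.

```python
import re

# Assumption: anything that is neither a word character nor whitespace
# is punctuation and becomes its own token.
_PUNCT = re.compile(r"([^\w\s])")


def separate_punctuation(text):
    """Surround each punctuation character with spaces, then collapse runs of whitespace."""
    spaced = _PUNCT.sub(r" \1 ", text)
    return " ".join(spaced.split())


if __name__ == "__main__":
    raw = "'scripsisti enim: \"cedant arma togae.\"'"
    print(separate_punctuation(raw))
```

On the quoted sentence from the report this yields the same token sequence as the tokenized example given earlier in the thread. Note that it splits every punctuation character individually, so a two-character mark such as '' would come out as two separate ' tokens.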
You can re-download RDRPOSTagger to run on the tokenized corpus. I just fixed the errors by adding the missing punctuation mark '' to the .DICT files. It should work with la_ittb-upos.DICT and la_ittb-upos.RDR.
Some tokens, such as âManli, are tagged as:
I couldn't find anything in the documentation about a null POS being returned in any case, so I figure this is undesirable behavior. In either case, the inclusion of the original token would be necessary.