Certain tokens receive null POS and token is not outputted

datquocnguyen / RDRPOSTagger

A fast and accurate POS and morphological tagging toolkit (EACL 2014)

http://rdrpostagger.sourceforge.net

Certain tokens receive null POS and token is not outputted #15

Closed matgrioni closed 7 years ago

matgrioni commented 7 years ago

Some tokens, such as âManli, are tagged as:

''/null

I couldn't find anything in the documentation about a null POS being returned in any case, so I assume this is unintended behavior. Either way, the original token should still be included in the output.

datquocnguyen commented 7 years ago

Thanks for the report. I cannot see any obvious reason for this. It would be great if you could provide the whole sentence where âManli appears, and tell me which language/model you are working with.

matgrioni commented 7 years ago

I am using the Latin model. After grepping and looking into the issue further, it seems that many tokens receive a null POS, even seemingly normal tokens such as ". I've attached the file I am running this on; it produces multiple instances of null POS when run through the tagger. The file has been converted to ISO-8859-1 encoding for legacy reasons, but the same problem occurs when it is in UTF-8, although I'm not sure whether the same tokens are affected.

Here is an example with ". The quote that is tagged as null is the one before cedant. I've included the surrounding sentences in case the model uses context (I'm not sure).

nimis magna poena te consule constituta est sive malo poetae sive libero. 'scripsisti enim: "cedant arma togae."' quid tum? 'haec res tibi fluctus illos excitavit.'

Against_Lucius_Calpurnius_Piso.txt
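
For anyone wanting to reproduce the grep, here is a minimal Python sketch that lists the tokens tagged null. It assumes the tagger output is whitespace-separated token/TAG pairs, as in the ''/null line above; find_null_tokens is a hypothetical helper and not part of RDRPOSTagger.

```python
def find_null_tokens(tagged_line):
    """Return the tokens in a tagged line whose tag is the literal string 'null'.

    Assumes whitespace-separated token/TAG pairs (an assumption based on the
    ''/null output shown in this thread).
    """
    null_tokens = []
    for pair in tagged_line.split():
        # Split on the last '/' so tokens that contain '/' themselves survive.
        token, sep, tag = pair.rpartition("/")
        if sep and tag == "null":
            null_tokens.append(token)
    return null_tokens


# Illustrative input only; the non-null tags here are made up.
sample = "''/null cedant/VERB arma/NOUN togae/NOUN ./PUNCT"
print(find_null_tokens(sample))
```

Running this over each line of the tagged file should show which input tokens are missing from the model's dictionary.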

datquocnguyen commented 7 years ago

Note that RDRPOSTagger requires tokenized (word-segmented) input, so you have to tokenize the corpus before POS tagging. You also need to add the entry '' PUNCT to the file la_ittb-upos.DICT (here '' is two single quotation marks). RDRPOSTagger needs this entry to work properly, but for some reason the Universal Dependencies training data for Latin contains neither '' nor ". After tokenizing and adding the entry '' PUNCT (you can also add entries for other missing punctuation marks if you want), RDRPOSTagger will work, e.g. on the sentence ' scripsisti enim : " cedant arma togae . " '
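
The two steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: tokenize is a naive, hypothetical punctuation splitter (not RDRPOSTagger's own tokenizer, which it does not provide), and add_missing_entries assumes the .DICT file holds one whitespace-separated "token TAG" pair per line, as the '' PUNCT entry suggests.

```python
import re

# Naive assumption: every one of these marks is a separate token.
# This would also split apostrophes inside words, which may not suit all texts.
_PUNCT = re.compile(r"""([.,:;!?"'()\[\]])""")


def tokenize(text):
    """Surround each punctuation mark with spaces, then split on whitespace."""
    return _PUNCT.sub(r" \1 ", text).split()


def add_missing_entries(dict_text, entries):
    """Append 'token TAG' lines for tokens not yet in the dictionary text.

    dict_text: contents of a .DICT file (assumed format: one 'token TAG' per line).
    entries:   mapping of token -> tag to ensure is present.
    """
    present = {line.split()[0] for line in dict_text.splitlines() if line.strip()}
    missing = [f"{tok} {tag}" for tok, tag in entries.items() if tok not in present]
    if not missing:
        return dict_text
    # Make sure the appended lines start on a fresh line.
    sep = "" if (not dict_text or dict_text.endswith("\n")) else "\n"
    return dict_text + sep + "\n".join(missing) + "\n"


print(" ".join(tokenize("'scripsisti enim: \"cedant arma togae.\"'")))
# Made-up dictionary content, for illustration only.
print(add_missing_entries("et CCONJ\n", {"''": "PUNCT"}))
```

To apply this to a real model, one would read la_ittb-upos.DICT into a string, pass it through add_missing_entries, and write the result back; keeping a backup of the original file first seems prudent.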

matgrioni commented 7 years ago

Sorry, I was unclear. The file I attached is the raw file. I tokenize the text before sending it to the POS tagger, adding a space to separate any punctuation from the next character. I assume the fix will work the same way regardless.

Thanks!

datquocnguyen commented 7 years ago

You can re-download RDRPOSTagger and run it on the tokenized corpus. I have just fixed the error by adding the missing punctuation mark '' to the .DICT files. It should now work with la_ittb-upos.DICT and la_ittb-upos.RDR.