datquocnguyen / RDRPOSTagger

A fast and accurate POS and morphological tagging toolkit (EACL 2014)
http://rdrpostagger.sourceforge.net

Single double quote, ", becomes two single quotes in output. #16

Closed (matgrioni closed 7 years ago)

matgrioni commented 7 years ago

When there is a double quote in the source, the output consists of two single quotes. The line in question is

multos aut affectatio alienae fortunae aut suae querella querella Madvig : qua A : cura Haase . detinuit ; plerosque nihil certum sequentis vaga et inconstans et sibi displicens levitas per nova consilia iactavit ; quibusdam nihil , quo cursum derigant , placet , sed marcentis oscitantisque fata deprendunt , adeo ut quod apud maximum poetarum more oraculi dictum est , verum esse non dubitem : " Exigua pars est vitae , qua vivimus .  " Ceterum quidem omne spatium non vita sed tempus est .

whose tokens have been space-separated ;). In the output, the double quote after dubitem becomes two single quotes, which is a problem for my purposes.

matgrioni commented 7 years ago

Again, I am using Latin, and my model is the UD_Latin POS DICT and RDR file. There was no " PUNCT rule in the DICT file, but after adding one, the same problem persists.

datquocnguyen commented 7 years ago

It is not a problem. I follow the Penn Treebank standard, where two single quotation marks ('') are used instead of a double quotation mark ("). You can post-process the output for your purpose. You might also want to use the UD_Latin-ITTB model, for which we have more training data than for UD_Latin (they are just two different Latin datasets), leading to better results: UD_Latin 81.72%, UD_Latin-ITTB 96.87%.
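A minimal sketch of the post-processing suggested above, in Python. It is not part of RDRPOSTagger: the function name `restore_double_quotes` is hypothetical, and it assumes the tagged output holds one whitespace-separated `token/TAG` item per input token, in the same order. Aligning against the source line, rather than blindly replacing every `''`, leaves untouched any `''` that was already present in the input.

```python
def restore_double_quotes(source_line: str, tagged_line: str) -> str:
    """Put back the original '"' wherever the tagger emitted "''".

    Assumption (not confirmed by the thread): the output has exactly one
    token/TAG item per whitespace-separated source token, in order.
    """
    source_tokens = source_line.split()
    tagged_items = tagged_line.split()
    if len(source_tokens) != len(tagged_items):
        raise ValueError("token counts differ; cannot align input and output")

    restored = []
    for source_token, item in zip(source_tokens, tagged_items):
        # Split on the last slash so tokens that contain '/' stay intact.
        token, sep, tag = item.rpartition("/")
        if sep and token == "''" and source_token == '"':
            token = '"'
        restored.append(f"{token}{sep}{tag}")
    return " ".join(restored)


if __name__ == "__main__":
    # The tags below are illustrative, not real tagger output.
    source = 'dubitem : " Exigua pars'
    tagged = "dubitem/VERB :/PUNCT ''/PUNCT Exigua/ADJ pars/NOUN"
    print(restore_double_quotes(source, tagged))
```

If the item counts differ, the function raises an error instead of guessing, since a misaligned replacement would silently corrupt the output.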