UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Problem with conllu_to_conll.pl and restore_conllu_lines.pl files #61

Closed: alirezamshi-zz closed this issue 4 years ago

alirezamshi-zz commented 4 years ago

Hello, I think there is a bug in conllu_to_conllx.pl and restore_conllu_lines.pl. Here is what I ran for Swedish:

perl conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conll

Then I converted the result back to CoNLL-U format:

perl restore_conllu_lines.pl sv_talbanken-ud-test.conll sv_talbanken-ud-test.conllu > sv_talbanken-ud-test.conllu.merged

Then I ran the official UD evaluation script on "sv_talbanken-ud-test.conllu.merged" and "sv_talbanken-ud-test.conllu", but it crashed with the following error:

__main__.UDError: The concatenation of tokens in gold file and in system file differ! First 20 differing characters in gold file: 'kbasbeloppetvidsamma' and system file: '_kbasbeloppetvidsamm'

The same thing happened with "tr-imst-ud-test.conllu" and "ru_syntagrus-ud-test.conllu".
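
For anyone hitting the same error, a rough way to locate the first offending token is to compare the FORM column of the two files line by line. This is only a sketch in Python, not part of the UD tools: the function names `token_forms` and `first_form_mismatch` are made up for illustration, and the sketch assumes both files contain the same token lines in the same order.

```python
def token_forms(conllu_text):
    """Yield (line_number, FORM) for each token line; comments and blank lines are skipped."""
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) >= 2:
            yield line_number, columns[1]


def first_form_mismatch(gold_text, system_text):
    """Return (gold_line, gold_form, system_line, system_form) for the first differing FORM.

    Returns None if every paired FORM is identical. Assumes (hypothetically) that both
    inputs have the same number of token lines in the same order.
    """
    pairs = zip(token_forms(gold_text), token_forms(system_text))
    for (gold_line, gold_form), (system_line, system_form) in pairs:
        if gold_form != system_form:
            return gold_line, gold_form, system_line, system_form
    return None
```

Calling `first_form_mismatch(open(gold_path).read(), open(merged_path).read())` with the two file names would then point at the first token line whose FORM was changed by the round trip.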

dan-zeman commented 4 years ago

conllu_to_conllx.pl < sv_talbanken-ud-test.conllu > test.conll
restore_conllu_lines.pl test.conll sv_talbanken-ud-test.conllu > test.conllu
conllu-align-tokens.pl sv_talbanken-ud-test.conllu test.conllu > /dev/null
Non-whitespace character mismatch. Gold line no. 86, offset 321, buffer 'sk'. System line no. 86, buffer 's_k'. at /net/work/people/zeman/unidep/tools/conllu-align-tokens.pl line 112.

dan-zeman commented 4 years ago

Hmm, the problem is that restoring the extra CoNLL-U lines from the original CoNLL-U file is not enough. CoNLL-U allows words with spaces, while CoNLL-X does not, so conllu_to_conllx.pl replaces word-internal spaces with underscores. restore_conllu_lines.pl only adds the extra lines back; it does not try to fix the FORM field of the token lines. In such cases the output file is not even valid CoNLL-U, because the # text line does not match the FORM column.

I'll see whether I can make restore_conllu_lines.pl deal with this properly.
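
To illustrate the idea, here is a minimal sketch in Python (not the Perl script itself, and not necessarily how the fix will look). It copies FORM back from the original CoNLL-U file for every basic token line. The function names are hypothetical, and the sketch assumes the converted file has the same basic tokens in the same order as the original and that only FORM needs restoring; other columns are left untouched.

```python
def basic_token_forms(conllu_text):
    """Collect FORM values of basic token lines (plain integer IDs) from CoNLL-U text."""
    forms = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # isdigit() skips multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if columns[0].isdigit():
            forms.append(columns[1])
    return forms


def restore_forms(original_conllu, converted_conll):
    """Put the original FORM back wherever the converted FORM differs only by ' ' -> '_'."""
    original_forms = iter(basic_token_forms(original_conllu))
    restored = []
    for line in converted_conll.splitlines():
        columns = line.split("\t")
        if line and not line.startswith("#") and columns[0].isdigit():
            original_form = next(original_forms, None)
            if original_form is None:
                raise ValueError("converted file has more tokens than the original")
            # Only overwrite when the difference is exactly the space-to-underscore change.
            if original_form.replace(" ", "_") == columns[1]:
                columns[1] = original_form
            line = "\t".join(columns)
        restored.append(line)
    return "\n".join(restored) + "\n"
```

The equality check before overwriting is a deliberate safety measure: a FORM that already contained an underscore in the original is left as it is, and a token that does not line up with the original is not silently replaced.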