UniversalDependencies / UD_French-GSD


Multiple word spans #8

Closed Eugen2525 closed 5 years ago

Eugen2525 commented 5 years ago

When I use the parser from https://github.com/tdozat/Parser-v2, which has successfully passed the evaluation before, I get an error on multi-word spans if I evaluate with the official script http://universaldependencies.org/conll17/eval.zip . Example: the 2nd sentence of the test set looks like this:

```
1 On on PRON Gender=Masc|Number=Sing|Person=3 2 nsubj
2 pourra pouvoir VERB Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin 0 root
3 toujours toujours ADV 2 advmod
4 parler parler VERB VerbForm=Inf 2 xcomp
5 à à ADP 8 case
6 propos propos NOUN Gender=Masc|Number=Sing 5 fixed
7 d' de ADP 5 fixed SpaceAfter=No
8 Averroès Averroès PROPN 4 obl:mod
9 de de ADP 11 case
10 " " PUNCT 11 punct SpaceAfter=No
11 décentrement décentrement NOUN Gender=Masc|Number=Sing 4 obl:arg
12-13 du
12 de de ADP 14 case
13 le le DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art 14 det
14 Sujet sujet NOUN Gender=Masc|Number=Sing 11 nmod SpaceAfter=No
15 " " PUNCT 11 punct SpaceAfter=No
16 . . PUNCT 2 punct _
```

where 12-13 is the surface token "du", which is decomposed into the syntactic words "de" and "le"; those two words are what the parser actually attaches. The output of the parser is:

```
1 On on PRON 4 nsubj
2 pourra pouvoir AUX 4 aux
3 toujours toujours ADV 4 advmod
4 parler parler VERB 0 root
5 à à ADP 6 case
6 propos propos NOUN 4 obl
7 d' de ADP 8 case
8 Averroès Averroès PROPN 6 nmod
9 de de ADP 11 case
10 " " PUNCT 11 punct
11 décentrement décentrement NOUN 6 nmod
12 de de ADP 14 case
13 le le DET 14 det
14 Sujet sujet NOUN 11 nmod
15 " " PUNCT 11 punct
16 . . PUNCT 4 punct
```

Here the 12-13 range line is missing. The evaluation script reconstructs the raw text of each file from its surface tokens, so the gold text contains "du Sujet" while the system text contains "de le Sujet", and the script fails with:

First 20 differing characters in gold file: 'uSujet".«Ilaétéla' and system file: 'eleSujet".«Ilaété'

Could this be dealt with?
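Not part of the original thread, but one possible workaround is to copy the multi-word token range lines (such as the 12-13 "du" line) from the UDPipe-segmented input back into the parser output before running the evaluation script. A minimal Python sketch, assuming both files contain the same sentences and word lines in the same order; restore_mwt_ranges and the file-path arguments are hypothetical names:

```python
# Hypothetical helper: re-insert multi-word token range lines (e.g. "12-13 du")
# from the segmented input CoNLL-U file into the parser output, which only
# contains the syntactic words. Assumes both files hold the same sentences with
# the same word segmentation, in the same order.
def restore_mwt_ranges(input_path, output_path, merged_path):
    with open(input_path, encoding="utf-8") as f_in, \
         open(output_path, encoding="utf-8") as f_out, \
         open(merged_path, "w", encoding="utf-8") as f_merged:
        # Collect range lines from the input, keyed by (sentence index, first word id).
        ranges = {}
        sent_idx = 0
        for line in f_in:
            if not line.strip():
                sent_idx += 1
            elif not line.startswith("#"):
                token_id = line.split("\t", 1)[0].split(" ", 1)[0]
                if "-" in token_id:  # multi-word token line, e.g. "12-13"
                    first = int(token_id.split("-")[0])
                    ranges[(sent_idx, first)] = line
        # Copy the parser output, inserting each range line before its first word.
        sent_idx = 0
        for line in f_out:
            if not line.strip():
                sent_idx += 1
            elif not line.startswith("#"):
                token_id = line.split("\t", 1)[0].split(" ", 1)[0]
                if token_id.isdigit() and (sent_idx, int(token_id)) in ranges:
                    f_merged.write(ranges[(sent_idx, int(token_id))])
            f_merged.write(line)
```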

jnivre commented 5 years ago

This looks like a problem with the parser, rather than the UD resources, so I think you should contact the parser developer.

dseddah commented 5 years ago

Hi, I recall that this parser uses UDPipe for tokenization and word segmentation. You are not supposed to feed it already segmented input (de + le) and expect it to recover the contracted form (du).

If you want the full chain, it's probably better to use the Stanford Python NLP pipeline they just released.
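As a rough illustration of such a full chain, here is a minimal sketch using the stanfordnlp Python package, which runs tokenization, multi-word token expansion, tagging and parsing from raw text. The model name and attribute names follow the stanfordnlp 0.1.x API as I understand it; treat this as an assumption, not the exact setup recommended in the thread:

```python
# Minimal end-to-end sketch with the stanfordnlp package (API assumed from 0.1.x).
import stanfordnlp

stanfordnlp.download("fr")            # download the French models once
nlp = stanfordnlp.Pipeline(lang="fr") # tokenize, expand MWTs, tag, lemmatize, parse
doc = nlp("On pourra toujours parler à propos d'Averroès de \"décentrement du Sujet\".")

for sentence in doc.sentences:
    for word in sentence.words:
        # index, form, UPOS, head index, dependency relation
        print(word.index, word.text, word.upos, word.governor, word.dependency_relation)
```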

Djamé

Eugen2525 commented 5 years ago

Thanks for your valuable comments; I think I can solve the issue with multi-word spans.

Another issue came up, however. The UDPipe-processed input version and the gold version of this French treebank, as found here: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2184, contain different numbers of sentences.

For example, the following is a single sentence in the UDPipe output but two separate sentences in the gold file: très bon accueil les chambres sont très agréables et spacieuses bon rapport qualité prix un café 17 euros !
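A small check, not from the thread, that can confirm the sentence-count mismatch: CoNLL-U sentences are separated by blank lines, so counting blocks is enough. The file names below are placeholders:

```python
# Count sentence blocks (separated by blank lines) in a CoNLL-U file.
def count_sentences(path):
    count, in_sentence = 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                in_sentence = True
            elif in_sentence:
                count += 1
                in_sentence = False
    if in_sentence:  # file may not end with a trailing blank line
        count += 1
    return count

# Placeholder file names for the UDPipe-processed and gold versions.
print(count_sentences("fr_gsd-udpipe.conllu"), count_sentences("fr_gsd-ud-test.conllu"))
```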

Maybe this question is not for you; I just wanted to document the issue.

Thanks

dseddah commented 5 years ago

Hi, you need to go back to the definition of the CoNLL 2017 and 2018 UD shared tasks. They were about end-to-end parsing with no gold tokenization, morphology, or segmentation provided. It's perfectly normal to have mismatches at all levels between the gold and predicted data, even when the predicted data was produced by UDPipe.
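For illustration only (this is not the code of the official eval.zip script): the error message quoted earlier suggests the evaluation reconstructs one character sequence per file from the surface tokens, ignoring sentence boundaries, and aligns words by their character spans, which is why differing sentence splits alone do not break the evaluation. A simplified sketch of that idea:

```python
# Simplified illustration: words are located by character offsets in the
# concatenated surface text, so sentence boundaries play no role in alignment.
def character_spans(forms):
    """Map each form to its (form, start, end) span in the concatenated text."""
    spans, offset = [], 0
    for form in forms:
        spans.append((form, offset, offset + len(form)))
        offset += len(form)
    return spans

# Same words, one sentence in the system output, two sentences in the gold file:
system_tokens = ["très", "bon", "accueil", "les", "chambres", "sont", "agréables"]
gold_tokens   = ["très", "bon", "accueil"] + ["les", "chambres", "sont", "agréables"]

# The extra sentence boundary never enters the character sequence, so all spans match.
print(character_spans(system_tokens) == character_spans(gold_tokens))  # True
```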

Best, Djamé
