Open martinpopel opened 7 years ago
Thanks for pointing this out. We will aim to fix the lemmas in the next release.
As for the code switching, it is more difficult to decide how to handle it. We could open a universal issue on this if other treebanks have the same problem. In the meantime, I think it is OK to flatten the dependencies in the UD conversion: the information is preserved in the original treebank in any case, whereas the UD version is more likely to be used for training parsers, where including Greek dependencies might just confuse the parser.
Dag
On 06/09/2017 11:01 AM, Martin Popel wrote:
In UD_Latin-PROIEL v2.0, about 0.5% of words have an artificial lemma. In train+dev, there are
- 410 |greek.expression|
- 149 |expression|
- 138 |calendar|
- 11 |monetary|
- 9 |calendar.expression|
- 3 |monetary.expression|
For example,
19 esse sum AUX V- Tense=Pres|VerbForm=Inf|Voice=Act 20 cop _ ref=1.1.2
20 ἀδύνατον greek.expression X F- _ 17 xcomp _ ref=1.1.2
21 Curium Curius PROPN Ne Case=Acc|Gender=Masc|Number=Sing 22 obj:dir _ ref=1.1.2

22 tribuniciis tribunicius ADJ A- Case=Abl|Degree=Pos|Number=Plur 21 amod _ ref=1.1.1
23 a calendar ADV Df _ 21 amod _ ref=1.1.1
24 d expression ADV Df _ 23 flat _ ref=1.1.1
25 xvi xvi ADV Df _ 23 flat _ ref=1.1.1
26 Kalend Kalend ADV Df _ 23 flat _ ref=1.1.1
27 Sextilis Sextilis ADV Df _ 23 flat _ ref=1.1.1

15 HS monetary ADV Df _ 14 advmod _ ref=1.6.1
16 CCCIↃↃↃX̅X̅X̅ expression ADV Df _ 15 flat _ ref=1.6.1
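For anyone who wants to reproduce counts like the ones above, here is a minimal Python sketch. It assumes standard tab-separated 10-column CoNLL-U input; the function name `count_artificial_lemmas` and the sample sentence ID are made up for illustration, and the lemma set is only the six values listed in this issue.

```python
from collections import Counter

# The six artificial lemmas reported in this issue (assumed to be the full set).
ARTIFICIAL_LEMMAS = {
    "greek.expression", "expression", "calendar",
    "monetary", "calendar.expression", "monetary.expression",
}

def count_artificial_lemmas(conllu_text):
    """Return (Counter of artificial lemmas, total number of words)."""
    counts = Counter()
    total = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        # Keep only regular words: skip multiword-token ranges (1-2)
        # and empty nodes (1.1), whose IDs are not plain integers.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        total += 1
        if cols[2] in ARTIFICIAL_LEMMAS:
            counts[cols[2]] += 1
    return counts, total

# Three rows from the first example, with empty columns written as "_".
SAMPLE = "\n".join([
    "# sent_id = hypothetical-1",
    "19\tesse\tsum\tAUX\tV-\tTense=Pres|VerbForm=Inf|Voice=Act\t20\tcop\t_\tref=1.1.2",
    "20\tἀδύνατον\tgreek.expression\tX\tF-\t_\t17\txcomp\t_\tref=1.1.2",
    "21\tCurium\tCurius\tPROPN\tNe\tCase=Acc|Gender=Masc|Number=Sing\t22\tobj:dir\t_\tref=1.1.2",
])

counts, total = count_artificial_lemmas(SAMPLE)
```

On the real data one would pass the concatenated contents of the train and dev files instead of `SAMPLE`.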
The guidelines http://universaldependencies.org/u/overview/morphology.html#lemmas say that "The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format)." Moreover, the word form should be uniquely defined by the lemma and FEATS (except for capitalization and other orthographic synonyms). Thus I suggest
- Keep the lemma equal to the form in these cases.
- For foreign phrases, use the standard feature Foreign=Yes http://universaldependencies.org/u/feat/Foreign.html and if they span multiple words, use the flat http://universaldependencies.org/u/dep/flat.html#foreign-phrases deprel.
- For calendar and monetary expressions, design language-specific guidelines that are consistent with the universal guidelines http://universaldependencies.org/u/dep/flat.html#dates-and-complex-numerals. (I think no change is needed here except for fixing the lemmas.)
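The first two suggestions could be applied mechanically. The following Python sketch is only an illustration under assumptions: the helper names `add_feature` and `fix_token` are hypothetical, the lemma set is the one listed in this issue, and it assumes that every token with the lemma greek.expression is foreign material. It does not touch heads or deprels.

```python
# The six artificial lemmas reported in this issue (assumed to be the full set).
ARTIFICIAL_LEMMAS = {
    "greek.expression", "expression", "calendar",
    "monetary", "calendar.expression", "monetary.expression",
}

def add_feature(feats, name, value):
    """Add name=value to a FEATS string, keeping features sorted by name."""
    pairs = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    pairs[name] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)

def fix_token(cols):
    """Return a copy of a 10-column token row with the artificial lemma removed."""
    cols = list(cols)
    lemma = cols[2]
    if lemma in ARTIFICIAL_LEMMAS:
        cols[2] = cols[1]  # suggestion 1: lemma equal to the form
        if lemma == "greek.expression":
            # suggestion 2: mark foreign words with the standard feature
            cols[5] = add_feature(cols[5], "Foreign", "Yes")
    return cols

greek = "20 ἀδύνατον greek.expression X F- _ 17 xcomp _ ref=1.1.2".split()
fixed = fix_token(greek)
```

Calendar and monetary tokens only get their lemma replaced here, in line with the remark that no other change seems needed for them.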
I admit, I feel a bit uneasy with the suggestion to use a flat structure for all foreign phrases because, in the case of UD_Latin-PROIEL, it would mean a loss of information. Currently, some Greek words are annotated with the "correct" dependencies, e.g.:
9 ignoscendum ignosco VERB V- Case=Acc|Gender=Neut|Number=Sing|VerbForm=Gdv 3 ccomp _ ref=1.1.4
10 esse sum AUX V- Tense=Pres|VerbForm=Inf|Voice=Act 9 cop _ ref=1.1.4
11 ἐπεὶ greek.expression X F- _ 9 advmod _ ref=1.1.4
12 οὐχ greek.expression X F- _ 14 flat:foreign _ ref=1.1.4
13 ἱερήϊον greek.expression X F- _ 14 obj:dir _ ref=1.1.4
14 οὐδὲ greek.expression X F- _ 11 advmod _ ref=1.1.4
15 βοεΐην greek.expression X F- _ 14 obj:dir _ ref=1.1.4
Feel free to open a "universal" issue https://github.com/universaldependencies/docs/issues to discuss the cases where the foreign phrase is expected to be understood by the readers, so that it is rather a case of code switching https://en.wikipedia.org/wiki/Code-switching. I think in such cases we can keep the correct dependencies (and deprels) and just use |Foreign=Yes|. However, the current UD_Latin-PROIEL is not consistent in this, as shown in the example above: it uses |flat:foreign|, but only for some words in the Greek phrases, and it goes against the guidelines, which prescribe that "/all subsequent/ words in the expression are attached to the /first/ one".
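The inconsistency is easy to detect automatically. Here is a rough Python sketch of one possible check (the name `flat_violations` is hypothetical): since the guidelines attach all subsequent words to the first one, the head of any flat or flat:* dependent must precede it. This is only a necessary condition, not a full validation of flat structures. The rows are taken from the examples above, with empty columns written as `_`.

```python
def flat_violations(rows):
    """Return IDs of flat / flat:* dependents whose head does not precede them."""
    bad = []
    for cols in rows:
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        deprel = cols[7]
        if deprel == "flat" or deprel.startswith("flat:"):
            # In a flat structure the head is the first word,
            # so it must come before every dependent.
            if int(cols[6]) >= int(cols[0]):
                bad.append(int(cols[0]))
    return bad

GREEK_ROWS = [line.split() for line in [
    "11 ἐπεὶ greek.expression X F- _ 9 advmod _ ref=1.1.4",
    "12 οὐχ greek.expression X F- _ 14 flat:foreign _ ref=1.1.4",
    "13 ἱερήϊον greek.expression X F- _ 14 obj:dir _ ref=1.1.4",
    "14 οὐδὲ greek.expression X F- _ 11 advmod _ ref=1.1.4",
    "15 βοεΐην greek.expression X F- _ 14 obj:dir _ ref=1.1.4",
]]

CALENDAR_ROWS = [line.split() for line in [
    "23 a calendar ADV Df _ 21 amod _ ref=1.1.1",
    "24 d expression ADV Df _ 23 flat _ ref=1.1.1",
    "25 xvi xvi ADV Df _ 23 flat _ ref=1.1.1",
]]

greek_bad = flat_violations(GREEK_ROWS)        # token 12 points forward to 14
calendar_bad = flat_violations(CALENDAR_ROWS)  # heads all precede dependents
```

The check flags token 12 in the Greek example and nothing in the calendar example, which matches the observation that only the Greek phrases are problematic.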
Regarding conventions for date and value expressions, see UniversalDependencies/docs#455