UniversalDependencies / UD_Latin-PROIEL

Latin data from the PROIEL treebank.

"expression" lemmas #1

Open martinpopel opened 7 years ago

martinpopel commented 7 years ago

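In UD_Latin-PROIEL v2.0, about 0.5% of words have an artificial lemma. In train+dev, there are

  • 410 greek.expression
  • 149 expression
  • 138 calendar
  • 11 monetary
  • 9 calendar.expression
  • 3 monetary.expression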

For example,

19      esse    sum     AUX     V-      Tense=Pres|VerbForm=Inf|Voice=Act       20      cop     _       ref=1.1.2
20      ἀδύνατον        greek.expression        X       F-      _       17      xcomp   _       ref=1.1.2
21      Curium  Curius  PROPN   Ne      Case=Acc|Gender=Masc|Number=Sing        22      obj:dir _       ref=1.1.2

22      tribuniciis     tribunicius     ADJ     A-      Case=Abl|Degree=Pos|Number=Plur 21      amod    _       ref=1.1.1
23      a       calendar        ADV     Df      _       21      amod    _       ref=1.1.1
24      d       expression      ADV     Df      _       23      flat    _       ref=1.1.1
25      xvi     xvi     ADV     Df      _       23      flat    _       ref=1.1.1
26      Kalend  Kalend  ADV     Df      _       23      flat    _       ref=1.1.1
27      Sextilis        Sextilis        ADV     Df      _       23      flat    _       ref=1.1.1

15      HS      monetary        ADV     Df      _       14      advmod  _       ref=1.6.1
16      CCCIↃↃↃX̅X̅X̅      expression      ADV     Df      _       15      flat    _       ref=1.6.1
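To see how often such placeholder lemmas occur in a file, a short script along these lines might help. It is only a minimal sketch: it assumes tab-separated CoNLL-U input with the lemma in the third column, and the set ARTIFICIAL_LEMMAS is simply the list of placeholder lemmas named in this issue, not an official inventory. The function name is made up for illustration.

```python
from collections import Counter

# Assumption: the placeholder lemmas reported in this issue.
ARTIFICIAL_LEMMAS = {
    "greek.expression", "expression", "calendar",
    "monetary", "calendar.expression", "monetary.expression",
}


def count_artificial_lemmas(conllu_lines):
    """Return (Counter of artificial lemmas, total number of words)."""
    counts = Counter()
    total = 0
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        # Keep only regular word lines; skip multiword-token ranges
        # such as "1-2" and empty nodes such as "1.1".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        total += 1
        if cols[2] in ARTIFICIAL_LEMMAS:
            counts[cols[2]] += 1
    return counts, total


# Three word lines taken from the example above (ref=1.1.1).
SAMPLE = [
    "\t".join(["22", "tribuniciis", "tribunicius", "ADJ", "A-",
               "Case=Abl|Degree=Pos|Number=Plur", "21", "amod", "_", "ref=1.1.1"]),
    "\t".join(["23", "a", "calendar", "ADV", "Df", "_", "21", "amod", "_", "ref=1.1.1"]),
    "\t".join(["24", "d", "expression", "ADV", "Df", "_", "23", "flat", "_", "ref=1.1.1"]),
]

counts, total = count_artificial_lemmas(SAMPLE)
```

Passing an open file object instead of SAMPLE should work the same way, since the function only iterates over lines.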

The guidelines say that "The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead; see format)." Moreover, the word form should be uniquely defined by the lemma and FEATS (except for capitalization and other orthographic synonyms). Thus I suggest

I admit I feel a bit uneasy with the suggestion to use a flat structure for all foreign phrases, because in the case of UD_Latin-PROIEL it would mean a loss of information. Currently, some Greek words are annotated with the "correct" dependencies, e.g.:

9       ignoscendum     ignosco VERB    V-      Case=Acc|Gender=Neut|Number=Sing|VerbForm=Gdv   3       ccomp   _       ref=1.1.4
10      esse    sum     AUX     V-      Tense=Pres|VerbForm=Inf|Voice=Act       9       cop     _       ref=1.1.4
11      ἐπεὶ    greek.expression        X       F-      _       9       advmod  _       ref=1.1.4
12      οὐχ     greek.expression        X       F-      _       14      flat:foreign    _       ref=1.1.4
13      ἱερήϊον greek.expression        X       F-      _       14      obj:dir _       ref=1.1.4
14      οὐδὲ    greek.expression        X       F-      _       11      advmod  _       ref=1.1.4
15      βοεΐην  greek.expression        X       F-      _       14      obj:dir _       ref=1.1.4

Feel free to open a "universal" issue to discuss the cases where the foreign phrase is expected to be understood by the readers, so that it is really code-switching. I think in such cases we can keep the correct dependencies (and deprels) and just use Foreign=Yes. However, the current UD_Latin-PROIEL is not consistent in this, as the example above shows: it uses flat:foreign, but only for some words in the Greek phrases, and it goes against the guidelines, which prescribe that "all subsequent words in the expression are attached to the first one".
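The inconsistency with flat:foreign can be spotted mechanically. The sketch below checks only a necessary condition of the quoted guideline: a flat:foreign word should attach to an earlier word that is not itself flat:foreign. It does not verify that the head really is the first word of the expression. The function name and the (id, head, deprel) tuple layout are assumptions made for illustration.

```python
def flat_foreign_violations(tokens):
    """Return ids of flat:foreign tokens with a suspicious head.

    tokens: iterable of (id, head, deprel) tuples for one sentence.
    A token is reported when its head follows it in the sentence or
    when its head is itself attached as flat:foreign.
    """
    tokens = list(tokens)
    deprel_of = {tok_id: deprel for tok_id, _, deprel in tokens}
    bad = []
    for tok_id, head, deprel in tokens:
        if deprel != "flat:foreign":
            continue
        if head >= tok_id or deprel_of.get(head) == "flat:foreign":
            bad.append(tok_id)
    return bad


# Words 11-15 of the Greek example above (ref=1.1.4), as (id, head, deprel).
GREEK_SAMPLE = [
    (11, 9, "advmod"),
    (12, 14, "flat:foreign"),
    (13, 14, "obj:dir"),
    (14, 11, "advmod"),
    (15, 14, "obj:dir"),
]

violations = flat_foreign_violations(GREEK_SAMPLE)
```

On this sample, word 12 is reported because its head (14) comes later in the sentence.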

daghaug commented 7 years ago

Thanks for pointing this out. We will aim to fix the lemmas in the next release.

As for the code-switching, it is more difficult to decide how to handle it. We could open a universal issue on this if there are other treebanks with this problem. In the meantime I think it is ok to flatten the dependencies in the UD conversion, as the information is in any case preserved in the original treebank, whereas the UD version is more likely to be used for training parsers, in which case including Greek dependencies might just confuse the parser.

Dag
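To make the flattening idea concrete, here is one way it could look. This is a rough sketch and not the actual PROIEL-to-UD conversion: the function flatten_foreign, the dict layout, and the is_foreign test are all assumptions for illustration. Each contiguous run of foreign words becomes a flat structure headed by its first word, which inherits the attachment of the run's original root. Runs with more than one external attachment are left untouched.

```python
def flatten_foreign(tokens, is_foreign):
    """Return a copy of tokens with foreign runs flattened.

    tokens: list of dicts with integer "id" and "head" and a string
    "deprel", in sentence order. The input list is not modified.
    """
    out = [dict(t) for t in tokens]
    i = 0
    while i < len(out):
        if not is_foreign(out[i]):
            i += 1
            continue
        j = i
        while j < len(out) and is_foreign(out[j]):
            j += 1
        span = out[i:j]  # same dict objects as in `out`
        ids = {t["id"] for t in span}
        # Words in the run whose head lies outside the run.
        roots = [t for t in span if t["head"] not in ids]
        if len(roots) == 1:
            first = span[0]
            ext_head, ext_deprel = roots[0]["head"], roots[0]["deprel"]
            first["head"], first["deprel"] = ext_head, ext_deprel
            for t in span[1:]:
                t["head"], t["deprel"] = first["id"], "flat:foreign"
            # Outside dependents of the run now hang off its first word.
            for t in out:
                if t["id"] not in ids and t["head"] in ids:
                    t["head"] = first["id"]
        i = j
    return out


# Words 9-15 of the example quoted in this thread (forms omitted).
SAMPLE = [
    {"id": 9, "lemma": "ignosco", "head": 3, "deprel": "ccomp"},
    {"id": 10, "lemma": "sum", "head": 9, "deprel": "cop"},
    {"id": 11, "lemma": "greek.expression", "head": 9, "deprel": "advmod"},
    {"id": 12, "lemma": "greek.expression", "head": 14, "deprel": "flat:foreign"},
    {"id": 13, "lemma": "greek.expression", "head": 14, "deprel": "obj:dir"},
    {"id": 14, "lemma": "greek.expression", "head": 11, "deprel": "advmod"},
    {"id": 15, "lemma": "greek.expression", "head": 14, "deprel": "obj:dir"},
]

flattened = flatten_foreign(SAMPLE, lambda t: t["lemma"] == "greek.expression")
```

With this input, word 11 keeps its attachment to word 9 and words 12 to 15 are attached to word 11 as flat:foreign. Selecting foreign words by lemma only works while the placeholder lemmas exist; after the lemmas are fixed, another marker would be needed.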

On 06/09/2017 11:01 AM, Martin Popel wrote:

In UD_Latin-PROIEL v2.0, about 0.5% of words have an artificial lemma. In train+dev, there are

  • 410 greek.expression
  • 149 expression
  • 138 calendar
  • 11 monetary
  • 9 calendar.expression
  • 3 monetary.expression

nschneid commented 7 years ago

Regarding conventions for date and value expressions, see UniversalDependencies/docs#455