UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

New prototypes for flat #974

nschneid closed this issue 1 year ago

nschneid commented 1 year ago

The current flat guidelines give four kinds of headless structures: 1) names, 2) dates, 3) complex numerals, 4) foreign phrases. While some of the examples ("Hillary Rodham Clinton") are clearly correct, others might be amenable to a headed treatment instead, a question that has come up in a number of discussions (e.g. #455).

Here is a proposed alternative for discussion (this would follow the general definition of flat as a structure with no single head):

The prototypes for flat are:

  • (a) personal names (or parts thereof) that lack the hallmarks of general grammatical constructions in the language (e.g. "Hillary Rodham Clinton")
  • (b) foreign expressions that may be borrowed or quoted, but whose original grammatical structure is not necessarily accessible to speakers of the language(s) being annotated. "Foreign" includes not just natural languages but also notational systems that are considered external to natural language proper and are governed by separate rules (e.g., musical chord progressions, software code excerpts). Foreign status should additionally be indicated with the feature Foreign=Yes (the subtyped relation flat:foreign is not recommended).
  • (c) items that occur in an iconic sequence rather than in head-dependent or coordination relationships (e.g. "do re mi"), including onomatopoeia ("quack quack quack") and gibberish ("blargety blarg blarg")
  • (d) items separated into parts for readability (e.g., telephone numbers; contrast goeswith, which addresses improper spacing, and space-separated numerals like "1 000 000", which may be treated as single words)
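
To make the shape of a flat annotation concrete, here is a minimal Python sketch that emits CoNLL-U rows for a span such as "Hillary Rodham Clinton". It assumes the usual UD convention that the first token carries the span's external relation and every later token attaches to it with flat. The helper name `flat_rows` and its parameters are hypothetical, and the lemma column is left as a placeholder.

```python
def flat_rows(forms, upos="PROPN", foreign=False, head=0, deprel="root"):
    """Return CoNLL-U lines for a flat span occupying token IDs 1..n.

    Assumption: the first token takes the external relation (`deprel`,
    attached to `head`) and all later tokens attach to token 1 via `flat`.
    """
    # Prototype (b): foreign material is marked with the Foreign=Yes feature.
    feats = "Foreign=Yes" if foreign else "_"
    rows = []
    for i, form in enumerate(forms, start=1):
        if i == 1:
            h, rel = head, deprel
        else:
            h, rel = 1, "flat"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, "_", upos, "_", feats, str(h), rel, "_", "_"]
        rows.append("\t".join(cols))
    return rows


if __name__ == "__main__":
    print("\n".join(flat_rows(["Hillary", "Rodham", "Clinton"])))
```

Calling `flat_rows(["carpe", "diem"], upos="X", foreign=True)` would mark both tokens with Foreign=Yes while keeping the plain flat relation, in line with the recommendation against flat:foreign.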

What is considered to be transparent linguistic syntax (as opposed to flat structure) is subject to treebank-specific policies (e.g., some treebanks might provide proper grammatical analyses in the presence of code-switching, or treat mathematical notation as following linguistic strategies like predication).

The application of flat may extend beyond the prototypical cases to, e.g., various kinds of name and number expressions. However, even if an expression is idiosyncratic or follows a specialized pattern, every effort should be made to find a head rather than employing flat. If a head can be found but no substantive dependency relation is appropriate, dep can be used.
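
The order of preference in the paragraph above can be written as a small decision function. This is only a sketch of the proposal's wording; the function name `choose_relation` and its arguments are hypothetical, and deciding whether a head or a substantive relation exists remains a matter of linguistic judgment.

```python
from typing import Optional


def choose_relation(has_head: bool, substantive_relation: Optional[str]) -> str:
    """Pick a relation label following the proposed order of preference."""
    if has_head and substantive_relation:
        # A head and a fitting relation exist: use the ordinary relation.
        return substantive_relation
    if has_head:
        # A head can be found, but no substantive relation is appropriate.
        return "dep"
    # No single head can be identified: fall back to flat.
    return "flat"
```

For example, `choose_relation(True, "nmod")` keeps the headed analysis, `choose_relation(True, None)` yields dep, and `choose_relation(False, None)` yields flat.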

dan-zeman commented 1 year ago

  • numerals like "1 000 000"

I would leave this example out to avoid confusion. It is actually a prototypical example (for me, maybe the only example) of a legitimate word-with-spaces in languages like Czech or French. See the first paragraph here.

nschneid commented 1 year ago

Ah, I couldn't recall whether there was already a policy on that. Updated.

nschneid commented 1 year ago

Should the part about onomatopoeia be qualified: "though lexicalized combinations (tick tock) may be treated as compounds"?

Stormur commented 1 year ago

I would suggest flat:redup for those.

nschneid commented 1 year ago

Closing in favor of #989, which incorporates the prototypes. Per the latest discussion there, we are not making any universal recommendation about specific subtypes, but languages may wish to use them.