UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Null subjects #589

Open prokopidis opened 6 years ago

prokopidis commented 6 years ago

Hi,

Has there been any discussion on annotating null subjects? For example, in Greek as in other languages one can say:

# Πήρα το λεωφορείο
# took-1-Sg the bus
# Ξέχασα να τηλεφωνήσω
# forgot-1-Sg to call-1-Sg

Could a null node be used in the enhanced representation for such cases? The first example could be annotated as:

1   Πήρα    παίρνω  VERB    VbMn    Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Past|Voice=Act  0   root    _   _
1.1 null null _    _     Number=Sing|Person=1        _       _       1:nsubj      _
2   το  ο   DET AtDf    Case=Acc|Definite=Def|Gender=Neut|Number=Sing|PronType=Art  3   det _   _
3   λεωφορείο   λεωφορείο   NOUN    NoCm    Case=Acc|Gender=Neut|Number=Sing    1   obj _   _

The null subject could also be used for annotating control structures:

1   Ξέχασα  ξεχνώ   VERB    VbMn    Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Past|Voice=Act  0   root    _   _
1.1 null null _    _     Number=Sing|Person=1        _       _       1:nsubj|3:nsubj      _
2   να  να  PART    PtSj    ParticleType=Sub    3   aux _   _
3   τηλεφωνήσω  τηλεφωνώ    VERB    VbMn    Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Voice=Act 1   xcomp   _   _
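
For readers who want to check how such lines would be read programmatically, here is a minimal Python sketch. It assumes tab-separated CoNLL-U with ten columns, treats any ID containing a dot as an empty node, and collects the `head:deprel` pairs from the DEPS column. The function name `enhanced_edges` and the abbreviated sample rows are illustrative only, and the sample uses `nsubj` for both edges of the null node.

```python
# Minimal sketch: read enhanced edges (DEPS column) from CoNLL-U lines.
# Assumes tab-separated fields; sample rows are abbreviated for illustration.
SAMPLE = "\n".join([
    "1\tΞέχασα\tξεχνώ\tVERB\t_\tNumber=Sing|Person=1\t0\troot\t0:root\t_",
    "1.1\tnull\tnull\t_\t_\tNumber=Sing|Person=1\t_\t_\t1:nsubj|3:nsubj\t_",
    "2\tνα\tνα\tPART\t_\t_\t3\taux\t3:aux\t_",
    "3\tτηλεφωνήσω\tτηλεφωνώ\tVERB\t_\tNumber=Sing|Person=1\t1\txcomp\t1:xcomp\t_",
])


def enhanced_edges(conllu_text):
    """Return (dependent_id, head_id, relation, is_empty_node) tuples."""
    edges = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        node_id, deps = cols[0], cols[8]
        if "-" in node_id or deps == "_":
            continue  # skip multiword token ranges and rows without DEPS
        for item in deps.split("|"):
            head, rel = item.split(":", 1)  # the relation may itself contain ':'
            edges.append((node_id, head, rel, "." in node_id))
    return edges


for edge in enhanced_edges(SAMPLE):
    print(edge)
```

The null node contributes two edges here, one to each verb, which is the argument sharing that the control example is meant to express.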

dan-zeman commented 6 years ago

Technically the enhanced graph is capable of capturing elided subjects this way, but the guidelines do not allow it. The last sentence at http://universaldependencies.org/u/overview/enhanced-syntax.html#additional-enhancements says: "anything beyond the above additions is not part of the UD standard and should not be added to the officially released treebanks." A new version of the guidelines would be needed for this to become part of the standard.

BTW, if we discuss this addition in the future, it does not make sense to restrict it to subjects. Other core arguments can be omitted as well, and so can oblique arguments and obligatory adjuncts.

msklvsk commented 6 years ago

The subject here is encoded in the finiteness of the verb (its person). An xcomp without an enhanced subject means its subject is the higher verb’s subject (or else it’s ccomp; upd: not sure, see https://github.com/UniversalDependencies/docs/issues/200#issuecomment-439159175).
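
The xcomp convention described above can be sketched in Python roughly as follows. This is a simplification that assumes subject control only (other controllers, such as objects, are ignored), and the token dictionaries and the function name `propagate_xcomp_subjects` are made up for illustration. When the governing verb has no overt subject token, as in the Greek example, the function has no node to link and reports the xcomp as unresolved.

```python
def propagate_xcomp_subjects(tokens):
    """Add enhanced nsubj edges to xcomp dependents from the governor's subject.

    tokens: list of dicts with 'id', 'head', 'deprel' (basic tree only).
    Returns (edges, unresolved): edges are (subject_id, xcomp_id, 'nsubj');
    unresolved lists the ids of xcomp tokens whose governor has no overt subject.
    """
    subject_of = {}
    for tok in tokens:
        if tok["deprel"].startswith("nsubj"):
            subject_of[tok["head"]] = tok["id"]
    edges, unresolved = [], []
    for tok in tokens:
        if tok["deprel"] == "xcomp":
            subj = subject_of.get(tok["head"])
            if subj is None:
                unresolved.append(tok["id"])
            else:
                edges.append((subj, tok["id"], "nsubj"))
    return edges, unresolved


# English "She forgot to call": an overt subject token is available to share.
english = [
    {"id": 1, "head": 2, "deprel": "nsubj"},
    {"id": 2, "head": 0, "deprel": "root"},
    {"id": 3, "head": 4, "deprel": "mark"},
    {"id": 4, "head": 2, "deprel": "xcomp"},
]
# Greek "Ξέχασα να τηλεφωνήσω": there is no subject token at all.
greek = [
    {"id": 1, "head": 0, "deprel": "root"},
    {"id": 2, "head": 3, "deprel": "aux"},
    {"id": 3, "head": 1, "deprel": "xcomp"},
]

print(propagate_xcomp_subjects(english))
print(propagate_xcomp_subjects(greek))
```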

So far, no information is lost. But there are cases where the null subject is needed.

For the discussion on other elided non-predicates, there is https://github.com/UniversalDependencies/docs/issues/533.

adam-przepiorkowski commented 6 years ago

We have a discussion of this issue in publications accompanying the UD_Polish-LFG treebank (available from http://zil.ipipan.waw.pl/AdamPrzepiorkowski, search for “From Lexical Functional Grammar”), where we identify it as the main source of information loss in the conversion from the Polish LFG structure bank to the corresponding Polish UD treebank. Some of the examples discussed there involve non-subjects, including non-subject controllers. We strongly believe that the next version of the guidelines should lift this restriction.

Best, Adam P.

dseddah commented 6 years ago

Hi Dan, I was actually working under the assumption that the Enhanced guidelines were still a work in progress and thus subject to change?

Best, Djamé

dan-zeman commented 6 years ago

Hi Djamé, yes, I also think that enhanced guidelines should further evolve. But until that change happens, I suppose that the guidelines that are part of UD v2 are valid.

dseddah commented 6 years ago

I don't know if they are valid or not. I plead guilty: after last March's workshop I didn't write up our proposal for enhanced dependencies + diathesis neutralization. @adam-przepiorkowski Maybe it would be simpler to use the new extended CoNLL-U format and add whatever you want in an extended column?

  1-8: surface UD
  9: Enhanced UD
  10: MISC
  11: your own Enhanced UD, where you fill out everything (not only the additional edges to Enhanced UD)

So you would only need to insert that empty token before the verb (if that's possible with a 0.1 index).
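
A minimal Python sketch of reading such an extended file, assuming the CoNLL-U Plus convention of declaring the columns in a `# global.columns = ...` first line. The extra column name `MYTB:EDEPS` and the helper `read_conllu_plus` are hypothetical, chosen only for illustration, and the sample rows are abbreviated.

```python
# Sketch under assumptions: an 11th, treebank-specific column holds a
# custom enhanced graph; its name MYTB:EDEPS is invented for this example.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC MYTB:EDEPS",
    "1\tΠήρα\tπαίρνω\tVERB\t_\t_\t0\troot\t_\t_\t0:root",
    "1.1\tnull\tnull\t_\t_\t_\t_\t_\t_\t_\t1:nsubj",
])


def read_conllu_plus(text):
    """Yield one dict per token row, keyed by the declared column names."""
    columns = None
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing '# global.columns' declaration")
            yield dict(zip(columns, line.split("\t")))


for row in read_conllu_plus(SAMPLE):
    print(row["ID"], row["FORM"], row["MYTB:EDEPS"])
```

Keeping the custom graph in its own column leaves the standard DEPS column untouched, so the file can still be reduced to plain CoNLL-U by dropping the extra column.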

msklvsk commented 6 years ago

@adam-przepiorkowski #533

Still, I think neither example needs a reconstructed subject. It is the coreference layer’s job to mark an antecedent for a Person inside a verb. It’s not even necessary for some pro-drop languages to have an overt subject. You need a null subject to encode predication when otherwise it’s impossible to tell a phrase from a clause (besides general cases of non-predicate elision).

prokopidis commented 6 years ago

An overt pronominal subject is indeed redundant in the Greek examples above and, in many contexts, may even render the sentences weird for a native speaker. I think, however, that a null node could be handy for annotating argument sharing between matrix and dependent clauses (as in Controlled/raised subjects) and coreference links, on the basis of the current (ED) format specification. Actually, null nodes (or rather hidden nodes that become visible in specific annotation contexts in TrEd) are used for coreference annotation in the original Greek treebank. The relevant dependencies are currently lost during the conversion to UD. Thanks @msklvsk and @adam-przepiorkowski for the links to relevant discussions. I used a 1.1 index because, as @dseddah said, I do not know whether 0.1 is possible.

Stormur commented 6 years ago

Could you please give an example where a node for the null subject is necessary (or just useful) for such kinds of annotations and where dependencies are lost? Isn't all the necessary information already morphologically encoded in the verb, as @msklvsk says? By the way, I think that in general the definition of "pro-drop" is misleading, as in these languages there is actually no drop: on the contrary, the specification of a subject is an addition (for emphasis or other).

bulbulistan commented 6 years ago

By the way, I think that in general the definition of "pro-drop" is misleading, as in these languages there is actually no drop: on the contrary, the specification of a subject is an addition (for emphasis or other).

It's pernicious nonsense, that's what it is, along with stuff like "prepositional phrase" and "null subject". What's next, do we annotate XPs and CPs? I have no problem with individual treebanks using the misc column for whatever the authors of the treebank think is appropriate and useful, but UD should remain as theory neutral as possible. "null subject" is as far from theory-neutral as can be, unless of course you think that the terminology of GG/P&P/Minimalism is just the dominant metalanguage of modern linguistics which it is most definitely not (cf. https://dlc.hypotheses.org/1392).

prokopidis commented 6 years ago

@stormur, it would be useful for annotating, for example, co-reference.

Suppose we have

Η Μαρία άνοιξε την πόρτα
Maria open-3-sg-past the door

Μετά κρέμασε το παλτό της
Then hang-3-sg-past her coat

A null node attached to hang-3-sg-past would be used for linking της/her to. Then by linking the null node to Μαρία, a coreference chain can be created.

It would also be useful for annotating constructions like the Controlled/raised subjects of the current ED schema. For example, a graph similar to the "She seems to be reading a book" example of the ED guidelines cannot be built (or, rather, I do not know how to create it in an annotation editor) for the following:

Μοιάζει να διαβάζει ένα βιβλίο
seem-3-sg-pres to read a book

dan-zeman commented 6 years ago

I agree that nodes representing arguments that are not overtly expressed by independent words would be useful for coreference resolution, but coreference resolution itself is not part of UD (not even enhanced UD). Like many other things that are useful but not part of UD, it can be annotated on top of UD, using either the MISC column, or the CoNLL-U Plus file format.

adam-przepiorkowski commented 6 years ago

@bulbulistan How is null subject theory-internal and null predicate theory neutral? Once UD makes it possible to represent the latter, I see no reason not to represent the former.

@dan-zeman Sharing of null dependents is not co-reference, it's (at least partially) a syntactic phenomenon. For example, co-reference makes it possible to switch from grammatical gender to natural gender (think of Mädchen in German or dziewczę in Polish – ‘girl’ in neuter gender, which later may be referred to by a feminine pronoun), dependent sharing doesn't.

bulbulistan commented 6 years ago

@adam-przepiorkowski:

  1. Are you talking about null nodes for elided predicates in ED? Because ellipsis is a very different phenomenon.
  2. Null subject is something generativists came up with (according to some, it can be traced to Luigi Rizzi, Issues in Italian Syntax, Dordrecht: Foris, 1982) and, as such, it is tied to their theory of sentence production. If you want to use it typologically without declaring your allegiance to GG/P&P/Minimalism, well, I ain't stopping you, but I will disagree with you, because - and here's where the pernicious thing comes in - it's a misnomer, since nothing gets dropped. To use the example provided by @prokopidis: μοιάζει > -ει, there's your subject.

sylvainkahane commented 6 years ago

@prokopidis About your examples:

Η Μαρία άνοιξε την πόρτα
Maria open-3-sg-past the door

Μετά κρέμασε το παλτό της
Then hang-3-sg-past her coat

What is probably missing here is not a null subject. What could be missing is a morpheme-level analysis, where lexemes and inflectional morphemes are nodes and can receive an index or a coreference link. In fact, inflectional morphemes (such as 3 and sg in the examples) are syntactic units. Dependency analysis is usually done at the word-level for the sake of simplicity, but it is clear (at least for me ;) that words are not the minimal units of syntax.

dan-zeman commented 6 years ago

Well, UD is quite clear about not drawing dependencies between sub-word units :-)

Besides, it is not always straightforward how to isolate the morpheme that is responsible for cross-referencing the subject (or another argument). If I wanted to design an abstract representation that would account for arguments that are only cross-referenced on the verb OR not represented at all, I would find extra nodes a cleaner solution.