UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Representing multiple sentences as one dialogue turn. #794

Closed edemattos closed 1 year ago

edemattos commented 3 years ago

We have converted a corpus of transcribed speech from Penn Treebank to UD, and we would like to represent contiguous sentences that occur as part of a dialogue turn in order to faithfully represent the output transcription of a speech recognition system. For example, given the following two sentences:

# sent_id = doc1_0001
# turn_id = t1
# text = well we like it
1   well    _   INTJ    UH  _   3   reparandum  _   _
2   we  _   PRON    PRP _   3   nsubj   _   _
3   like    _   VERB    VBP _   0   root    _   _
4   it  _   PRON    PRP _   3   obj _   _

# sent_id = doc1_0002
# turn_id = t1
# text = but we do n't need it
1   but _   CCONJ   CC  _   5   cc  _   _
2   we  _   PRON    PRP _   5   nsubj   _   _
3   do  _   AUX VBP _   5   aux _   _
4   n't _   PART    RB  _   5   advmod  _   _
5   need    _   VERB    VB  _   0   root    _   _
6   it  _   PRON    PRP _   5   obj _   _

we would like to be able to represent them as a single turn, something like:

# sent_id = doc1_t1
# text = well we like it but we do n't need it
1   well    _   INTJ    UH  _   3   reparandum  _   _
2   we  _   PRON    PRP _   3   nsubj   _   _
3   like    _   VERB    VBP _   0   root    _   _
4   it  _   PRON    PRP _   3   obj _   _
5   but _   CCONJ   CC  _   9   cc  _   _
6   we  _   PRON    PRP _   9   nsubj   _   _
7   do  _   AUX VBP _   9   aux _   _
8   n't _   PART    RB  _   9   advmod  _   _
9   need    _   VERB    VB  _   3   goeswith:turn   _   _
10  it  _   PRON    PRP _   9   obj _   _

Is there any advice on how to achieve this? We have found a recent annotation scheme building on UD for spoken dialogue (Davidson et al., 2019) but it seems they do not consider such a case.

I think any existing coordination relation would be inappropriate since it would blur the line between true coordination and actual segmented sentences. So, above I've simply replaced the root of the second sentence with a goeswith:turn relation coming directly from the root of the first sentence. Any subsequent sentence in the same turn would be a dependent of the preceding sentence. In the example above, a third sentence's root would have head 9.

Perhaps there is a better, more canonical way of going about this? Any guidance is appreciated!

martinpopel commented 3 years ago

In UD, each sentence should be in a single tree (with a single sent_id). With spoken input, it is of course often unclear where the sentence boundaries are, but once you (or your sentence segmenter) decide there are multiple sentences in one dialog turn, they should be in separate trees.

As for the example "well we like it but we don't need it", I think it is a single sentence and a typical example of coordination of clauses, so I would use conj for "need". BTW: there should be no space between "do" and "n't" in the text and SpaceAfter=No in the MISC column, so it is more similar to the other (written language) English treebanks.

nschneid commented 3 years ago

For deprels I think the question is what you are considering a grammatical sentence. goeswith is not really appropriate as it signals an erroneous space in what should be a single word. "Well we like it but we don't need it" can be interpreted as a single well-integrated grammatical sentence, with the two parts linked by conj. For two pieces loosely conjoined into one sentence, the correct relation is parataxis. If you want to encode metadata within a single sentence, that can be done in the MISC column.

nschneid commented 3 years ago

> BTW: there should be no space between "do" and "n't" in the text and SpaceAfter=No in the MISC column, so it is more similar to the other (written language) English treebanks.

Actually "don't" is now represented as a multiword token.

nschneid commented 3 years ago

If it is important to preserve the original boundaries of the speech recognizer, an option would be a MISC feature analogous to SpaceAfter=No, e.g. RawSentBoundary=No. I agree with @martinpopel that if you believe things are really separate sentences, they should be encoded as such.

martinpopel commented 3 years ago

> Actually "don't" is now represented as a multiword token.

Oh yes. I forgot that this change, which I always called for, is finally there in UD 2.8. So:

7-8 don't   _   _   _   _   _   _   _   _
7   do  do  AUX VBP _   9   aux _   _
8   n't not PART    RB  _   9   advmod  _   _
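A minimal sketch of how a converter might add such a range line, assuming each token is a list of the ten CoNLL-U columns and that the contraction spans exactly two adjacent tokens. The function name `add_mwt_line` is hypothetical and not part of any UD tool.

```python
def add_mwt_line(rows, first_id, surface):
    """Insert a multiword-token range line before token `first_id`.

    The range covers `first_id` and the token right after it, e.g. 7-8
    for "do" + "n't" -> "don't". Only ID and FORM are filled in; the
    other eight columns are "_", as in the example above.
    """
    out = []
    for row in rows:
        if row[0] == str(first_id):
            out.append([f"{first_id}-{first_id + 1}", surface] + ["_"] * 8)
        out.append(list(row))
    return out


tokens = [
    "7 do do AUX VBP _ 9 aux _ _".split(),
    "8 n't not PART RB _ 9 advmod _ _".split(),
]
for row in add_mwt_line(tokens, 7, "don't"):
    print("\t".join(row))
```

The `# text` comment would still need to be rewritten separately so that it contains "don't" rather than "do n't".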

edemattos commented 3 years ago

Sorry, I should have mentioned that the corpus (Switchboard NXT) has been automatically converted using an outdated script from 2014, which I have not yet been able to fully update. I will be sure to recover the surface forms and fix any other issues to comply with the latest UD version.

Also, I realize the example I've provided is not ideal because of the coordination. Here is a better one taken directly from the corpus (mind the length and other compliance errors for now):

# sent_id = sw4519_A_0035
# turn_id = t23
# text = uh and one way you know that is that only god can afford it
1       uh      _       INTJ    UH      _       8       reparandum      _       A23|0|122.748|123.517625
2       and     _       CCONJ   CC      _       8       cc      _       A23|0|123.517625|123.917625
3       one     _       NUM     CD      _       4       nummod  _       A23|0|123.917625|124.097625
4       way     _       NOUN    NN      _       8       nsubj   _       A23|0|124.097625|124.237625
5       you     _       PRON    PRP     _       6       nsubj   _       A23|0|124.237625|124.357625
6       know    _       VERB    VBP     _       4       acl:relcl       _       A23|0|124.357625|124.497625
7       that    _       PRON    DT      _       6       obj     _       A23|0|124.497625|124.797625
8       is      _       VERB    VBZ     _       0       root    _       A23|0|124.797625|125.177625
9       that    _       SCONJ   IN      _       13      mark    _       A23|0|125.177625|125.305625
10      only    _       ADV     RB      _       11      advmod  _       A23|0|125.357625|125.647625
11      god     _       PROPN   NNP     _       13      nsubj   _       A23|0|125.647625|125.947625
12      can     _       AUX     MD      _       13      aux     _       A23|0|125.947625|126.097625
13      afford  _       VERB    VB      _       8       ccomp   _       A23|0|126.097625|126.417625
14      it      _       PRON    PRP     _       13      obj     _       A23|0|126.417625|126.575375

# sent_id = sw4519_A_0036
# turn_id = t23
# text = uh so budget is not a problem for us
1       uh      _       INTJ    UH      _       7       reparandum      _       A23|0|129.023375|129.984625
2       so      _       ADV     RB      _       7       advmod  _       A23|0|131.360125|131.883375
3       budget  _       NOUN    NN      _       7       nsubj   _       A23|0|133.042875|133.333375
4       is      _       AUX     VBZ     _       7       cop     _       A23|0|133.333375|133.453375
5       not     _       PART    RB      _       7       advmod  _       A23|0|133.453375|133.613375
6       a       _       DET     DT      _       7       det     _       A23|0|133.613375|133.653375
7       problem _       NOUN    NN      _       0       root    _       A23|0|133.653375|133.993375
8       for     _       ADP     IN      _       9       case    _       A23|0|133.993375|134.123375
9       us      _       PRON    PRP     _       7       obl     _       A23|0|134.123375|134.533375

# sent_id = sw4519_A_0037
# turn_id = t23
# text = uh at least it has n't been
1       uh      _       INTJ    UH      _       7       reparandum      _       A23|0|135.111375|135.489375
2       at      _       ADP     IN      _       7       advmod  _       A23|0|-1|-1
3       least   _       ADJ     JJS     _       2       fixed   _       A23|0|135.489375|135.739375
4       it      _       PRON    PRP     _       7       nsubj   _       A23|0|135.739375|135.849375
5       has     _       AUX     VBZ     _       7       aux     _       A23|0|135.849375|None
6       n't     _       PART    RB      _       7       advmod  _       A23|0|None|136.169375
7       been    _       VERB    VBN     _       0       root    _       A23|0|136.169375|136.319375

# sent_id = sw4519_A_0038
# turn_id = t23
# text = it may may be at this point
1       it      _       PRON    PRP     _       7       nsubj   _       A23|0|136.319375|136.419375
2       may     _       VERB    MD      _       1       reparandum      _       A23|1|136.419375|136.659375
3       may     _       AUX     MD      _       7       aux     _       A23|0|136.799375|136.959375
4       be      _       AUX     VB      _       7       cop     _       A23|0|136.959375|137.089375
5       at      _       ADP     IN      _       7       case    _       A23|0|137.089375|137.209375
6       this    _       DET     DT      _       7       det     _       A23|0|137.209375|137.429375
7       point   _       NOUN    NN      _       0       root    _       A23|0|137.429375|137.71225

# sent_id = sw4519_A_0039
# turn_id = t23
# text = but uh up until this point it really has n't been
1       but     _       CCONJ   CC      _       11      cc      _       A23|0|137.855375|138.23
2       uh      _       INTJ    UH      _       1       reparandum      _       A23|0|138.23|138.527625
3       up      _       ADP     IN      _       6       case    _       A23|0|138.719375|138.859375
4       until   _       ADP     IN      _       6       case    _       A23|0|138.859375|139.089375
5       this    _       DET     DT      _       6       det     _       A23|0|139.089375|139.30425
6       point   _       NOUN    NN      _       11      obl     _       A23|0|139.30425|139.519375
7       it      _       PRON    PRP     _       11      nsubj   _       A23|0|139.519375|139.589375
8       really  _       ADV     RB      _       11      advmod  _       A23|0|139.589375|139.799375
9       has     _       AUX     VBZ     _       11      aux     _       A23|0|139.799375|None
10      n't     _       PART    RB      _       11      advmod  _       A23|0|None|140.119375
11      been    _       VERB    VBN     _       0       root    _       A23|0|140.119375|140.322

We appreciate that UD requires each sentence to be segmented into its own tree, but we are experimenting with segmenting and parsing spoken dialogue using both sentence-based and turn-based sequences. So, in addition to the UD scheme where we already know the sentence boundaries, we are also testing the case where the input would have to be one uninterrupted span:

# text = uh and one way you know that is that only god can afford it uh so budget is not a problem for us uh at least it has n't been it may may be at this point but uh up until this point it really has n't been
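Such a turn-level text can be derived from the sentence-level file. The sketch below assumes that sentences are separated by blank lines, that each carries `# turn_id` and `# text` comments as in the examples above, and that turn ids are unique within the input. The function name `turn_texts` is hypothetical.

```python
def turn_texts(conllu_text):
    """Map each turn_id to the space-joined text of its sentences."""
    turns = {}  # dicts keep insertion order, so turns stay in corpus order
    for block in conllu_text.strip().split("\n\n"):
        meta = {}
        for line in block.splitlines():
            # Comment lines look like "# key = value"; token lines are skipped.
            if line.startswith("#") and "=" in line:
                key, value = line[1:].split("=", 1)
                meta[key.strip()] = value.strip()
        if "turn_id" in meta and "text" in meta:
            turns.setdefault(meta["turn_id"], []).append(meta["text"])
    return {turn: " ".join(texts) for turn, texts in turns.items()}
```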

nschneid commented 3 years ago

One comment is that I don't think "uh" should be reparandum, which is intended for an element overridden by a speech repair. It seems to me like a discourse marker (holding the floor), so I'd probably use discourse.

dan-zeman commented 3 years ago

Other compliance issues aside, if you want to see the five segments as one tree, attach the roots of segments 2-5 to the root of segment 1 via the parataxis relation. You can also define a subtype for this, e.g., parataxis:turn.
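A minimal sketch of that attachment, assuming each sentence is a list of ten-column token rows with plain integer IDs (no multiword-token ranges or empty nodes) and exactly one root. The function name `merge_turn` is hypothetical; the DEPS column and the `# text` comment are not updated here.

```python
def merge_turn(sentences, deprel="parataxis:turn"):
    """Merge the sentences of one turn into a single tree.

    Token IDs and heads are shifted by the number of tokens already
    emitted. The root of the first segment stays the root; the root of
    every later segment is attached to it with `deprel`.
    """
    merged = []
    first_root = None
    offset = 0
    for rows in sentences:
        for row in rows:
            cols = list(row)
            tok_id, head = int(cols[0]), int(cols[6])
            cols[0] = str(tok_id + offset)
            if head == 0:
                if first_root is None:
                    first_root = tok_id + offset  # segment 1 keeps HEAD 0
                else:
                    cols[6] = str(first_root)
                    cols[7] = deprel
            else:
                cols[6] = str(head + offset)
            merged.append(cols)
        offset += len(rows)
    return merged
```

Applied to the two sentences from the original post, this gives "need" the ID 9, the head 3 and the relation parataxis:turn, while "but", "we", "do", "n't" and the second "it" all point to 9.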

amir-zeldes commented 3 years ago

To add to the option @dan-zeman suggested: if you want to keep the sentence splits but also represent turns for a downstream tool trained on the data, you can use the speaker and addressee annotations to indicate turn changes. For example:

https://github.com/UniversalDependencies/UD_English-GUM/blob/master/not-to-release/sources/GUM_conversation_blacksmithing.conllu#L12-L55

# speaker = Lenore
# addressee = MaeLynne
# text = So you don't need to go borrow equipment from anybody, to to do the feet?
1   So  so  INTJ    UH  _   5   discourse   5:discourse Discourse=question:1->10
2   you you PRON    PRP Case=Nom|Number=Sing|Person=2|PronType=Prs  5   nsubj   5:nsubj|7:nsubj:xsubj|8:nsubj:xsubj Entity=(person-1)
3-4 don't   _   _   _   _   _   _   _   _
3   do  do  AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   5   aux 5:aux   _
4   n't not PART    RB  Polarity=Neg    5   advmod  5:advmod    _
...
18  ?   ?   PUNCT   .   _   5   punct   5:punct _

# sent_id = GUM_conversation_blacksmithing-2
# speaker = Lenore
# addressee = MaeLynne
# text = Do the hooves?
1   Do  do  VERB    VB  Person=2|VerbForm=Inf   0   root    0:root  Discourse=elaboration:3->2|Entity=(event-4
2   the the DET DT  Definite=Def|PronType=Art   3   det 3:det   Entity=(object-6
3   hooves  hoof    NOUN    NNS Number=Plur 1   obj 1:obj   Entity=event-4)object-6)|SpaceAfter=No
4   ?   ?   PUNCT   .   _   1   punct   1:punct _

# sent_id = GUM_conversation_blacksmithing-3
# speaker = MaeLynne
# addressee = Lenore
# text = Well, we're gonna have to find somewhere, to get, something
1   Well    well    INTJ    UH  _   5   discourse   5:discourse Discourse=antithesis:4->9|SpaceAfter=No
2   ,   ,   PUNCT   ,   _   1   punct   1:punct _
3   we  we  PRON    PRP Case=Nom|Number=Plur|Person=1|PronType=Prs  5   nsubj   5:nsubj|7:nsubj:xsubj|9:nsubj:xsubj Entity=(person-7)|SpaceAfter=No
4   're be  AUX VBP Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin   5   aux 5:aux   _
5-6 gonna   _   _   _   _   _   _   _   _
5   gon go  VERB    VBG Tense=Pres|VerbForm=Part    0   root    0:root  _
6   na  to  PART    TO  _   7   mark    7:mark  _
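A downstream tool could recover turn indices from these comments. The sketch below treats every change of the `# speaker` value as a turn boundary, which is an assumption and not something GUM prescribes. It also assumes that sentences are separated by blank lines, and the function name `turn_ids_from_speakers` is hypothetical.

```python
def turn_ids_from_speakers(conllu_text):
    """Return one turn index per sentence.

    A new turn starts whenever the `# speaker = ...` comment differs
    from that of the previous sentence.
    """
    turns = []
    previous = None
    turn = 0
    for block in conllu_text.strip().split("\n\n"):
        speaker = None
        for line in block.splitlines():
            if line.startswith("# speaker ="):
                speaker = line.split("=", 1)[1].strip()
        if previous is not None and speaker != previous:
            turn += 1
        previous = speaker
        turns.append(turn)
    return turns
```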

sylvainkahane commented 3 years ago

@edemattos I think that the initial sentence segmentation that has been adopted is fine from the syntactic point of view and conforms to the choices made in the two UD treebanks of spoken language we have developed: UD_French-Spoken and UD_Naija-NSC. (Naija is an English-based pidgincreole spoken by more than 100M people in Nigeria. The treebank is glossed and translated into English.) In some cases we decided to gather two sentences, and we used parataxis:conj (rather than parataxis:turn). I find it problematic to gather all the sentences of a speech turn, because some speech turns can be quite long. There are also many different paratactic relations in spoken languages, and we added other subtypes for parataxis: http://match.grew.fr/?corpus=UD_Naija-NSC@2.8&custom=60e429cc852e5&clustering=e.2

The paper you mention does not contain very useful information concerning the annotation of spoken languages. I advise you to look at the existing treebanks and see how they have been annotated. You will find many publications concerning the annotation of spoken languages on my webpage, including chapters of a book we published in 2018 on our spoken French treebank (developed before the rise of UD and initially called Rhapsodie, it is now UD_French-Spoken).

I confirm that uh must be discourse. We extended the discourse relation to verbal expressions such as I mean, you know, etc.: http://match.grew.fr/?corpus=UD_French-Spoken@2.8&custom=60e42a526e044&clustering=DEP.upos

Note that our treebanks are annotated using the SUD annotation scheme and automatically converted to UD. Our guidelines are available here: https://surfacesyntacticud.github.io/guidelines/u/. They include some specific relations for spoken languages. For the numerous lists that are not coordination, we use the relation conj:dicto (as opposed to conj:coord). It is translated in UD into reparandum, but this is not always convenient, because a lot of them are reformulations and not necessarily repairs, and the frontier between coordination and reformulation can be tenuous (It's a corpus…(uh) (or) (maybe) a treebank).