Closed edemattos closed 1 year ago
In UD, each sentence should be in a single tree (with a single sent_id). With spoken input, it is of course often unclear where the sentence boundaries are, but once you (or your sentence segmenter) decide there are multiple sentences in one dialog turn, they should be in separate trees.
As for the example "well we like it but we don't need it", I think it is a single sentence and a typical example of coordination of clauses, so I would use conj for "need".
BTW: there should be no space between "do" and "n't" in the text, and SpaceAfter=No in the MISC column, so that it is more similar to the other (written language) English treebanks.
For deprels, I think the question is what you are considering a grammatical sentence. goeswith is not really appropriate, as it signals an erroneous space in what should be a single word. "Well we like it but we don't need it" can be interpreted as a single well-integrated grammatical sentence, with the two parts linked by conj. For two pieces loosely conjoined into one sentence, the correct relation is parataxis. If you want to encode metadata within a single sentence, that can be done in the MISC column.
BTW: there should be no space between "do" and "n't" in the text, and SpaceAfter=No in the MISC column, so that it is more similar to the other (written language) English treebanks.
Actually "don't" is now represented as a multiword token.
If it is important to preserve the original boundaries of the speech recognizer, an option would be a MISC feature analogous to SpaceAfter=No, e.g. RawSentBoundary=No. I agree with @martinpopel that if you believe things are really separate sentences, they should be encoded as such.
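A minimal sketch of how such a MISC feature could be written, in Python. RawSentBoundary is only the name proposed in this thread, not an established UD feature, and putting it on the last token of a sentence is an assumption made here for illustration. The helper names are hypothetical.

```python
def add_misc(misc, key, value):
    """Add or overwrite key=value in a CoNLL-U MISC string ("_" means empty)."""
    feats = [] if misc == "_" else misc.split("|")
    # Drop any earlier value for the same key, keep everything else in order.
    feats = [f for f in feats if not f.startswith(key + "=")]
    feats.append(f"{key}={value}")
    return "|".join(feats)


def mark_no_raw_boundary(rows):
    """Tag the last token of a sentence (assumed placement).

    rows: list of tokens, each a list of the 10 CoNLL-U column strings.
    """
    rows[-1][9] = add_misc(rows[-1][9], "RawSentBoundary", "No")
    return rows
```

Existing MISC values such as SpaceAfter=No are preserved, so the feature can coexist with whatever the conversion script already stores in that column.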
Actually "don't" is now represented as a multiword token.
Oh yes. I forgot that this change, which I always called for, is finally there in UD 2.8. So
7-8 don't _ _ _ _ _ _ _ _
7 do do AUX VBP _ 9 aux _ _
8 n't not PART RB _ 9 advmod _ _
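For a conversion script, the multiword-token layout above could be generated along these lines. This is only a sketch: the function name and the tuple layout of `parts` are hypothetical, and FEATS, DEPS and MISC are left as "_".

```python
def mwt_lines(start_id, surface, parts):
    """Return tab-separated CoNLL-U lines for one multiword token.

    start_id: ID of the first syntactic word.
    surface:  the surface form, e.g. "don't".
    parts:    list of (form, lemma, upos, xpos, head, deprel) tuples
              (hypothetical layout chosen for this sketch).
    """
    end_id = start_id + len(parts) - 1
    # Range line: ID range and surface form, all other columns empty.
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for i, (form, lemma, upos, xpos, head, deprel) in enumerate(parts):
        lines.append("\t".join([
            str(start_id + i), form, lemma, upos, xpos,
            "_", str(head), deprel, "_", "_",
        ]))
    return lines


print("\n".join(mwt_lines(7, "don't", [
    ("do", "do", "AUX", "VBP", 9, "aux"),
    ("n't", "not", "PART", "RB", 9, "advmod"),
])))
```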
Sorry, I should have mentioned that the corpus (Switchboard NXT) has been automatically converted using an outdated script from 2014, which I have not yet been able to fully update. I will be sure to recover the surface forms and fix any other issues to comply with the latest UD version.
Also, I realize the example I've provided is not ideal because of the coordination. Here is a better one taken directly from the corpus (mind the length and other compliance errors for now):
# sent_id = sw4519_A_0035
# turn_id = t23
# text = uh and one way you know that is that only god can afford it
1 uh _ INTJ UH _ 8 reparandum _ A23|0|122.748|123.517625
2 and _ CCONJ CC _ 8 cc _ A23|0|123.517625|123.917625
3 one _ NUM CD _ 4 nummod _ A23|0|123.917625|124.097625
4 way _ NOUN NN _ 8 nsubj _ A23|0|124.097625|124.237625
5 you _ PRON PRP _ 6 nsubj _ A23|0|124.237625|124.357625
6 know _ VERB VBP _ 4 acl:relcl _ A23|0|124.357625|124.497625
7 that _ PRON DT _ 6 obj _ A23|0|124.497625|124.797625
8 is _ VERB VBZ _ 0 root _ A23|0|124.797625|125.177625
9 that _ SCONJ IN _ 13 mark _ A23|0|125.177625|125.305625
10 only _ ADV RB _ 11 advmod _ A23|0|125.357625|125.647625
11 god _ PROPN NNP _ 13 nsubj _ A23|0|125.647625|125.947625
12 can _ AUX MD _ 13 aux _ A23|0|125.947625|126.097625
13 afford _ VERB VB _ 8 ccomp _ A23|0|126.097625|126.417625
14 it _ PRON PRP _ 13 obj _ A23|0|126.417625|126.575375
# sent_id = sw4519_A_0036
# turn_id = t23
# text = uh so budget is not a problem for us
1 uh _ INTJ UH _ 7 reparandum _ A23|0|129.023375|129.984625
2 so _ ADV RB _ 7 advmod _ A23|0|131.360125|131.883375
3 budget _ NOUN NN _ 7 nsubj _ A23|0|133.042875|133.333375
4 is _ AUX VBZ _ 7 cop _ A23|0|133.333375|133.453375
5 not _ PART RB _ 7 advmod _ A23|0|133.453375|133.613375
6 a _ DET DT _ 7 det _ A23|0|133.613375|133.653375
7 problem _ NOUN NN _ 0 root _ A23|0|133.653375|133.993375
8 for _ ADP IN _ 9 case _ A23|0|133.993375|134.123375
9 us _ PRON PRP _ 7 obl _ A23|0|134.123375|134.533375
# sent_id = sw4519_A_0037
# turn_id = t23
# text = uh at least it has n't been
1 uh _ INTJ UH _ 7 reparandum _ A23|0|135.111375|135.489375
2 at _ ADP IN _ 7 advmod _ A23|0|-1|-1
3 least _ ADJ JJS _ 2 fixed _ A23|0|135.489375|135.739375
4 it _ PRON PRP _ 7 nsubj _ A23|0|135.739375|135.849375
5 has _ AUX VBZ _ 7 aux _ A23|0|135.849375|None
6 n't _ PART RB _ 7 advmod _ A23|0|None|136.169375
7 been _ VERB VBN _ 0 root _ A23|0|136.169375|136.319375
# sent_id = sw4519_A_0038
# turn_id = t23
# text = it may may be at this point
1 it _ PRON PRP _ 7 nsubj _ A23|0|136.319375|136.419375
2 may _ VERB MD _ 1 reparandum _ A23|1|136.419375|136.659375
3 may _ AUX MD _ 7 aux _ A23|0|136.799375|136.959375
4 be _ AUX VB _ 7 cop _ A23|0|136.959375|137.089375
5 at _ ADP IN _ 7 case _ A23|0|137.089375|137.209375
6 this _ DET DT _ 7 det _ A23|0|137.209375|137.429375
7 point _ NOUN NN _ 0 root _ A23|0|137.429375|137.71225
# sent_id = sw4519_A_0039
# turn_id = t23
# text = but uh up until this point it really has n't been
1 but _ CCONJ CC _ 11 cc _ A23|0|137.855375|138.23
2 uh _ INTJ UH _ 1 reparandum _ A23|0|138.23|138.527625
3 up _ ADP IN _ 6 case _ A23|0|138.719375|138.859375
4 until _ ADP IN _ 6 case _ A23|0|138.859375|139.089375
5 this _ DET DT _ 6 det _ A23|0|139.089375|139.30425
6 point _ NOUN NN _ 11 obl _ A23|0|139.30425|139.519375
7 it _ PRON PRP _ 11 nsubj _ A23|0|139.519375|139.589375
8 really _ ADV RB _ 11 advmod _ A23|0|139.589375|139.799375
9 has _ AUX VBZ _ 11 aux _ A23|0|139.799375|None
10 n't _ PART RB _ 11 advmod _ A23|0|None|140.119375
11 been _ VERB VBN _ 0 root _ A23|0|140.119375|140.322
We appreciate that UD requires sentences to be segmented into their own tree, but we are experimenting with segmentation and parsing spoken dialogue using both sentence-based and turn-based sequences. So, in addition to the UD scheme where we already know the sentence boundaries, we are also testing the case where the input would have to be one uninterrupted span:
# text = uh and one way you know that is that only god can afford it uh so budget is not a problem for us uh at least it has n't been it may may be at this point but uh up until this point it really has n't been
One comment is that I don't think "uh" should be reparandum, which is intended for an element overridden by a speech repair. It seems to me like a discourse marker (holding the floor), so I'd probably use discourse.
Other compliance issues aside, if you want to see the five segments as one tree, attach the roots of segments 2-5 to the root of segment 1 via the parataxis relation. You can also define a subtype for this, e.g., parataxis:turn.
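The merge described here could be sketched in Python as follows. It assumes each sentence is a list of 10-column token rows with plain integer IDs (no multiword-token ranges or empty nodes), and the function name is hypothetical.

```python
def merge_turn(sentences, deprel="parataxis:turn"):
    """Merge several sentence trees into one tree for a whole turn.

    sentences: list of sentences, each a list of tokens, each token a
               list of the 10 CoNLL-U column strings.
    The root of the first sentence stays the root; the roots of all
    later sentences are attached to it with `deprel`.
    """
    merged = []
    offset = 0          # number of tokens already emitted
    first_root = None   # new ID of the first sentence's root
    for sent in sentences:
        for row in sent:
            new = list(row)
            new[0] = str(int(row[0]) + offset)
            if row[6] == "0":
                if first_root is None:
                    first_root = new[0]
                else:
                    new[6] = first_root
                    new[7] = deprel
            else:
                new[6] = str(int(row[6]) + offset)
            merged.append(new)
        offset += len(sent)
    return merged
```

The `# text` comment for the merged tree would also need to be rebuilt by joining the original text lines; that step is left out here.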
To add to the option @dan-zeman suggested, if you want to keep the sentence splits but also represent turns for a downstream tool trained on the data, you can also use the speaker and addressee annotations to indicate turn changes. For example:
# speaker = Lenore
# addressee = MaeLynne
# text = So you don't need to go borrow equipment from anybody, to to do the feet?
1 So so INTJ UH _ 5 discourse 5:discourse Discourse=question:1->10
2 you you PRON PRP Case=Nom|Number=Sing|Person=2|PronType=Prs 5 nsubj 5:nsubj|7:nsubj:xsubj|8:nsubj:xsubj Entity=(person-1)
3-4 don't _ _ _ _ _ _ _ _
3 do do AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 5 aux 5:aux _
4 n't not PART RB Polarity=Neg 5 advmod 5:advmod _
...
18 ? ? PUNCT . _ 5 punct 5:punct _
# sent_id = GUM_conversation_blacksmithing-2
# speaker = Lenore
# addressee = MaeLynne
# text = Do the hooves?
1 Do do VERB VB Person=2|VerbForm=Inf 0 root 0:root Discourse=elaboration:3->2|Entity=(event-4
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det Entity=(object-6
3 hooves hoof NOUN NNS Number=Plur 1 obj 1:obj Entity=event-4)object-6)|SpaceAfter=No
4 ? ? PUNCT . _ 1 punct 1:punct _
# sent_id = GUM_conversation_blacksmithing-3
# speaker = MaeLynne
# addressee = Lenore
# text = Well, we're gonna have to find somewhere, to get, something
1 Well well INTJ UH _ 5 discourse 5:discourse Discourse=antithesis:4->9|SpaceAfter=No
2 , , PUNCT , _ 1 punct 1:punct _
3 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 5 nsubj 5:nsubj|7:nsubj:xsubj|9:nsubj:xsubj Entity=(person-7)|SpaceAfter=No
4 're be AUX VBP Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin 5 aux 5:aux _
5-6 gonna _ _ _ _ _ _ _ _
5 gon go VERB VBG Tense=Pres|VerbForm=Part 0 root 0:root _
6 na to PART TO _ 7 mark 7:mark _
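A downstream tool could recover turns from these comments roughly as below. This is a sketch under two assumptions: sentences are separated by blank lines as in standard CoNLL-U, and consecutive sentences with the same speaker belong to one turn. The function names are hypothetical.

```python
from itertools import groupby


def split_sentences(conllu_text):
    """Split CoNLL-U text into sentences (lists of lines) at blank lines."""
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    return [b.splitlines() for b in blocks]


def speaker_of(sentence_lines):
    """Return the value of the '# speaker = ...' comment, or None."""
    prefix = "# speaker = "
    for line in sentence_lines:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def group_turns(conllu_text):
    """Group consecutive same-speaker sentences into (speaker, sentences)."""
    sents = split_sentences(conllu_text)
    return [(spk, list(group)) for spk, group in groupby(sents, key=speaker_of)]
```

Applied to the three sentences above, this would give one turn for Lenore with two sentences, followed by one turn for MaeLynne.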
@edemattos I think that the initial sentence segmentation that has been adopted is fine from the syntactic point of view and conforms to the choices made in the two UD treebanks of spoken language we have developed: UD_French-Spoken and UD_Naija-NSC. (Naija is an English-based pidgincreole spoken by more than 100M people in Nigeria. The treebank is glossed and translated into English.) In some cases we decided to gather two sentences and we used parataxis:conj (rather than parataxis:turn). I find it problematic to gather all the sentences of a speech turn, because some speech turns can be quite long. There are also many different paratactic relations in spoken languages, and we added other subtypes for parataxis:
http://match.grew.fr/?corpus=UD_Naija-NSC@2.8&custom=60e429cc852e5&clustering=e.2
The paper you mention does not contain very useful information concerning the annotation of spoken languages. I advise you to look at the existing treebanks and to see how they have been annotated. You will find many publications concerning the annotation of spoken languages on my webpage, including chapters of a book we published in 2018 on our spoken French treebank (developed before the rise of UD, initially called Rhapsodie, which is now UD_French-Spoken).
I confirm that "uh" must be discourse. We extended the discourse relation to verbal expressions such as "I mean", "you know", etc.: http://match.grew.fr/?corpus=UD_French-Spoken@2.8&custom=60e42a526e044&clustering=DEP.upos
Note that our treebanks are annotated using the SUD annotation scheme and automatically converted to UD. Our guidelines are available here: https://surfacesyntacticud.github.io/guidelines/u/. They include some specific relations for spoken languages. For the numerous lists that are not coordination, we use the relation conj:dicto (as opposed to conj:coord). It is translated in UD into reparandum, but this is not always convenient, because a lot of them are reformulations and not necessarily repairs, and the frontier between coordination and reformulation can be tenuous (It's a corpus…(uh) (or) (maybe) a treebank).
We have converted a corpus of transcribed speech from Penn Treebank to UD, and we would like to represent contiguous sentences that occur as part of a dialogue turn in order to faithfully represent the output transcription of a speech recognition system. For example, given the following two sentences:
we would like to be able to represent them as a single turn, something like:
Is there any advice on how to achieve this? We have found a recent annotation scheme building on UD for spoken dialogue (Davidson et al., 2019) but it seems they do not consider such a case.
I think any existing coordination relation would be inappropriate since it would blur the line between true coordination and actual segmented sentences. So, above I've simply replaced the root of the second sentence with a goeswith:turn relation coming directly from the root of the first sentence. Any subsequent sentence in the same turn would be a dependent of the preceding sentence. In the example above, a third sentence's root would have head 9.
Perhaps there is a better, more canonical way of going about this? Any guidance is appreciated!