clarin-eric / parla-clarin

Schema for modelling parliamentary debates
https://clarin-eric.github.io/parla-clarin/

attributes start and end not allowed in seg element #12

Closed matyaskopp closed 3 years ago

matyaskopp commented 3 years ago

I need to create sentence-level (<s>) synchronization with the audio track, but I think I have found a bug in the parla-clarin docs, and I have no suitable solution to my issue...

https://github.com/clarin-eric/parla-clarin/blob/261f8cb50c4a7be796e8185f294b216d32cb1561/Schema/parla-clarin-odd.xml#L1618-L1620

According to the TEI P5 Guidelines, @start and @end are not allowed on <seg>: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.timed.html

bansp commented 3 years ago

I have a customization that I was (still am) going to propose as a Feature Request to the TEI Council, via ISO MAF. @TomazErjavec -- if you'd like to use it in the Parla-Clarin ODD, I'd be super happy because that would add to the range of real-life cases that the standard encodes.

<classSpec xmlns="http://www.tei-c.org/ns/1.0" module="tei" type="atts" ident="att.offset" mode="add">

            <desc versionDate="2020-02-06" xml:lang="en">provides attributes for specifying the beginning
              and end of a linguistic or textual segment, by addressing the character offsets.</desc>
            <classes/>

            <constraintSpec ident="endPosnotstartPos" scheme="schematron"
              xmlns:sch="http://purl.oclc.org/dsdl/schematron">
              <!-- paraphrased from span.xml-->
              <constraint>
                <sch:rule context="*[@endPos]">
                  <sch:assert test="@startPos">If @endPos is supplied on <sch:name/>, @startPos must
                    be supplied as well</sch:assert>
                </sch:rule>
              </constraint>
            </constraintSpec>

            <constraintSpec ident="startPosnotendPos" scheme="schematron"
              xmlns:sch="http://purl.oclc.org/dsdl/schematron">
              <!-- paraphrased from span.xml-->
              <constraint>
                <sch:rule context="*[@startPos]">
                  <sch:assert test="@endPos">If @startPos is supplied on
                    <sch:name/>, @endPos must be supplied as well</sch:assert></sch:rule>
              </constraint>
            </constraintSpec>

            <constraintSpec ident="offsetBaseObligatory" scheme="schematron"
              xmlns:sch="http://purl.oclc.org/dsdl/schematron">
              <constraint>
                <sch:rule context="*[@endPos]"><sch:assert test="ancestor-or-self::*[@offsetBase]"
                    >offset attributes require @offsetBase to be defined in the same element or its
                    ancestor</sch:assert></sch:rule>
              </constraint>
            </constraintSpec>

            <attList>
              <attDef ident="offsetBase" usage="opt">
                <desc versionDate="2020-02-06" xml:lang="en">points at the element that forms the
                  basis for offset calculations in standoff annotations. An element using the
                    <att>startPos</att> and <att>endPos</att> attributes either has to define
                    <att>offsetBase</att> as well, or <att>offsetBase</att> should be defined on an
                  ancestor element.</desc>
                <datatype>
                  <dataRef key="teidata.pointer"/>
                </datatype>
              </attDef>
              <attDef ident="startPos" usage="opt">
                <desc versionDate="2020-02-06" xml:lang="en">specifies the starting point of a sequence of characters
                  or bytes, or of elements that can be pointed at with a URI.</desc>
                <datatype minOccurs="1" maxOccurs="1">
                  <dataRef key="teidata.count"/>
                </datatype>
              </attDef>
              <attDef ident="endPos" usage="opt">
                <desc versionDate="2020-02-06" xml:lang="en">specifies the end-point of a sequence of characters
                  or bytes, or of elements that can be pointed at with a URI.</desc>
                <datatype minOccurs="1" maxOccurs="1">
                  <dataRef key="teidata.count"/>
                </datatype>
              </attDef>
            </attList>
            <exemplum xml:lang="en">
              <p>The example below comes from a part of the CoMParS (Collection of Multi-lingual Parallel
                Sequences) project and presents a fragment of a monolingual subcorpus of German.</p>
              <p>The individual sequences (in this case, a sentence) are listed in the <gi>text</gi> part
                of the corpus, while the linguistic analysis is performed in the <gi>standOff</gi> part,
                which consists, among others, of segmentation information. CoMParS adheres to ISO LAF
                principles and uses inter-character points with the indexing starting at 0.</p>

              <!--  @valid should become "true" once the modification is incorporated into the Guidelines   -->
              <egXML xmlns="http://www.tei-c.org/ns/Examples" valid="feasible">
                <text xml:lang="de">
                  <body>
                    <ab xml:id="deu-ab1" n="1">Ich habe mich im Winter in dich verliebt.</ab>
                  </body>
                </text>
                <!--        
'I'c'h' 'h'a'b'e' 'm'i'c'h' 'i'm' 'W'i'n't'e'r' 'i'n' 'd'i'c'h' 'v'e'r'l'i'e'b't'.'
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  -->
                <standOff xmlns="">
                  <listAnnotation n="1" offsetBase="#deu-ab1" type="sequence">
                    <listAnnotation type="segmentation">
                      <seg startPos="0" endPos="3" xml:id="deu-ab1tok1">Ich</seg>
                      <seg startPos="4" endPos="8" xml:id="deu-ab1tok2">habe</seg>
                      <seg startPos="9" endPos="13" xml:id="deu-ab1tok3">mich</seg>
                      <seg startPos="14" endPos="16" xml:id="deu-ab1tok4">im</seg>
                      <seg startPos="17" endPos="23" xml:id="deu-ab1tok5">Winter</seg>
                      <seg startPos="24" endPos="26" xml:id="deu-ab1tok6">in</seg>
                      <seg startPos="27" endPos="31" xml:id="deu-ab1tok7">dich</seg>
                      <seg startPos="32" endPos="40" xml:id="deu-ab1tok8">verliebt</seg>
                      <seg startPos="40" endPos="41" xml:id="deu-ab1tok9">.</seg>
                    </listAnnotation>
                  </listAnnotation>
                </standOff>
              </egXML>
              <p>Segmentation information gathered above is subsequently used by all other (numerous) annotation layers.</p>
              <p>The CoMParS ODD contains the following statements that include <gi>seg</gi> and
                  <gi>listAnnotation</gi> into the att.offset class: <egXML
                  xmlns="http://www.tei-c.org/ns/Examples" valid="true">
                  <elementSpec ident="seg" module="linking" mode="change">
                    <classes mode="change">
                      <memberOf key="att.offset"/>
                    </classes>
                  </elementSpec>
                    <elementSpec ident="listAnnotation" module="spoken" mode="change">
                      <classes mode="change">
                        <memberOf key="att.offset"/>
                      </classes>
                    </elementSpec>
                </egXML>
              </p>
            </exemplum>
            <remarks versionDate="2017-03-21" xml:lang="en">
              <p>Two options are possible (and practiced) for the start index. Some systems assume
                that indexing starts with 0, some assume that the initial index value is 1. This
                decision should be documented in the header, together with other project-specific
                encoding decisions. Linguistic analysis in the ISO LAF (Linguistic Annotation
                Framework, ISO 24612), MAF (Morphosyntactic Annotation Framework, ISO 24611), as
                well as W3C XPointer assume inter-character points and indices starting at 0. W3C
                XPath counts characters, beginning at 1.</p>
            </remarks>
            <listRef>
              <ptr target="#STECAT"/>
            </listRef>
          </classSpec>

Aha, that formulation precedes the recent changes in the Guidelines, so some tweaking may be in order. But if the general direction is found interesting for Parla-CL, I can take care of the details and synchronization (in fact, that would be a nice prompt for me to push ISO MAF ahead).

Two words on naming the attributes: a version of this used to be in my att.referring ticket, which I withdrew at some point. That ticket overloaded the existing @from and @to attributes, but on Laurent Romary's advice I gave up trying to talk the Council into overloading the names. This suggestion doesn't touch the other existing pair, @start and @end, which have different definitions; instead it uses the safe pair @startPos and @endPos, and assumes a straightforward mapping into GrAF and friends.

In MAF, <span> and <seg> are added to it, and it's straightforward to treat <w> likewise, should someone desire to.
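The offset convention in the example above (inter-character points, indexing from 0) happens to coincide with Python slicing, so the `startPos`/`endPos` pairs can be sanity-checked with a few lines. This is only a minimal sketch: the helper name `mismatched_offsets` is hypothetical, and the segment list is copied by hand from the example rather than parsed from the XML.

```python
# Base text and segments copied from the CoMParS example above.
BASE_TEXT = "Ich habe mich im Winter in dich verliebt."
SEGMENTS = [
    (0, 3, "Ich"),
    (4, 8, "habe"),
    (9, 13, "mich"),
    (14, 16, "im"),
    (17, 23, "Winter"),
    (24, 26, "in"),
    (27, 31, "dich"),
    (32, 40, "verliebt"),
    (40, 41, "."),
]


def mismatched_offsets(base_text, segments):
    """Return the segments whose (startPos, endPos) span does not match their form.

    Hypothetical helper: offsets are read as inter-character points
    starting at 0, which is also how Python slices are indexed.
    """
    return [
        (start, end, form)
        for start, end, form in segments
        if base_text[start:end] != form
    ]


# An empty list means every segment lines up with the base text.
print(mismatched_offsets(BASE_TEXT, SEGMENTS))
```

A project that indexes from 1 instead would need to subtract one from each start offset before slicing, which is one reason the remarks ask for the convention to be documented in the header.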

TomazErjavec commented 3 years ago

I need to create sentence-level (<s>) synchronization with the audio track, but I think I have found a bug in the parla-clarin docs

Indeed you did! I have now fixed the text.

and I have no suitable solution to my issue...

With this I don't quite agree, although the solution is somewhat implicit in the text and examples (I fixed this now, at least a bit), i.e. use the <anchor> element. In fact, in the GosVL corpus we use it for exactly the case you have:

         <u xml:id="GosVL01_pravo.u2"
            who="#Bm-vl007"
            start="#GosVL01_pravo.t1"
            end="#GosVL01_pravo.t20">
            <anchor synch="#GosVL01_pravo.t1"/>
            <seg xml:id="GosVL01_pravo.s1">
               <w xml:id="GosVL01_pravo.s1.w1" lemma="hvala" ana="mte:Ncfsn">hvala</w>
               <w xml:id="GosVL01_pravo.s1.w2" lemma="za" ana="mte:Sa">za</w>
               <w xml:id="GosVL01_pravo.s1.w3" lemma="beseda" ana="mte:Ncfsa">besedo</w>
               <vocal type="voice"/>
            </seg>
            <anchor synch="#GosVL01_pravo.t2"/>
            <seg xml:id="GosVL01_pravo.s2">
               <w xml:id="GosVL01_pravo.s2.w1" lemma="lep" ana="mte:Agpmsan">lep</w>
               <w xml:id="GosVL01_pravo.s2.w2" lemma="pozdrav" ana="mte:Ncmsan">pozdrav</w>
               <w xml:id="GosVL01_pravo.s2.w3" lemma="ves" ana="mte:Pg-mpd">vsem</w>
               <w xml:id="GosVL01_pravo.s2.w4" lemma="skupaj" ana="mte:Rgp">skupaj</w>
            </seg>
            ...
          </u>

But I agree it is not the most elegant of solutions. Maybe it is worth a ticket for TEI, i.e. that att.timed should be allowed also on some analysis elements, in particular seg, s, w?

TomazErjavec commented 3 years ago

I have a customization that I was (still am) going to propose as a Feature Request to the TEI Council, via ISO MAF. @TomazErjavec -- if you'd like to use it in the Parla-Clarin ODD, I'd be super happy because that would add to the range of real-life cases that the standard encodes.

@bansp, as far as I can see, @matyaskopp is asking about how to synchronise the audio with the transcription, but your proposal deals with pointing using character offsets from one piece of text to another, so, not quite sure how it is related. For audio we already have adequate attributes, and there is no need to introduce new ones.

Which does not mean that I think your proposal is not needed, just maybe not in this context.

bansp commented 3 years ago

Ouch, my bad, and I actually have no excuse for misunderstanding Matyáš, because he's very clear about the context (as I can see upon re-reading). Thanks for correcting me :-)

matyaskopp commented 3 years ago

With this I don't quite agree, although the solution is somewhat implicit in the text and examples (I fixed this now, at least a bit), i.e. use the <anchor> element. In fact, in the GosVL corpus we use it for exactly the case you have

@TomazErjavec, yes, <anchor> is definitely a solution. But I am not sure I like it. I want to represent the start and end time points of each sentence. So the solution with <anchor> would be:

<u>
  <seg>
    <anchor synch="#s1.start" />
    <s xml:id="s1">
      <w xml:id="s1.w1">First</w>
      <w xml:id="s1.w2">sentence</w>
    </s>
    <anchor synch="#s1.end" />
    <anchor synch="#s2.start" />
    <s xml:id="s2">
      <w xml:id="s2.w1">First</w>
      <w xml:id="s2.w2">longer</w>
      <w xml:id="s2.w3">sentence</w>
    </s>
    <anchor synch="#s2.end" />
 </seg>
</u>

which looks strange. That's why I said that I do not have a suitable solution.

I have been thinking about using <standOff> annotations, which would let me work around the missing att.timed attributes on the sentence element, but that involves more jumping between XML elements...

<standOff type="sentTiming">
  <annotationBlock start="#s1.start" end="#s1.end">
      <spanGrp type="sentence"><span from="#s1.w1" to="#s1.w2" /></spanGrp>
  </annotationBlock>
  <annotationBlock start="#s2.start" end="#s2.end">
      <spanGrp type="sentence"><span from="#s2.w1" to="#s2.w3" /></spanGrp>
  </annotationBlock>
  <timeline xml:id="TL" unit="ms">
      <when xml:id="TL.t0" absolute="2020-11-19T08:08:00"/>
      <when xml:id="s1.start" interval="3400" since="#TL.t0"/>
      <when xml:id="s1.end" interval="4000" since="#TL.t0"/>
      <when xml:id="s2.start" interval="4800" since="#TL.t0"/>
      <when xml:id="s2.end" interval="5200" since="#TL.t0"/>
  </timeline>
</standOff>
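To see what resolving such stand-off timing involves, here is a minimal sketch using only the Python standard library. It assumes that every pointer is written with a leading `#`, that all `when` elements are measured from the same origin (`#TL.t0`), and that the elements are in no namespace, as in the snippet above. The function name `sentence_timings` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

STANDOFF = """
<standOff type="sentTiming">
  <annotationBlock start="#s1.start" end="#s1.end">
    <spanGrp type="sentence"><span from="#s1.w1" to="#s1.w2"/></spanGrp>
  </annotationBlock>
  <annotationBlock start="#s2.start" end="#s2.end">
    <spanGrp type="sentence"><span from="#s2.w1" to="#s2.w3"/></spanGrp>
  </annotationBlock>
  <timeline xml:id="TL" unit="ms">
    <when xml:id="TL.t0" absolute="2020-11-19T08:08:00"/>
    <when xml:id="s1.start" interval="3400" since="#TL.t0"/>
    <when xml:id="s1.end" interval="4000" since="#TL.t0"/>
    <when xml:id="s2.start" interval="4800" since="#TL.t0"/>
    <when xml:id="s2.end" interval="5200" since="#TL.t0"/>
  </timeline>
</standOff>
"""


def sentence_timings(standoff_xml):
    """Return (from, to, start_ms, end_ms) for each annotationBlock."""
    root = ET.fromstring(standoff_xml)
    # Offset in ms of every <when> that carries an interval
    # (assumption: all intervals are measured from the same origin).
    offsets = {
        when.get(XML_ID): int(when.get("interval"))
        for when in root.iter("when")
        if when.get("interval") is not None
    }
    timings = []
    for block in root.iter("annotationBlock"):
        span = block.find("spanGrp/span")
        start = offsets[block.get("start").lstrip("#")]
        end = offsets[block.get("end").lstrip("#")]
        timings.append((span.get("from"), span.get("to"), start, end))
    return timings


for first, last, start, end in sentence_timings(STANDOFF):
    print(first, last, start, end, end - start)
```

Each sentence needs two lookups (block to timeline, span to words) before its timing can be attached to the text, which is the "jumping" mentioned above.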

TomazErjavec commented 3 years ago

So the solution with <anchor> would be [two anchors in a row] which looks strange.

It doesn't look strange to me, well, not any stranger than the solution with one anchor - it's a completely logical extension of the original approach. Note that it is also easy (I think) to XPath when a seg starts and ends.

have been thinking about using <standOff> annotations

I've never used standoff, but at first glance what you propose seems fine, except that it is a lot more verbose than the two-anchor solution and, I think, makes resolving the seg start and end more difficult to implement. Anyway, I wouldn't introduce standoff into ParlaMint just for one edge case. I would gladly introduce anchors, if you think you need them.
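On the point about XPath: with a full XPath 1.0 engine, the anchors around a sentence could presumably be reached with `preceding-sibling::*[1][self::anchor]/@synch` and `following-sibling::*[1][self::anchor]/@synch`. The sketch below does the equivalent with the Python standard library, whose `ElementTree` XPath subset has no sibling axes, so it walks each child list instead. The function name `outside_anchor_spans` is hypothetical and the input is a trimmed, namespace-free version of the two-anchor example.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE = """
<u>
  <seg>
    <anchor synch="#s1.start"/>
    <s xml:id="s1"><w>First</w><w>sentence</w></s>
    <anchor synch="#s1.end"/>
    <anchor synch="#s2.start"/>
    <s xml:id="s2"><w>First</w><w>longer</w><w>sentence</w></s>
    <anchor synch="#s2.end"/>
  </seg>
</u>
"""


def synch_if_anchor(element):
    """Return the synch pointer if the element is an <anchor>, else None."""
    if element is not None and element.tag == "anchor":
        return element.get("synch")
    return None


def outside_anchor_spans(utterance_xml):
    """Map each <s> id to the (start, end) pointers of its adjacent anchors."""
    root = ET.fromstring(utterance_xml)
    spans = {}
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != "s":
                continue
            before = children[i - 1] if i > 0 else None
            after = children[i + 1] if i + 1 < len(children) else None
            spans[child.get(XML_ID)] = (synch_if_anchor(before), synch_if_anchor(after))
    return spans


print(outside_anchor_spans(UTTERANCE))
```

A sentence with no adjacent anchor on one side gets `None` there, so a caller can decide whether to fall back to the nearest earlier or later anchor.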

matyaskopp commented 3 years ago

It doesn't look strange to me, well, not any stranger than the solution with one anchor - it's a completely logical extension of the original approach. Note that it is also easy (I think) to XPath when a seg starts and ends.

Ok, my anchor solution proposal:

<u>
  <seg>
    <s xml:id="s1">
      <anchor synch="#s1.start" />
      <w xml:id="s1.w1">First</w>
      <w xml:id="s1.w2">sentence</w>
      <anchor synch="#s1.end" />
    </s>
    <s xml:id="s2">
      <anchor synch="#s2.start" />
      <w xml:id="s2.w1">First</w>
      <w xml:id="s2.w2">longer</w>
      <w xml:id="s2.w3">sentence</w>
      <anchor synch="#s2.end" />
    </s>
 </seg>
</u>

Let me explain where I see a difference between the two anchor solutions: the (1st) keeps the anchors outside the s element and the (2nd) keeps them inside s.

I think the (2nd) solution is a better approximation of the unsupported construction <s start="..." end="..." > - it signifies that anchor is at the very beginning (end) of a sentence. But (1st) signifies only that anchor points to the time point that is before sentence starting time. There can be a gap between anchor and sentence beginning.

Furthermore, anchors are sentence features. It makes sense to store them inside sentences.

I would gladly introduce anchors, if you think you need them.

I would like to do audio alignment in our ParCzech project and, if I am able to provide it before our ParlaMint deadline, I would like to propagate it to the ParlaMint files. So I would like to implement audio synchronization in a way that is acceptable for ParlaMint.
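For comparison, a minimal sketch of how the "inside" variant could be resolved: the start and end of a sentence are simply the first and last child of `s`, provided both are anchors. The function name `inside_anchor_spans` is hypothetical, and the input is a trimmed, namespace-free version of the proposal above.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

UTTERANCE = """
<u>
  <seg>
    <s xml:id="s1">
      <anchor synch="#s1.start"/>
      <w>First</w><w>sentence</w>
      <anchor synch="#s1.end"/>
    </s>
    <s xml:id="s2">
      <anchor synch="#s2.start"/>
      <w>First</w><w>longer</w><w>sentence</w>
      <anchor synch="#s2.end"/>
    </s>
  </seg>
</u>
"""


def inside_anchor_spans(utterance_xml):
    """Map each <s> id to the pointers of its first and last child anchors."""
    root = ET.fromstring(utterance_xml)
    spans = {}
    for sentence in root.iter("s"):
        children = list(sentence)
        # Only sentences that both open and close with an anchor are reported.
        if len(children) >= 2 and children[0].tag == "anchor" and children[-1].tag == "anchor":
            spans[sentence.get(XML_ID)] = (
                children[0].get("synch"),
                children[-1].get("synch"),
            )
    return spans


print(inside_anchor_spans(UTTERANCE))
```

Here the lookup never leaves the sentence element, which matches the argument that the anchors are sentence features.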

TomazErjavec commented 3 years ago

it signifies that anchor is at the very beginning (end) of a sentence.

OK, this is true.

But (1st) signifies only that anchor points to the time point that is before sentence starting time. There can be a gap between anchor and sentence beginning.

Well, yes, under a somewhat perverse interpretation. I would say that if an anchor immediately precedes s, it can be taken to indicate there is no gap, otherwise you would have some other transcription elements in between.

But I would immediately accept your s/anchor (and, by analogy seg/anchor) solution, except for two things:

You can also have word-level synchronisation, and w/anchor seems really ugly, e.g.

<w xml:id="s1.w1"><anchor synch="#s1.start"/>First<anchor synch="#s1.end" /></w>

Second, you can have fewer anchors with the "outside" solution no. 1. - if there is no gap, as there typically won't be between words, you can leave out one anchor. You can't do that in your "inside" proposal, as it then semi-mutates into the "outside" one. Thoughts?

In any case, I'd wait with implementing whichever solution in the schema till we have clarin-eric/ParlaMint#46 resolved.

matyaskopp commented 3 years ago

You can also have word-level synchronisation, and w/anchor seems really ugly, e.g.

Agree

Second, you can have fewer anchors with the "outside" solution no. 1. - if there is no gap, as there typically won't be between words, you can leave out one anchor. You can't do that in your "inside" proposal, as it then semi-mutates into the "outside" one. Thoughts?

If we allow that not all words should be aligned we can merge our two solutions:

<u>
  <seg>
    <s xml:id="s1">
      <anchor synch="#s1.w1.start" />
      <w xml:id="s1.w1">First</w>
      <w xml:id="s1.w2">sentence</w>
      <anchor synch="#s1.w2.end" />
    </s>
    <s xml:id="s2">
      <anchor synch="#s2.w1.start" />
      <w xml:id="s2.w1">First</w>
      <w xml:id="s2.w2">longer</w>
      <w xml:id="s2.w3">sentence</w>
      <anchor synch="#s2.w3.end" />
    </s>
 </seg>
</u>

Anchors in this solution belong to the nearest siblings. So it is exactly your (1st) proposal. But we all know that there is no gap between sentence start/end and first/last word (<s>[NO gap]<w/>[NO gap]</s>). So anchors somehow say when the sentences start/end.

In any case, I'd wait with implementing whichever solution in the schema till we have clarin-eric/ParlaMint#46 resolved.

Agree, I did not want to solve it now; I just wanted to open a discussion. I have a colleague who will run the synchronization, and I will only put the results into the TEI files. It will need nontrivial computation time, so I have to think ahead.

TomazErjavec commented 3 years ago

If we allow that not all words should be aligned we can merge our two solutions

I don't see that we can allow that - forced alignment will typically sync all the words as far as I know (and that's the way you want it).

So it is exactly your (1st) proposal.

Exactly :) So, I am not convinced, I think outside is more natural/simpler/general.

But to get back to the original bugbear - I would try to convince TEI to have from/to on seg, w, pc. Which would solve all the problems, except, even if the council accepts this, it will take a while to appear in the Guidelines. Do you plan to include your seg sync in ParlaMint?

matyaskopp commented 3 years ago

But to get back to the original bugbear - I would try to convince TEI to have from/to on seg, w, pc. Which would solve all the problems, except, even if the council accepts this, it will take a while to appear in the Guidelines. Do you plan to include your seg sync in ParlaMint?

We need synchronized audio data for another project. So I am planning to do it. But I don't think that I will be able to provide data to you before the deadline. And furthermore, it quite misses the original ParlaMint goal.

So I do not plan to include alignment in ParlaMint. But according to email discussion (subj. Audio recordings of parliamentary speeches?), it seems that others are interested in audio alignment.

TomazErjavec commented 3 years ago

So I do not plan to include alignment in ParlaMint.

OK, so we can close this?

But according to email discussion (subj. Audio recordings of parliamentary speeches?), it seems that others are interested in audio alignment.

Let's see if they will want to do it below the level of <u> and then we can re-open it. Anyway, there was some talk about rather trying for a new "ParlaSpeech" project...

matyaskopp commented 3 years ago

Ok, closing.

Thanks for your help.