clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ES-CT: Wrong paragraph segmentation #797

Open matyaskopp opened 1 year ago

matyaskopp commented 1 year ago

Paragraph <seg> is split by notes:

[image: screenshot from TEITOK]

<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.1" xml:lang="es">En las <!-- SKIPPING --> absolutamente</seg>
<vocal type="murmuring" xml:id="ParlaMint-ES-CT_2019-12-18-4602.vocal4">
    <desc>remor de veus</desc>
</vocal>
<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.2" xml:lang="es">inaceptables <!-- SKIPPING --> Puigcercós.</seg>
<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.3" xml:lang="es">En <!-- SKIPPING --> Budó</seg>
<vocal type="murmuring" xml:id="ParlaMint-ES-CT_2019-12-18-4602.vocal5">
    <desc>persisteix la remor de veus</desc>
</vocal>
<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.4" xml:lang="es">, los catalanes merecemos...</seg>

It should be:

<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.1" xml:lang="es">En las <!-- SKIPPING --> absolutamente <vocal type="murmuring" xml:id="ParlaMint-ES-CT_2019-12-18-4602.vocal4">
    <desc>remor de veus</desc>
  </vocal> inaceptables <!-- SKIPPING --> Puigcercós.</seg>
<seg xml:id="ParlaMint-ES-CT_2019-12-18-4602.14.0.3" xml:lang="es">En <!-- SKIPPING --> Budó <vocal type="murmuring" xml:id="ParlaMint-ES-CT_2019-12-18-4602.vocal5">
    <desc>persisteix la remor de veus</desc>
  </vocal>, los catalanes merecemos...</seg>
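
For illustration only, the merge shown above could be sketched in Python with the standard `xml.etree.ElementTree` module. This is not the ParlaMint conversion code. The helper names (`local`, `ends_sentence`, `append_text`, `merge_split_segs`) are hypothetical, and the rule "merge seg + note + seg unless the first seg already ends a sentence" is an assumed heuristic inferred from this one example, so it may over- or under-merge elsewhere.

```python
import xml.etree.ElementTree as ET

# Assumption: a seg whose text ends with one of these is a complete paragraph.
SENTENCE_END = (".", "!", "?", "…")


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def ends_sentence(seg):
    """Heuristic (assumption): the seg is complete if its text ends a sentence."""
    return "".join(seg.itertext()).rstrip().endswith(SENTENCE_END)


def append_text(elem, text):
    """Append text at the very end of elem's content."""
    if len(elem):
        elem[-1].tail = (elem[-1].tail or "") + text
    else:
        elem.text = (elem.text or "") + text


def merge_split_segs(parent, note_tags=("vocal",)):
    """Fold <seg>, note, <seg> sibling runs under parent into the first <seg>."""
    i = 0
    while i + 2 < len(parent):
        first, note, second = parent[i], parent[i + 1], parent[i + 2]
        if (local(first.tag) == "seg" and local(note.tag) in note_tags
                and local(second.tag) == "seg" and not ends_sentence(first)):
            parent.remove(note)
            parent.remove(second)
            # Move the note inside the first seg, separated by a space.
            append_text(first, " ")
            note.tail = None
            first.append(note)
            # Continue with the second seg's text; no space before punctuation.
            rest = second.text or ""
            append_text(first, rest if rest[:1] in ",.;:!?" else " " + rest)
            for child in list(second):
                first.append(child)
            first.tail = second.tail
            # Stay at i: the merged seg may be followed by another note + seg.
        else:
            i += 1
```

Calling `merge_split_segs(u)` on the parent element of the segments rewrites it in place. The merged segment keeps the `xml:id` of its first part, matching the expected output above, where `…14.0.1` and `…14.0.3` survive. ParlaMint files use the TEI namespace, which is why the sketch compares local tag names only.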