clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

PT: wrong/no sentence segmentation #825

Open matyaskopp opened 10 months ago

matyaskopp commented 10 months ago

The source transcriptions of the PT debates do not seem to contain paragraphs, yet in the corpus the text is somehow segmented into paragraphs. My guess is that a paragraph (<seg>) ends whenever the punctuation ./?/ falls at the end of a line.
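To make the guess concrete, here is a minimal Python sketch of such a line-based heuristic. It is only an illustration of the hypothesis, not the actual conversion code (which is not shown in this issue); the function name `guess_paragraphs` and the punctuation set are assumptions.

```python
import re

# Assumption: a line whose last character is '.' or '?' closes the paragraph.
END_PUNCT = re.compile(r"[.?]\s*$")

def guess_paragraphs(lines):
    """Group transcript lines into paragraphs.

    A paragraph is closed whenever a line ends in sentence-final
    punctuation; punctuation in the middle of a line is ignored.
    """
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines
        current.append(text)
        if END_PUNCT.search(text):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines without closing punctuation
        paragraphs.append(" ".join(current))
    return paragraphs
```

If the real pipeline works roughly like this, a full stop in the middle of a line never starts a new unit, which would explain paragraphs that span several sentences.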

https://debates.parlamento.pt/catalogo/r3/dar/01/13/04/035/2019-01-04?sft=true#p5 (the "paragraphs" are framed): [image]

The TEI:

<seg xml:id="ParlaMint-PT_2019-01-04.seg21">Em primeiro <!-- 
--> privada. A segurança <!--
--> complementar.</seg>

The TEI.ana, where the whole <seg> ends up as a single <s> even though it contains more than one sentence:

<seg xml:id="ParlaMint-PT_2019-01-04.seg21">
  <s xml:id="ParlaMint-PT_2019-01-04.seg21.s">
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.1" msd="UPosTag=ADP" lemma="em">Em</w>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.2" msd="UPosTag=ADJ|Gender=Masc|Number=Sing" lemma="primeiro">primeiro</w>
    <!-- -->
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.14" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="privar,privado" join="right">privada</w>
    <pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.15" msd="UPosTag=PUNCT">.</pc>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.16" msd="UPosTag=DET|Gender=Fem|Number=Sing" lemma="a">A</w>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.17" msd="UPosTag=NOUN|Gender=Fem|Number=Sing" lemma="segurança">segurança</w>
    <!-- -->
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.47" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="complementar" join="right">complementar</w>
    <pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.48" msd="UPosTag=PUNCT">.</pc>
    <linkGrp targFunc="head argument" type="UD-SYN"><!-- --> </linkGrp>
  </s>
</seg>
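
To see how widespread this is, one could scan the .ana files for <s> elements that contain sentence-final punctuation followed by a capitalised word. The following is a rough Python sketch using only the standard library. The function name `suspicious_sentences` and the capitalisation test are assumptions made for illustration, and the heuristic will also flag abbreviations followed by proper names.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SENTENCE_FINAL = {".", "?", "!"}  # assumed set of sentence-final marks

def local(tag):
    """Return the tag name without its namespace (empty for non-elements)."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""

def suspicious_sentences(root):
    """Return (xml:id, count) for every <s> with internal sentence boundaries.

    An internal boundary is a <pc> holding sentence-final punctuation that
    is not the last token and is followed by a <w> starting in upper case.
    """
    hits = []
    for s in root.iter():
        if local(s.tag) != "s":
            continue
        tokens = [t for t in s.iter() if local(t.tag) in ("w", "pc")]
        internal = [
            i for i, t in enumerate(tokens[:-1])
            if local(t.tag) == "pc"
            and (t.text or "").strip() in SENTENCE_FINAL
            and local(tokens[i + 1].tag) == "w"
            and (tokens[i + 1].text or "")[:1].isupper()
        ]
        if internal:
            hits.append((s.get(XML_ID), len(internal)))
    return hits
```

Called as `suspicious_sentences(ET.parse(path).getroot())`, with `path` pointing at one .ana file, this should flag the <s> in seg21 above because of the "." followed by "A".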