clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

NO: crazy characters in ParlaMint-NO_2018-12-10 #583

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

The documents ParlaMint-NO_2018-12-10.xml and ParlaMint-NO_2018-12-10.ana.xml contain crazy characters. Beginning of the file:

   <text ana="#reference">
      <body>
         <div type="debateSection">
            <note type="comment">Møte tirsdag den 11. desember 2018 kl. 10�</note>
            <note type="comment">President:</note>
            <note type="comment">Eva Kristin Hansen</note>
            <note type="comment">�</note>
            <note type="comment">Dagsorden</note>
            <note type="comment">(nr. 29):</note>
            <note type="comment">�</note>
            <note type="comment">1. Innstilling fra næringskomiteen om Bevilgninger på statsbudsjettet for 2019, kapitler under Nærings- og fiskeridepartementet, Klima- og miljødepartementet og Landbruks- og matdepartementet (rammeområdene 9, 10 og 11)�</note>
            <note type="comment">(Innst. 8 S (2018–2019), jf. Prop. 1 S (2018–2019))�</note>
<!-- skipping notes -->
            <note type="comment">7. Referat�</note>
            <note type="speaker">Presidenten:</note>
            <u who="#person.EVH" ana="#chair" xml:id="ParlaMint-NO_2018-12-11.ana.ud331e54" xml:lang="nb">
               <seg xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55">
                  <s xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1">
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.1" lemma="representant" msd="UPosTag=NOUN|Definite=Def|Gender=Masc|Number=Sing">Representanten</w>
                     <name type="PER">
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.2" lemma="Mazyar" msd="UPosTag=PROPN">Mazyar</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.3" lemma="Keshvari" msd="UPosTag=PROPN">Keshvari</w>
                     </name>
<pc xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.4" msd="UPosTag=PUNCT">,</pc>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.5" lemma="som" msd="UPosTag=PRON|PronType=Rel">som</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.6" lemma="ha" msd="UPosTag=AUX|Mood=Ind|Tense=Pres|VerbForm=Fin">har</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.7" lemma="være" msd="UPosTag=AUX|VerbForm=Part">vært</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.8" lemma="permittere" msd="UPosTag=VERB|VerbForm=Part" join="right">permittert</w>
<pc xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.9" msd="UPosTag=PUNCT">,</pc>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.10" lemma="ha" msd="UPosTag=AUX|Mood=Ind|Tense=Pres|VerbForm=Fin">har</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.11" lemma="igjen" msd="UPosTag=ADV">igjen</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.12" lemma="ta" msd="UPosTag=VERB|VerbForm=Part">tatt</w>
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.13" lemma="sete" msd="UPosTag=NOUN|Definite=Ind|Gender=Neut|Number=Sing" join="right">sete</w>
<pc xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.14" msd="UPosTag=PUNCT">.</pc>
                     <name type="ORG">
<w xml:id="ParlaMint-NO_2018-12-11.ana.segd331e55.1.15" lemma="$�" msd="UPosTag=PUNCT">�</w>
                     </name>
<!-- ... -->
tungland commented 1 year ago

Yes. I don't know why the transcribers put it there, but they did. Source file: https://data.stortinget.no/eksport/publikasjon?publikasjonid=refs-201819-12-11

We unfortunately don't have capacity to manually check 24 years of parliamentary debates. There are many spelling errors, unconventional use of characters and so on. All the national corpora must be like this. And correcting this particular error would be quite a bit of work, as simply deleting the symbol would surely break the dependency links in the .ana file.

@matyaskopp I would really prefer to leave it as is.
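To gauge how much work a fix would be, the affected tokens can at least be located automatically. The following is a minimal sketch, not part of the ParlaMint tooling: the function name `find_replacement_tokens` is hypothetical, and it assumes that tokens are `<w>` or `<pc>` elements carrying an `xml:id`, as in the excerpt above. It only reports the ids; it does not remove anything, so links in the .ana file are left untouched.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_replacement_tokens(xml_text):
    """Return the xml:id of every <w>/<pc> whose text or lemma contains U+FFFD."""
    root = ET.fromstring(xml_text)
    hits = []
    for el in root.iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if present
        if tag not in ("w", "pc"):
            continue
        if REPLACEMENT in (el.text or "") or REPLACEMENT in el.get("lemma", ""):
            hits.append(el.get(XML_ID))
    return hits
```

The returned ids could then be checked against the dependency links before deciding whether any token can be dropped safely.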

matyaskopp commented 1 year ago

I agree that this character is one of many (though it really looks terrible). It seems, however, that it is allowed in TEI (@TomazErjavec ?)

There is another issue with characters (see the documentation: https://clarin-eric.github.io/ParlaMint/#sec-chars). I have run Scripts/chars.pl and Scripts/chars-summ.pl on your ParlaMint-NO.TEI data and filtered for HYPHEN and SPACE characters:

Code    Char    Occurs     % chars  In docs  % docs  Unicode name
U+0020  <CTRL>  156510934  18.44    3280     100.00  SPACE
U+002D  -       6706215    0.79     3280     100.00  HYPHEN-MINUS
U+2002          154        0.00     31       0.95    EN SPACE
U+2003          175        0.00     44       1.34    EM SPACE
U+2005          5          0.00     4        0.12    FOUR-PER-EM SPACE
U+2006          9          0.00     8        0.24    SIX-PER-EM SPACE
U+2009          779        0.00     155      4.73    THIN SPACE
U+200A          6          0.00     3        0.09    HAIR SPACE
U+2011  ‑       73         0.00     12       0.37    NON-BREAKING HYPHEN

U+0020 and U+002D are OK, but the rest should be replaced.
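For readers without Perl at hand, a comparable tally can be produced in a few lines of Python. This is only a sketch under stated assumptions: it is not a port of chars.pl or chars-summ.pl, the function name `char_table` is hypothetical, and it expects the document texts to be already loaded into a dictionary.

```python
from collections import Counter
import unicodedata


def char_table(docs):
    """docs maps a document name to its text.

    Returns rows of (code point, occurrences, number of documents, Unicode name),
    sorted by code point.
    """
    occurs = Counter()   # total occurrences of each character
    in_docs = Counter()  # number of documents each character appears in
    for text in docs.values():
        occurs.update(text)
        in_docs.update(set(text))
    rows = []
    for ch, n in sorted(occurs.items()):
        name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
        rows.append((f"U+{ord(ch):04X}", n, in_docs[ch], name))
    return rows
```

Filtering the rows for names containing SPACE or HYPHEN would give a listing similar in spirit to the one above.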

tungland commented 1 year ago

@matyaskopp If these are illegal why are they not mentioned in the list of illegal characters, and why were they not checked for in validation?

We simply do not have the capacity to redo the corpus now. If this was an issue, we really would have needed to know about this before.

tungland commented 1 year ago

OK, I see some of them are among the illegal chars. Unfortunate. They should have been normalized. Normalizing the nonstandard spaces to U+0020 and the U+2011 to U+002D should be quickly done. Hopefully it won't break the .ana docs. But I don't know whether we are able to deal with the replacement character. It could maybe be normalized, but to what? Anyway, it seems it is at least not an illegal character.

tungland commented 1 year ago

I ran a simple replace script on the corpus. That should do it for the whitespace and hyphen.

ParlaMint-NO.TEI ParlaMint-NO.TEI.ana
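A replacement of this kind can be sketched as below. This is an illustration only, not the script that was actually run on the corpus; the names `SPACE_LIKE`, `TABLE` and `normalize_chars` are hypothetical. The mapping covers the characters from the listing above: the nonstandard spaces go to U+0020 and U+2011 goes to U+002D, while U+FFFD is deliberately left alone.

```python
# Nonstandard spaces reported for the NO corpus, all mapped to a plain space.
SPACE_LIKE = "\u2002\u2003\u2005\u2006\u2009\u200a"

TABLE = {ord(c): " " for c in SPACE_LIKE}
TABLE[0x2011] = "-"  # NON-BREAKING HYPHEN -> HYPHEN-MINUS


def normalize_chars(text):
    """Replace each listed character one-for-one; everything else is unchanged."""
    return text.translate(TABLE)
```

Because every substitution is one character for one character, the text length and the number of tokens stay the same, which is why this kind of change is unlikely to disturb the ids in the .ana files.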

matyaskopp commented 1 year ago

> @matyaskopp If these are illegal why are they not mentioned in the list of illegal characters, and why were they not checked for in validation?

I agree that the validation is far from complete (I added issue #586). There are many documented features that are not validated, but the documentation is quite clear.

> I ran a simple replace script on the corpus. That should do it for the whitespace and hyphen.
>
> ParlaMint-NO.TEI ParlaMint-NO.TEI.ana

Thanks, @TomazErjavec, can you update the NO corpus, please?

TomazErjavec commented 1 year ago

> Thanks, @TomazErjavec, can you update the NO corpus, please?

Already did it but forgot to let you know, sorry! The log is at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-NO.log and the corpus is also on the beta concordancer: https://www.clarin.si/noske-beta/parlamint30.cgi/corp_info?corpname=parlamint30_no&struct_attr_stats=1&subcorpora=1

Looks ok to me, except the (already seen) CoNLL-U parse problems.

TomazErjavec commented 1 year ago

@tungland, do you plan to fix the remaining char issues for 3.1 (I have put that milestone on this issue), or not, in which case we close it?

tungland commented 1 year ago

I think I already submitted a corpus without these? It was way back.

TomazErjavec commented 1 year ago

OK, great, so, closing.

tungland commented 1 year ago

I think I already submitted a corpus without these? It was way back.

Edit: yes, see this comment https://github.com/clarin-eric/ParlaMint/issues/583#issuecomment-1387256829