clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

XML entities not decoded in vert format #699

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

@AnnaParla reported this for the UA corpus: Фракція політичної партії Всеукраїнське об'єднання "Батьківщина" (and many other parliamentary groups):
https://www.clarin.si/ske-beta/#text-type-analysis?corpname=parlamint30_ua&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.speaker_party_name

I also saw it in the PT corpus: Grupo Parlamentar do Partido Ecologista "Os Verdes":
https://www.clarin.si/ske-beta/#text-type-analysis?corpname=parlamint30_pt&wlminfreq=1&wlicase=1&include_nonwords=1&showresults=1&wlnums=frq&wlattr=speech.speaker_party_name

@TomazErjavec, I remember we were discussing this maybe two years back. I don't remember if there was anything you could do... Perhaps a new nosketch solved this???

<speech id="ParlaMint-UA_2023-02-07-m0.u2" 
    ...  
    speaker_party="фЄС" 
    speaker_party_name="Фракція політичної партії &#34;Європейська солідарність&#34;" 
    ...>
<!-- -->
</speech>

Other possible solutions: we can recommend avoiding the " character, or replace it with a different one in the conversion to vert.

TomazErjavec commented 1 year ago

Well spotted. I do in fact change XML entities to chars in the vertical files, but forgot that we can have not only &quot; but also numeric character references such as &#34;. Fixed now in 85ed583, as well as in the corpus on the concordancers. But, of course, I found other bugs now, sigh...
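
For readers hitting the same problem, here is a minimal Python sketch of the kind of decoding described above. It is not the code from 85ed583, and the helper name `decode_xml_entities` is hypothetical. It assumes the attribute value arrives as a plain string and that only the five predefined XML entities plus decimal and hexadecimal character references need handling.

```python
import re
from xml.sax.saxutils import unescape

# unescape() handles &amp;, &lt; and &gt; itself; the other two predefined
# XML entities have to be passed in explicitly.
EXTRA_NAMED = {"&quot;": '"', "&apos;": "'"}

# Decimal (&#34;) and hexadecimal (&#x22;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _char_ref_to_char(match):
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return chr(code_point)


def decode_xml_entities(text):
    """Hypothetical helper: decode character references and predefined entities."""
    # Character references are replaced first; unescape() replaces &amp; last,
    # so an escaped ampersand such as "&amp;#34;" is decoded only once.
    text = _CHAR_REF.sub(_char_ref_to_char, text)
    return unescape(text, EXTRA_NAMED)


raw = "Фракція політичної партії &#34;Європейська солідарність&#34;"
decoded = decode_xml_entities(raw)
```

Handling only the named entities leaves `&#34;` untouched, which matches the symptom reported in this issue. Whether a literal " is safe inside a quoted vert attribute depends on the concordancer, so that part is left open here.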