clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

CZ: audio file path does not correspond to AudioPSP 24.01 #849

Closed matyaskopp closed 9 months ago

matyaskopp commented 9 months ago

The current value is url="2013ps/audio/2016/10/27/2016102714281442.mp3" (see https://github.com/clarin-eric/ParlaMint/blob/cb93f7eb5002b6bd608600a6c800accfdce9c72b/Samples/ParlaMint-CZ/ParlaMint-CZ_2016-10-27-ps2013-050-07-005-262.xml#L59),

but it should be url="audio/psp/2016/10/27/2016102714281442.mp3"

so that the audio paths match the data in this record and it can be used directly:

Kopp, Matyáš, 2024, AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-5404.

this script fixes it in ParCzech:

  <!-- Rewrite the legacy "<digits>ps/audio/" prefix of the audio URL to "audio/psp/" -->
  <xsl:template match="tei:recording[@type='audio']/tei:media/@url">
    <xsl:attribute name="url" select="replace(.,'^[0-9]*ps/audio/','audio/psp/')"/>
  </xsl:template>

but I believe it is safe in this case to apply a regex directly to the XML: s/url="[0-9]*ps\/audio\//url="audio\/psp\//
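
For readers who prefer not to use sed, the substitution above can be sketched in Python. This is only an illustration of the same regex, not the script that was actually run; the function name `fix_audio_urls` and the one-line sample element are made up for the example. Like the sed expression, it rewrites any `url="…ps/audio/` attribute it finds in the text, not only those on `tei:media`.

```python
import re

# Legacy prefix such as "2013ps/audio/" at the start of a url attribute value.
LEGACY_URL = re.compile(r'url="[0-9]*ps/audio/')


def fix_audio_urls(xml_text: str) -> str:
    """Replace the legacy prefix with the "audio/psp/" layout (hypothetical helper)."""
    return LEGACY_URL.sub('url="audio/psp/', xml_text)


# Illustrative input only; the real element carries more attributes.
sample = '<media url="2013ps/audio/2016/10/27/2016102714281442.mp3"/>'
print(fix_audio_urls(sample))
```

Values that already start with `audio/psp/` do not match the pattern, so running the substitution twice leaves the file unchanged.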

@TomazErjavec, should I do it and insert the fix to my tantra-home? or will you process it yourself?

TomazErjavec commented 9 months ago

> @TomazErjavec, should I do it and insert the fix to my tantra-home?

Yes please, and let me know when done and what I should do.

matyaskopp commented 9 months ago

I have used data from

/project/corpora/Parla/ParlaMint/ParlaMint/Corpora/Sources-TEI/

and placed the result here:

/home/kopp/ParlaMint-CZ-4.1/

The Czech folders in Sources-TEI can be overwritten with it.
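
The thread does not show how the fixed copy was produced. A minimal sketch of such a batch step, assuming the regex from the first comment and a plain copy of every `.xml` file from a source directory to a result directory, might look like this. The function name `fix_tree` and its arguments are hypothetical; the directories above would be passed in as `src_dir` and `dst_dir`.

```python
import re
from pathlib import Path

# Same pattern as the sed expression quoted earlier in the thread.
LEGACY_URL = re.compile(r'url="[0-9]*ps/audio/')


def fix_tree(src_dir: Path, dst_dir: Path) -> int:
    """Write a fixed copy of every .xml file under src_dir into dst_dir.

    The source files are left untouched. Returns the number of files
    whose content was changed by the substitution.
    """
    changed = 0
    for src in sorted(src_dir.rglob("*.xml")):
        text = src.read_text(encoding="utf-8")
        fixed = LEGACY_URL.sub('url="audio/psp/', text)
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(fixed, encoding="utf-8")
        if fixed != text:
            changed += 1
    return changed
```

Writing to a separate directory rather than editing in place keeps the original sources available for comparison until the result has been checked.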

TomazErjavec commented 9 months ago

Done! I will process it as soon as the queue empties.