Parlamint-en data formats

clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora

https://clarin-eric.github.io/ParlaMint/


Parlamint-en data formats #784

Closed jonatankrause closed 10 months ago

jonatankrause commented 1 year ago

Hi,

Disclaimer: limited programming experience ... please bear with me - I'm trying :)

Regarding the English-language ParlaMint data: previous ParlaMint datasets I have worked with included accessible versions in .tsv or .txt format. I was able to merge these files into dataframes in R and Python, but it seems that the English-language data (https://www.clarin.si/repository/xmlui/handle/11356/1810) only exists in formats that I'm not familiar with (including XML). I spent a couple of hours trying to figure out how to read the XML-formatted data into a regular dataframe, but so far without progress.

1. Are there any resources available here or at the Clarin website on how to convert xml formatted data into a dataframe in R or Python?

If not,

2. I was wondering if you plan on including .txt/.tsv formats in future versions? (and if so, when are these expected)

Thank you in advance - and apologies for a simple question ...

matyaskopp commented 1 year ago

You can convert ParlaMint TEI to text with this script: Scripts/parlamint-tei2text.xsl. We use this script for conversion of non-annotated TEI files, but it should also work for the annotated version.

java -jar saxon.jar -xsl:Scripts/parlamint-tei2text.xsl component-file.xml > component-file.txt

Saxon from saxonica.com is recommended because the system saxon does not support all XSLT features.
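For anyone who wants to run the command above over many component files at once, here is a minimal Python sketch. It assumes `saxon.jar` and the stylesheet sit at the same relative paths as in the command above. The function names `build_command` and `convert_tree` and the directory arguments are hypothetical and not part of the ParlaMint tooling.

```python
import subprocess
from pathlib import Path


def build_command(xml_path, saxon_jar="saxon.jar",
                  stylesheet="Scripts/parlamint-tei2text.xsl"):
    """Return the Saxon command line for one file (same arguments as the command above)."""
    return ["java", "-jar", str(saxon_jar), f"-xsl:{stylesheet}", str(xml_path)]


def convert_tree(in_dir, out_dir, pattern="*.xml", **kwargs):
    """Run the conversion for every file under in_dir that matches pattern.

    Each result is written to out_dir as <input stem>.txt.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(Path(in_dir).rglob(pattern)):
        txt_path = out_dir / (xml_path.stem + ".txt")
        with open(txt_path, "wb") as out:
            # Redirecting stdout replaces the shell's "> component-file.txt".
            subprocess.run(build_command(xml_path, **kwargs), stdout=out, check=True)
```

A call such as `convert_tree("ParlaMint-XX-en.TEI.ana", "txt-out")` would walk the whole directory (both directory names are illustrative). The default pattern matches every `.xml` file, including any that are not component files, so it may need narrowing.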

The TSV file with metadata can be used from non-translated version - we haven't changed ids.

I don't know if we want to include .txt/.tsv files in the ParlaMint-en version, and I haven't formed an opinion yet. @TomazErjavec Now I see that it has been accidentally generated with a GitHub action (bug or feature):

TomazErjavec commented 1 year ago

Thanks for bringing this up @jonatankrause: it was simply an oversight, because I used to generate the plain text version of the corpus (= ParlaMint-XX.txt/) from the plain text XML version (= ParlaMint-XX.TEI), and this one doesn't exist for ParlaMint-XX-en. But now .txt can be generated from either.

The TSV metadata files are included in the ParlaMint-XX-en.conllu/ directories. But yes, as @matyaskopp writes, they are basically the same as for http://hdl.handle.net/11356/1486.

Processing the XML corpora with the XSLT script might be hard: if nothing else, you need a big Unix machine. But if you have a bit of programming experience, you could take the vertical files, get rid of the tags in pointy brackets, and keep only the first column (i.e. the words) of the remaining lines.
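The vertical-file route described above could look roughly like this in Python. It is only a sketch, assuming one token per line with tab-separated columns and structural tags on lines of their own. The function name `vertical_to_words` is made up for illustration.

```python
def vertical_to_words(lines):
    """Yield the first column of every token line, skipping tag lines.

    Assumes any line starting with '<' is a structural tag, i.e. that a
    literal '<' token would be escaped in the vertical file.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # blank line or a tag in pointy brackets
        yield line.split("\t", 1)[0]  # keep only the word column
```

With a file open as `fh`, `" ".join(vertical_to_words(fh))` gives the running text as space-separated tokens. Punctuation ends up detached from the preceding word, because the sketch does not try to interpret the tags.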

Or wait for ParlaMint-en 3.1: ParlaMint-en.txt/ will definitely be in there! Leaving this issue open until I can prove it is.

jonatankrause commented 1 year ago

Hi again,

Thanks. From your response I gathered that the xml files in the .TEI.ana directories present the text information in a vertical format without metadata.

I was able to automate a conversion of all the .xml files into .txt files (using the saxon program you referenced), and then move these into separate directories along with the .tsv metadata from the .conllu directories. So now I have a text-metadata (.txt .tsv) structure similar to the multilingual datasets that I've worked with earlier (took most of the day, but here we are:).
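For readers who want to go from such a .txt/.tsv pair to a dataframe, a minimal Python sketch follows. It rests on two assumptions that should be checked against the actual files: each line of the .txt file has the form `ID<TAB>text`, and the metadata TSV has a header row with a matching identifier column, called `ID` here. The function names are hypothetical.

```python
import csv


def read_utterances(txt_path):
    """Map utterance ID -> text, assuming each line is 'ID<TAB>text'."""
    texts = {}
    with open(txt_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            utt_id, _, text = line.partition("\t")
            texts[utt_id] = text
    return texts


def merge_with_metadata(txt_path, tsv_path, id_column="ID"):
    """Return one dict per metadata row, with the matching text added under 'text'."""
    texts = read_utterances(txt_path)
    rows = []
    with open(tsv_path, encoding="utf-8", newline="") as fh:
        # QUOTE_NONE keeps quotation marks in the metadata as literal characters.
        for row in csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE):
            row["text"] = texts.get(row[id_column], "")  # empty if no text was found
            rows.append(row)
    return rows
```

The returned list of dicts can be passed directly to `pandas.DataFrame(rows)` if pandas is installed.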

Thank you. And I appreciate that the txt versions will be in the next release :).

TomazErjavec commented 10 months ago

OK, the .txt is now added to the ana release, which includes the MTed version. So it will be available in release 4.0, coming soon.