clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

LV Feedback #590

Closed: matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

taxonomies

You have changed some taxonomies.

This will cause trouble in ParlaMint v3.1, where we want to merge all translations of the taxonomies into one.

You can extract the taxonomies from the root file with:

# factorize taxonomies and list(Person|Org)
make factorize-teiHeader-INPLACE-LV
# add new files into the repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-LV/ParlaMint-LV-taxonomy-*.xml
git add Data/ParlaMint-LV/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-LV/ParlaMint-LV-list*.xml

That way you only need to make changes in one place (most of the taxonomies are shared between the TEI and TEI.ana versions).

idno type

Specification: https://clarin-eric.github.io/ParlaMint/#TEI.idno
Affected line: https://github.com/Skriptotajs/ParlaMint/blob/d2895e5fc0926974293c14973c7d5285e4e17b6b/Data/ParlaMint-LV/ParlaMint-LV.xml#L293

<idno type="wikimedia" xml:lang="lv">https://lv.wikipedia.org/wiki/Saeima</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="lv">https://lv.wikipedia.org/wiki/Saeima</idno>
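
If there are many such elements, the change can be scripted. The following is only a minimal sketch, assuming the files use the standard TEI namespace and that every affected element has type="wikimedia" with a URL as its text; the function name fix_idno_types is made up for illustration and is not part of the ParlaMint tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def fix_idno_types(root):
    """Rewrite <idno type="wikimedia"> holding a URL to type="URI" subtype="wikimedia".

    Hypothetical helper: returns the number of elements changed.
    """
    changed = 0
    for idno in root.iter(f"{{{TEI_NS}}}idno"):
        value = (idno.text or "").strip()
        kind = idno.get("type")
        if kind == "wikimedia" and value.startswith("http"):
            idno.set("type", "URI")
            idno.set("subtype", kind)
            changed += 1
    return changed


# The <idno> from the issue, wrapped in a dummy parent element for the demo.
sample = ET.fromstring(
    f'<org xmlns="{TEI_NS}">'
    '<idno type="wikimedia" xml:lang="lv">https://lv.wikipedia.org/wiki/Saeima</idno>'
    "</org>"
)
changed = fix_idno_types(sample)
print(changed)
```

Note that writing a file back with ElementTree may alter namespace prefixes and formatting, so the diff should be checked before committing.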

term in component file

The term number in the titles (12) differs from the one in the <meeting> element (13): https://github.com/Skriptotajs/ParlaMint/blob/d2895e5fc0926974293c14973c7d5285e4e17b6b/Data/ParlaMint-LV/ParlaMint-LV_2014-11-11-PT12-270.xml#L9-L12

            <title type="main" xml:lang="lv">Latvijas parlamenta corpus ParlaMint-LV, 12. Saeima, 2014-11-11 [ParlaMint]</title>
            <title type="main" xml:lang="en">Latvian parliamentary corpus ParlaMint-LV, 12th Term, 2014-11-11 [ParlaMint]</title>
            <meeting corresp="#PT" ana="#parla.meeting.regular">Regulārā</meeting>
            <meeting n="13" corresp="#PT" ana="#parla.term #PT.13">13. sasaukums</meeting>
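
A check like this could be run over all component files. The sketch below is illustrative only: it assumes the titles always follow the "12. Saeima" / "12th Term" wording seen above and that the term is recorded on a <meeting> whose ana attribute contains #parla.term; the function name term_numbers is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumed title wording: "12. Saeima" (lv) or "12th Term" (en).
TERM_IN_TITLE = re.compile(r"(\d+)(?:\. Saeima|(?:st|nd|rd|th) Term)")


def term_numbers(root):
    """Return (terms found in titles, terms found in <meeting n=...>) as sorted lists."""
    in_titles = set()
    for title in root.iter(f"{{{TEI_NS}}}title"):
        match = TERM_IN_TITLE.search(title.text or "")
        if match:
            in_titles.add(match.group(1))
    in_meetings = set()
    for meeting in root.iter(f"{{{TEI_NS}}}meeting"):
        is_term = "#parla.term" in (meeting.get("ana") or "").split()
        if is_term and meeting.get("n"):
            in_meetings.add(meeting.get("n"))
    return sorted(in_titles), sorted(in_meetings)


# The four lines quoted above, wrapped in a <titleStmt> for the demo.
sample = ET.fromstring(f"""<titleStmt xmlns="{TEI_NS}">
  <title type="main" xml:lang="lv">Latvijas parlamenta corpus ParlaMint-LV, 12. Saeima, 2014-11-11 [ParlaMint]</title>
  <title type="main" xml:lang="en">Latvian parliamentary corpus ParlaMint-LV, 12th Term, 2014-11-11 [ParlaMint]</title>
  <meeting corresp="#PT" ana="#parla.meeting.regular">Regulārā</meeting>
  <meeting n="13" corresp="#PT" ana="#parla.term #PT.13">13. sasaukums</meeting>
</titleStmt>""")

titles, meetings = term_numbers(sample)
print(titles, meetings)  # ['12'] ['13'], i.e. a mismatch
```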

meeting - sitting

Each sitting is stored in one file and the title gives its date, but no <meeting> element records that information. I suggest adding:

<meeting n="2014-11-11" corresp="#PT" ana="#parla.sitting">2014-11-11</meeting>

Skriptotajs commented 1 year ago

The corpus validates with the original taxonomies, so the LV one is just a subset.

matyaskopp commented 1 year ago

Thanks, it is much better. Your data are almost ready to merge.

There is only one thing I have spotted: the NER taxonomy file does not contain translations, and the English version lacks the xml:lang="en" attribute: https://github.com/Skriptotajs/ParlaMint/blob/763338f4f2419ec3d4f070e95700e7e2fd57aa27/Data/ParlaMint-LV/ParlaMint-taxonomy-NER.ana.xml

It should look like this: https://github.com/clarin-eric/ParlaMint/blob/af4155773fcd05f1b85ffa0443330dfdd36533f9/Data/ParlaMint-UA/ParlaMint-taxonomy-NER.ana.xml#L1-L18
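
A quick way to spot such gaps is sketched below. It rests on assumptions: that each <category> carries one <catDesc> per language, distinguished by xml:lang, and that "en" and "lv" are the languages required here. The sample taxonomy is invented for the demo and is not copied from either linked file; taxonomy_lang_problems is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def taxonomy_lang_problems(root, required=("en", "lv")):
    """List (category id, problem) pairs for missing xml:lang or missing translations."""
    problems = []
    for cat in root.iter(f"{{{TEI_NS}}}category"):
        cat_id = cat.get(f"{{{XML_NS}}}id", "?")
        # Only direct <catDesc> children describe this category.
        langs = [d.get(f"{{{XML_NS}}}lang") for d in cat.findall(f"{{{TEI_NS}}}catDesc")]
        if None in langs:
            problems.append((cat_id, "catDesc without xml:lang"))
        for lang in required:
            if lang not in langs:
                problems.append((cat_id, f"no catDesc for {lang}"))
    return problems


# Invented sample: one category whose only description has no xml:lang.
sample = ET.fromstring(
    f'<taxonomy xmlns="{TEI_NS}">'
    '<category xml:id="PER"><catDesc><term>PER</term>: person</catDesc></category>'
    "</taxonomy>"
)
problems = taxonomy_lang_problems(sample)
for cat_id, problem in problems:
    print(cat_id, problem)
```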

Skriptotajs commented 1 year ago

Translated NER taxonomy

matyaskopp commented 1 year ago

Thanks.