clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Data LT sample #610

Closed mindpetk closed 1 year ago

mindpetk commented 1 year ago

There are still some issues with the data.

ERROR ParlaMint-LT: Empty pointer!
ERROR ParlaMint-LT.ana: Empty pointer!

It's a mystery to me where the errors are coming from. It could be helpful if you provided some explanations. Thanks.

ParlaMint-LT_1996-11-05-seimas-3-1.ana.xml:24240:107: error: element "u" incomplete; expected element "gap", "incident", "kinesic", "note", "pb", "seg" or "vocal"

These are easily fixable issues. Simply a tag with no content. Will be corrected in the following update.
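A minimal sketch of how such empty elements could be located before running the validator. It assumes the files use the standard TEI namespace and that "empty" means no child elements and no text; the function name `find_empty_elements` and the sample document are hypothetical, not part of the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def find_empty_elements(xml_text, names=("u", "seg")):
    """Return (tag, xml:id) for each named TEI element with no children and no text."""
    root = ET.fromstring(xml_text)
    empty = []
    for name in names:
        for el in root.iter(f"{{{TEI_NS}}}{name}"):
            has_text = bool((el.text or "").strip())
            if len(el) == 0 and not has_text:
                empty.append((name, el.get(XML_ID)))
    return empty

# Hypothetical sample: the second utterance has no content.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="u1"><seg xml:id="s1">Labas.</seg></u>
  <u xml:id="u2"/>
</TEI>"""

print(find_empty_elements(sample))
```

For the real corpus files, the text would be read from disk (for example with `open(path, encoding="utf-8").read()`) and passed to the same function.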

matyaskopp commented 1 year ago

@mindpetk, please reduce the sample size: https://github.com/clarin-eric/ParlaMint/actions/runs/4298192587/jobs/7492041446#step:4:30. Remove one pair of TEI + TEI.ana files.

mindpetk commented 1 year ago

I deleted several files.

Also fixed overlapping dates and empty "seg" tags. Only empty pointer errors remain.

ERROR ParlaMint-LT: Empty pointer!
ERROR ParlaMint-LT.ana: Empty pointer!

I'm not sure how to fix them. I would appreciate any assistance you can offer. Thanks.

matyaskopp commented 1 year ago

The link checker expects space-normalized attribute values. You have a double space between two references (between #parliamentaryGroup.LCF.870 and #parliamentaryGroup.SDKF.793):

<relation ana="#S.5" from="2006-07-06" mutual="#parliamentaryGroup.LCSF.10330 #parliamentaryGroup.LCF.870  #parliamentaryGroup.SDKF.793 #parliamentaryGroup.LSDPF.793 #parliamentaryGroup.PDF.980 #parliamentaryGroup.VLPDF.1000 #parliamentaryGroup.VNDF.7970 #parliamentaryGroup.VLPDF.1000 #parliamentaryGroup.VLF.797" name="coalition" to="2008-11-16"/>

@TomazErjavec is it a bug or feature? https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Scripts/check-links.xsl#L76-L77
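To illustrate why a double space can surface as an "Empty pointer!" message, here is a small sketch. It assumes the checker splits the attribute value on single space characters, which has not been verified against check-links.xsl; the helper `empty_pointer_positions` is hypothetical.

```python
def empty_pointer_positions(value):
    """Return the indexes of empty tokens produced by splitting on a single space."""
    return [i for i, tok in enumerate(value.split(" ")) if tok == ""]

# Two references from the @mutual attribute above, separated by two spaces.
mutual = "#parliamentaryGroup.LCF.870  #parliamentaryGroup.SDKF.793"

print(mutual.split(" "))
print(empty_pointer_positions(mutual))
```

Under that assumption, the extra space yields an empty string between the two references, and an empty string cannot be resolved as a pointer.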

TomazErjavec commented 1 year ago

Feature. The W3C XML specification says "values of type IDREFS MUST match Names", and Names is defined as Name (#x20 Name)*.

So, yes, they must be separated by a single space.
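One way to bring existing attribute values into that form is sketched below. It is not part of the ParlaMint tooling: the function names are hypothetical, the regular expression assumes double-quoted attribute values, and a text-level rewrite is used only so that the rest of the file stays byte-for-byte unchanged.

```python
import re

def normalize_pointers(value):
    """Collapse any run of whitespace to one space and trim both ends."""
    return " ".join(value.split())

def normalize_attr_in_text(xml_text, attr="mutual"):
    """Rewrite every attr="..." value in xml_text so its references are single-space separated."""
    pattern = re.compile(rf'(\b{re.escape(attr)}=")([^"]*)(")')
    return pattern.sub(
        lambda m: m.group(1) + normalize_pointers(m.group(2)) + m.group(3),
        xml_text,
    )

# Shortened, hypothetical version of the <relation> element quoted above.
line = '<relation mutual="#parliamentaryGroup.LCF.870  #parliamentaryGroup.SDKF.793" name="coalition"/>'
print(normalize_attr_in_text(line))
```

Other pointer-valued attributes would need the same treatment by passing a different `attr` name.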

matyaskopp commented 1 year ago

@mindpetk Can you please factorize your files using the following procedure?

# factorize taxonomies and list(Person|Org)
make factorize-teiHeader-INPLACE-LT
# add new files into the repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-LT/ParlaMint-LT-taxonomy-*.xml
git add Data/ParlaMint-LT/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-LT/ParlaMint-LT-list*.xml
# commit changes and push to GitHub
git commit -m "LT: factorize header" Data/ParlaMint-LT/ParlaMint*.xml 
git push

I will then give you better feedback - currently, referring to the file lines is impossible because the files are too large.