Closed by matyaskopp 1 year ago
I am not sure how I am supposed to mark the individual problems as done when they are all listed in one GitHub issue, so I will fix them gradually and let you decide when to close it.
A few comments:
* very few government/parliament members – the sample contains only the persons from the sample.
Actually, it should contain all the persons (and orgs), even if they are not used in the sample. This way we can Git-validate at least the root files completely, and the GitHub samples are good for computing statistics on the whole-corpus meta-data. This was the same in V1.
* wrong timespan – is 2022-06 the cutoff or should I add later sessions if I have them?
No, they should stop at cutoff.
* speech with interruptions – we are having all the sessions proofread. This should fix most of such problems, but I expect this to be available only by the end of January. Is that OK?
That is tight but maybe doable. Why don't we continue debugging and, when all is OK with the samples, submit the complete PL corpus; once that validates, we can see whether the corrected version becomes available in time.
* huge amount of L2 syntax errors – I get the analysis from an external tool that I am not able to modify. Is the whole analysis wrong or can this be solved by some kind of automatic changes?
Matyáš already answered this:
This error can be fixed by replacing such a relation with root.
@mrudolf Nice work, thanks. There is only one issue remaining: setting has wrong timespan: https://github.com/mrudolf/ParlaMint/blob/fc88bc78892680c2c182339b7459656182729199/Data/ParlaMint-PL/ParlaMint-PL.xml#L430
<setting>
<name type="address">ul. Wiejska 4/6/8</name>
<name type="city">Warszawa</name>
<name type="country" key="PL">Poland</name>
<date from="2015-11-06" to="2020-08-18">12.11.2015 - 14.8.2020</date>
</setting>
Thanks, I was sure I had fixed that, but apparently I hadn't.
I would do it and then, once I reprocess a few files, should I submit the whole corpus to Tomaž?
— Michał Rudolf
I would do it and then, once I reprocess a few files, should I submit the whole corpus to Tomaž?
exactly
& precisely :) looking forward to it.
very few government/parliament members
Does your sample contain all persons or just a subset? You have only 16 government affiliations, which correspond to one term in Poland, I guess. The same holds for parliament members (218) and Senat (7).
wrong timespan in title, bibl, and setting
We expect content up to 2022-06. https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L8-L9
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L71
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L430
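A check like this could be automated before resubmitting. The following is only a minimal sketch: the function name timespan_problems is hypothetical, the expected end month is passed in as a parameter, and the embedded setting element is the one quoted earlier in this thread, without the TEI namespace that the real root file would carry.

```python
import xml.etree.ElementTree as ET

# The <setting> element as quoted in this thread (no TEI namespace here;
# a real ParlaMint root file would need namespace-aware lookups).
SETTING = """<setting>
<name type="address">ul. Wiejska 4/6/8</name>
<name type="city">Warszawa</name>
<name type="country" key="PL">Poland</name>
<date from="2015-11-06" to="2020-08-18">12.11.2015 - 14.8.2020</date>
</setting>"""


def timespan_problems(setting_xml, expected_end_month):
    """Return a list of problems found in the <date> of a <setting>.

    expected_end_month is a 'YYYY-MM' string; treating the cutoff as a
    month prefix of @to is an assumption made for this sketch.
    """
    date = ET.fromstring(setting_xml).find("date")
    if date is None:
        return ["no <date> element found"]
    date_from, date_to = date.get("from"), date.get("to")
    if not date_from or not date_to:
        return ["<date> lacks @from or @to"]
    problems = []
    # ISO dates of equal length compare correctly as plain strings.
    if date_from > date_to:
        problems.append(f"@from={date_from} is later than @to={date_to}")
    if not date_to.startswith(expected_end_month):
        problems.append(f"@to={date_to} does not fall in {expected_end_month}")
    return problems


if __name__ == "__main__":
    for problem in timespan_problems(SETTING, "2022-06"):
        print(problem)
```

The same function could be pointed at title and bibl dates if they carry from/to attributes, but that is not verified here.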
missing terms events in parliament organizations
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL-listOrg.xml#L7-L12
proper source link (nice to have)
The proper link to your sample file is: https://www.sejm.gov.pl/sejm8.nsf/wypowiedz.xsp?posiedzenie=5&dzien=2&wyp=0 https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L70
missing speaker notes
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L119
strange note annotation
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L157
speech in (..)
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L161
speech with interruptions
There are a lot of speeches of this kind in your sample
https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L478-L500
should be encoded this way:
huge amount of L2 syntax errors
DEPREL must be 'root' if HEAD is 0.
This error can be fixed by replacing such a relation with root.
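If the external tool's output is available as CoNLL-U, the replacement could be done automatically along these lines. This is a rough sketch, not the procedure actually used for ParlaMint-PL: the function name fix_root_deprel is hypothetical, and it assumes plain CoNLL-U input with ten tab-separated columns, where HEAD is the seventh column and DEPREL the eighth.

```python
def fix_root_deprel(conllu_text):
    """Set DEPREL to 'root' on every token line whose HEAD is 0.

    Comment lines, blank lines, and anything that does not have the ten
    CoNLL-U columns are passed through unchanged.
    """
    fixed = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#") and cols[6] == "0":
            cols[7] = "root"  # column 8 (index 7) is DEPREL
            line = "\t".join(cols)
        fixed.append(line)
    return "\n".join(fixed)


if __name__ == "__main__":
    sample = (
        "# text = Tak jest\n"
        "1\tTak\ttak\tPART\t_\t_\t0\tparataxis\t_\t_\n"
        "2\tjest\tbyć\tVERB\t_\t_\t1\tcop\t_\t_\n"
    )
    print(fix_root_deprel(sample))
```

This only relabels the relation. A sentence in which several tokens have HEAD 0 would still fail validation after the rewrite, since a tree may have only one root, so such cases would need to be handled separately.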