Closed matyaskopp closed 1 year ago
Is it OK now?
I believe that the label should be Stanza
https://github.com/Keeleressursid/ParlaMint/blob/9d8494c2a69ee2f1957e47b2968346e9776ae5ee/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L111
<application ident="Stanza" version="1.3.0">
<label>EstNLTK</label>
<desc>POS tagging, lemmatization and dependency parsing done with Stanza ver. 1.3.0</desc>
</application>
@nemeek, I have updated the ticks above. You have updated the TEI.ana version, which now does not correspond to the TEI version, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13.ana" ana="#covid">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" type="main">Estonian parliamentary corpus ParlaMint-EE, 2022-01-13 [ParlaMint.ana SAMPLE]</title>
<title xml:lang="en" type="sub">XIV Parliament term, VII istungjärk, Plenary Assembly meeting on Thursday, 13.01.2022, 10:00</title>
<title xml:lang="et" type="sub">XIV Riigikogu, VII istungjärk, täiskogu istung Neljapäev, 13.01.2022, 10:00</title>
<meeting ana="#parla.uni #parla.sitting" n="2022-01-13">2022-01-13</meeting>
<meeting ana="#parla.uni #parla.session" n="rs7">VII regular session</meeting>
<meeting ana="#parla.uni #parla.term #parliament.RK14">XIV Riigikogu</meeting>
vs.
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13" ana="#covid">
<teiHeader>
<fileDesc>
<titleStmt>
<title type="main" xml:lang="en">Estonian parliamentary corpus ParlaMint-EE, 2015-01-12 [ParlaMint SAMPLE]</title>
<meeting ana="#parla.meeting.regular">Korraline istung</meeting>
Also, some suggestions (add file content classification; notes are not recognized in speeches) are still not implemented in the TEI.ana version. If anything is unclear in https://github.com/clarin-eric/ParlaMint/issues/495#issue-1470191277 please ask. I will provide you with a sample or try to clarify it.
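A quick way to catch this kind of header drift between the two samples is to compare the meeting elements of both files. The following is only a minimal sketch, not part of the ParlaMint tooling: the function names `meeting_signature` and `headers_match` are made up for illustration, and it assumes that the TEI and TEI.ana versions of a file should carry identical `ana` and `n` values on `titleStmt/meeting`.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def meeting_signature(xml_text):
    """Collect sorted (ana, n) pairs of all titleStmt/meeting elements."""
    root = ET.fromstring(xml_text)
    return sorted(
        (m.get("ana", ""), m.get("n", ""))
        for m in root.findall(".//tei:titleStmt/tei:meeting", TEI_NS)
    )

def headers_match(tei_text, ana_text):
    """Assumption: both versions should declare the same meeting elements."""
    return meeting_signature(tei_text) == meeting_signature(ana_text)
```

For files on disk, reading each file as text and passing both strings to `headers_match` would flag a pair like the one quoted above, where one header uses `#parla.uni #parla.sitting` and the other `#parla.meeting.regular`.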
Now the TEI files are also updated according to the TEI.ana pattern. What do you mean by 'notes are not recognized in speeches'? The additional info about the sitting is encoded in <title type="sub"> elements.
> I believe that the label should be Stanza
Maybe. Tool is EstNLTK, model is Stanza.
The URL does not match the file; it refers to 2016-09-26:
<idno type="URI">https://stenogrammid.riigikogu.ee/et/201609261500</idno>
should be
<idno type="URI">https://stenogrammid.riigikogu.ee/et/202201131000</idno>
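A mismatch like this could be detected automatically by comparing the date in the file's xml:id with the timestamp at the end of the URI. This is only a sketch under assumptions: the helper name `uri_matches_id` is hypothetical, and the YYYYMMDDhhmm layout of the last URI segment is inferred from the two URLs quoted above, not from any documentation.

```python
import re

def uri_matches_id(xml_id, uri):
    """True if the YYYYMMDD part of the URI timestamp equals the date in the xml:id.

    Assumption: the URI ends in a 12-digit YYYYMMDDhhmm stamp and the
    xml:id contains the sitting date as _YYYY-MM-DD.
    """
    id_date = re.search(r"_(\d{4})-(\d{2})-(\d{2})", xml_id)
    uri_stamp = re.search(r"/(\d{8})\d{4}$", uri)
    if not id_date or not uri_stamp:
        return False
    return "".join(id_date.groups()) == uri_stamp.group(1)
```

With the values from this thread, `uri_matches_id("ParlaMint-EE_2022-01-13", "https://stenogrammid.riigikogu.ee/et/201609261500")` is false, while the corrected URI passes.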
@nemeek I rechecked your sample and ticked off the solved suggestions. There are still some left.
@matyaskopp I have a comment here about notes inside speeches. Sometimes they are marked in brackets inside the text or instead of the text, but sometimes the edited transcripts contain part of what the speaker said in brackets. This means that it is not possible to extract the notes from the texts automatically. And as a manual task - reading through all the transcripts and marking them manually - this is beyond the scope of what we can do.
Is it possible to make a compromise? I believe that there are common notes that are repeated so that they can be searched and annotated. I don't know your full data, but if you can grep all notes and annotate the most common ones...
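To illustrate what I mean by grepping the notes, here is a rough sketch that counts every bracketed span so the most frequent ones can be reviewed by hand. It assumes that notes appear in round brackets; the function name `count_bracketed` is made up, and any sample strings you feed it are your own data, not something I have checked.

```python
import re
from collections import Counter

# Assumption: notes are written in round brackets without nesting.
BRACKETED = re.compile(r"\(([^()]+)\)")

def count_bracketed(speeches):
    """Count how often each bracketed span occurs across all speech texts."""
    counts = Counter()
    for text in speeches:
        counts.update(span.strip() for span in BRACKETED.findall(text))
    return counts
```

Calling `count_bracketed(texts).most_common(50)` would give a candidate list; spans that occur only once are more likely to be part of what the speaker actually said.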
Will look into that and get back to you here about a possible solution.
Alright, we can create a manual list of most likely notes (which will probably include most, but definitely not all notes; and will probably have a few instances of text that are not actually notes) and use that to mark the notes inside speeches.
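Roughly, the list-based marking could look like the sketch below. It is not our actual conversion script: the entries in `KNOWN_NOTES` are placeholders, the name `split_notes` is hypothetical, and it assumes notes are in round brackets. Bracketed text that is not on the list stays part of the speech, which is exactly the trade-off described above.

```python
import re

# Placeholder entries; the real list would be compiled manually from the corpus.
KNOWN_NOTES = {"Aplaus.", "Naer saalis."}

BRACKETED = re.compile(r"\(([^()]+)\)")

def split_notes(text, known=KNOWN_NOTES):
    """Split a speech into ('text', ...) and ('note', ...) segments.

    Only bracketed spans found in `known` become notes.
    """
    segments = []
    pos = 0
    for m in BRACKETED.finditer(text):
        if m.group(1).strip() in known:
            before = text[pos:m.start()].strip()
            if before:
                segments.append(("text", before))
            segments.append(("note", m.group(0)))
            pos = m.end()
    rest = text[pos:].strip()
    if rest:
        segments.append(("text", rest))
    return segments
```

Each `note` segment could then be serialized as a note element and each `text` segment as speech content.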
EE data should be OK now.
@martinmolder, @nemeek I have rechecked your data and updated the ticks:
missing NER tool in application description
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L105-L110
settingDesc should contain corpus timespan in corpus root file
date
TEI: date
TEI.ana: ana="#parla.meeting"
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L117
meeting element in root file: corresp ana - parliament event (eg #parliament.RK11) + #parla.uni
see GR sample: https://github.com/clarin-eric/ParlaMint/blob/e37537be54721c40bf6687cd12d9361759e6b234/Data/ParlaMint-GR/ParlaMint-GR.xml#L10-L12
meeting element in component files: term/session/meeting/sitting (if it makes sense), eg: https://github.com/clarin-eric/ParlaMint/blob/e37537be54721c40bf6687cd12d9361759e6b234/Data/ParlaMint-GR/ParlaMint-GR_2015-02-06-S1-commons.xml#L13-L15
Dates mismatch/confusion
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L2
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L6
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L62
https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L41
add file content classification #parla.meeting or #parla.sitting into the /TEI/@ana attribute
//setting/date/@ana should contain the same value: https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L2
Why haven't you included links to YouTube with recordings?
I think it is a pity that you haven't included YouTube links. Your parliament has them synchronized with speeches.
notes are not recognized in speeches
eg: https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L76
missing initial comments in transcriptions
spaceless lemmas and words
Should be lemmatized in this way:
This produces a validation error because lemmas and words are expected to be spaceless.
In the corpus root you declare that you are doing space normalization: https://github.com/Keeleressursid/ParlaMint/blob/d1f60ab11466d376e0481f917a1e5b0eb555eeb2/Data/ParlaMint-EE/ParlaMint-EE.xml#L58
so I think you can remove the spaces in numbers to solve this error.
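Removing the spaces inside numbers could be done with something like the sketch below before the word and lemma values are written out. The function names are hypothetical and the "10 000" style input is only an illustrative guess at what the offending tokens look like.

```python
import re

def squeeze_number_spaces(token):
    """Remove whitespace that sits between two digits, leaving other text untouched."""
    return re.sub(r"(?<=\d)\s+(?=\d)", "", token)

def has_space(token):
    """True if the token still contains any whitespace and would fail validation."""
    return any(ch.isspace() for ch in token)
```

Running `has_space` over all words and lemmas after the cleanup would show whether any non-numeric tokens with spaces remain and need separate handling.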