clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

EE feedback #495

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

missing NER tool in application description

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L105-L110

      <appInfo>
        <application ident="EstNLTK" version="1.6b">
          <label>EstNLTK</label>
          <desc>POS tagging, lemmatization and dependency parsing done with EstNLTK ver. 1.6b</desc>
        </application>
      </appInfo>

settingDesc should contain corpus timespan in corpus root file

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L117

      <settingDesc>
        <setting>
          <name type="city">Tallinn</name>
          <name type="country" key="EE">Estonia</name>
          <date ana="#parla.meeting" when="2015-01-12">2015-01-12</date>
        </setting>
      </settingDesc>

meeting element in root file

see GR sample: https://github.com/clarin-eric/ParlaMint/blob/e37537be54721c40bf6687cd12d9361759e6b234/Data/ParlaMint-GR/ParlaMint-GR.xml#L10-L12

meeting element in component files

eg: https://github.com/clarin-eric/ParlaMint/blob/e37537be54721c40bf6687cd12d9361759e6b234/Data/ParlaMint-GR/ParlaMint-GR_2015-02-06-S1-commons.xml#L13-L15

Dates mismatch/confusion

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L2

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13" ana="#covid">

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L6

        <title type="main" xml:lang="en">Estonian parliamentary corpus ParlaMint-EE, 2015-01-12 [ParlaMint SAMPLE]</title>

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L62

          <date ana="#parla.meeting" when="2015-01-12">2015-01-12</date>

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L41

<!-- XIII Riigikogu, IV Istungjärk, täiskogu korraline istung Esmaspäev, 26.09.2016, 15:00 -->
<idno type="URI">https://stenogrammid.riigikogu.ee/et/201609261500</idno>

add file content classification

attribute //setting/date/@ana should contain the same value

https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L2

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13" ana="#covid">

Why haven't you included links to youtube with recordings?

I think it is a pity that you haven't included YouTube links. Your parliament has the recordings synchronized with the speeches.

notes are not recognized in speeches

eg: https://github.com/Keeleressursid/ParlaMint/blob/de38a6b928e37c4527213c98b06b25736b1d5ebf/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L76

<seg xml:id="ParlaMint-EE_2022-01-13_U2-P1"> ... Mul on hea meel (Juhataja helistab kella.) sotsiaaldemokraatide ...</seg>

missing initial comments in transcriptions

spaceless lemmas and words

<w xml:id="ParlaMint-EE_2015-01-12_U72-P3.10.5" lemma="10 000"  ...>10 000</w>

Should be lemmatized in this way:

<w xml:id="ParlaMint-EE_2015-01-12_U72-P3.10.5" lemma="10000"  ...>10000</w>

This produces a validation error because lemmas and words are expected to contain no spaces.

In the corpus root you declare that spacing is normalised: https://github.com/Keeleressursid/ParlaMint/blob/d1f60ab11466d376e0481f917a1e5b0eb555eeb2/Data/ParlaMint-EE/ParlaMint-EE.xml#L58

<p xml:lang="en">Text has not been normalised, except for spacing.</p>

So I think you can remove the spaces in numbers to resolve this error.
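Removing the spaces could be done in a post-processing pass over the annotated files. The following is a minimal sketch with the Python standard library, not the method used for ParlaMint-EE. The helper name `strip_token_spaces` and the second token in the sample are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def strip_token_spaces(root):
    """Remove whitespace inside <w> text and @lemma; return how many tokens changed."""
    changed = 0
    for w in root.iter(f"{{{TEI_NS}}}w"):
        text = w.text or ""
        lemma = w.get("lemma", "")
        new_text = re.sub(r"\s+", "", text)
        new_lemma = re.sub(r"\s+", "", lemma)
        if new_text != text or new_lemma != lemma:
            w.text = new_text
            if "lemma" in w.attrib:
                w.set("lemma", new_lemma)
            changed += 1
    return changed


# Hypothetical sentence: the first token mirrors the example in this issue,
# the second one is invented to show that clean tokens stay untouched.
sample = (
    f'<s xmlns="{TEI_NS}">'
    '<w lemma="10 000">10 000</w><w lemma="token">tokens</w>'
    "</s>"
)
root = ET.fromstring(sample)
n_changed = strip_token_spaces(root)
tokens = [(w.text, w.get("lemma")) for w in root.iter(f"{{{TEI_NS}}}w")]
print(n_changed, tokens)
```

The pass only touches `w` elements, so spacing between tokens is left as it is.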

nemeek commented 1 year ago

Is it OK now?

matyaskopp commented 1 year ago

Stanza

I believe that the label should be Stanza https://github.com/Keeleressursid/ParlaMint/blob/9d8494c2a69ee2f1957e47b2968346e9776ae5ee/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L111

        <application ident="Stanza" version="1.3.0">
          <label>EstNLTK</label>
          <desc>POS tagging, lemmatization and dependency parsing done with Stanza  ver. 1.3.0</desc>
        </application>

matyaskopp commented 1 year ago

@nemeek, I have updated the ticks above. You have updated the TEI.ana version, which now does not correspond to the TEI version, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13.ana" ana="#covid">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title xml:lang="en" type="main">Estonian parliamentary corpus ParlaMint-EE, 2022-01-13 [ParlaMint.ana SAMPLE]</title>
        <title xml:lang="en" type="sub">XIV Parliament term, VII istungjärk, Plenary Assembly meeting on Thursday, 13.01.2022, 10:00</title>
        <title xml:lang="et" type="sub">XIV Riigikogu, VII istungjärk, täiskogu istung Neljapäev, 13.01.2022, 10:00</title>
        <meeting ana="#parla.uni #parla.sitting" n="2022-01-13">2022-01-13</meeting>
        <meeting ana="#parla.uni #parla.session" n="rs7">VII regular session</meeting>
        <meeting ana="#parla.uni #parla.term #parliament.RK14">XIV Riigikogu</meeting>

vs.


<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="et" xml:id="ParlaMint-EE_2022-01-13" ana="#covid">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main" xml:lang="en">Estonian parliamentary corpus ParlaMint-EE, 2015-01-12 [ParlaMint SAMPLE]</title>
        <meeting ana="#parla.meeting.regular">Korraline istung</meeting>

Also, some suggestions (add file content classification, notes are not recognized in speeches) are still not implemented in the TEI.ana version. If anything is unclear in https://github.com/clarin-eric/ParlaMint/issues/495#issue-1470191277 please ask. I will provide you with a sample or try to clarify it.

nemeek commented 1 year ago

The TEI files are now also updated to follow the TEI.ana pattern. What do you mean by 'notes are not recognized in speeches'? The additional information about the sitting is encoded in <title type="sub"> elements.

nemeek commented 1 year ago

Stanza

* [ ]  Stanza

I believe that the label should be Stanza https://github.com/Keeleressursid/ParlaMint/blob/9d8494c2a69ee2f1957e47b2968346e9776ae5ee/Data/ParlaMint-EE/ParlaMint-EE.ana.xml#L111

        <application ident="Stanza" version="1.3.0">
          <label>EstNLTK</label>
          <desc>POS tagging, lemmatization and dependency parsing done with Stanza  ver. 1.3.0</desc>
        </application>

Maybe. Tool is EstNLTK, model is Stanza.

matyaskopp commented 1 year ago

source url

The URL does not match the file; it refers to 2016-09-26.

https://github.com/Keeleressursid/ParlaMint/blob/2d7391cc5a3ebee96dfc26e5a80bddc270ac1a9f/Data/ParlaMint-EE/ParlaMint-EE_2022-01-13.xml#L45

          <idno type="URI">https://stenogrammid.riigikogu.ee/et/201609261500</idno>

should be

          <idno type="URI">https://stenogrammid.riigikogu.ee/et/202201131000</idno>

matyaskopp commented 1 year ago

@nemeek I rechecked your sample and ticked off the solved suggestions. There are still some left.

martinmolder commented 1 year ago

@matyaskopp I have a comment here about notes inside speeches. Sometimes they are marked in brackets inside the text or instead of the text, but sometimes the edited transcripts put part of what the speaker said in brackets. This means the notes cannot be extracted from the texts automatically. Doing it manually, by reading through all the transcripts and marking each note, is beyond the scope of what we can do.

matyaskopp commented 1 year ago

Is it possible to make a compromise? I believe some common notes are repeated often enough that they can be searched for and annotated. I don't know your full data, but if you can grep all notes and annotate the most common ones...
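The grep step could look roughly like this. It is a sketch only: it counts every bracketed passage as a candidate, so speaker text in brackets is counted too and the list still needs a manual look. The function name `count_candidates` and the second bracketed passage in the sample are invented for illustration.

```python
import re
from collections import Counter

# Matches a non-nested bracketed passage and captures its content.
PAREN = re.compile(r"\(([^()]+)\)")


def count_candidates(segments):
    """Count how often each bracketed passage occurs across segment texts."""
    counts = Counter()
    for text in segments:
        counts.update(m.strip() for m in PAREN.findall(text))
    return counts


# Hypothetical segment texts; only the first note comes from this issue.
segments = [
    "Mul on hea meel (Juhataja helistab kella.) sotsiaaldemokraatide ...",
    "... (Juhataja helistab kella.) ...",
    "... (Another bracketed passage.) ...",
]
top = count_candidates(segments).most_common(2)
print(top)
```

Sorting by frequency puts the repeated, formulaic notes at the top, which is where annotating a fixed list pays off most.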

martinmolder commented 1 year ago

Will look into that and get back to you here about a possible solution.

martinmolder commented 1 year ago

Alright, we can create a manual list of most likely notes (which will probably include most, but definitely not all notes; and will probably have a few instances of text that are not actually notes) and use that to mark the notes inside speeches.
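Applying such a list could be sketched as below. This is an illustration under assumptions, not the actual ParlaMint-EE pipeline: the set `KNOWN_NOTES` and the helper `mark_notes` are hypothetical, the segment is assumed to contain plain text only, the sample omits the TEI namespace, and which element the ParlaMint schema expects for a given kind of note is not stated in this thread (TEI `<note>` is used here).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical manual list of passages that are treated as notes.
KNOWN_NOTES = {"Juhataja helistab kella."}


def mark_notes(seg, known):
    """Wrap bracketed passages found in `known` in <note> children of a text-only seg."""
    # The capture group keeps the bracketed passages in the split result.
    parts = re.split(r"(\([^()]+\))", seg.text or "")
    seg.text = ""
    last = None
    for part in parts:
        if part.startswith("(") and part[1:-1].strip() in known:
            last = ET.SubElement(seg, "note")
            last.text = part
            last.tail = ""
        elif last is None:
            seg.text += part   # text before the first note
        else:
            last.tail += part  # text after the most recent note
    return seg


# Hypothetical segment; "(spoken aside)" stands for bracketed speaker text.
seg = ET.fromstring(
    "<seg>Mul on hea meel (Juhataja helistab kella.) "
    "sotsiaaldemokraatide (spoken aside) ...</seg>"
)
result = ET.tostring(mark_notes(seg, KNOWN_NOTES), encoding="unicode")
print(result)
```

Bracketed text that is not on the list stays in the speech, so false negatives are possible but speaker text is not turned into notes by accident.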

nemeek commented 1 year ago

EE data should be OK now.

matyaskopp commented 1 year ago

@martinmolder, @nemeek I have rechecked your data and updated the ticks: