clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

DK: Missing values for term, session etc. #750

Closed jonatankrause closed 7 months ago

jonatankrause commented 1 year ago

Hi,

I noticed some missing values when I was playing around with the Danish corpus. I wrote to the email on the page, but got redirected here. Specifically, I found the following to be missing:

I noticed that the agenda information is present in the 2009-2017 Danish dataset, but I am interested in the more up-to-date parlamint corpora.

Thank you! Please ask if I can clarify anything. I added a screenshot of the issue below:

[Screenshot 2023-09-07 at 21 04 28]

jonatankrause commented 1 year ago

Specifically, I am talking about the ParlaMint-DK.txt directory (I am not so tech savvy, so wasn't sure how to work with the .xml files).

jonatankrause commented 1 year ago

Haha, I guess there are actually five variables with missing values, but I just wasn't sure how much of it was a bug and how much of it had to do with possibly missing source data (since the agenda variable was present in the other Danish dataset, I thought it might be a bug).

matyaskopp commented 1 year ago

You are right. There are no terms in the DK corpus; this has already been reported in #711.

You can explore the corpus here:

terms: https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_dk&tab=basic&filter=containing&onecolumn=1&wlattr=speech.term&wlminfreq=1&include_nonwords=1&itemsPerPage=50&showresults=1&cols=%5B%22frq%22%5D&wlsort=frq

sessions: https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_dk&tab=basic&filter=containing&onecolumn=1&wlattr=speech.session&wlminfreq=1&include_nonwords=1&itemsPerPage=50&showresults=1&cols=%5B%22frq%22%5D&wlsort=frq

meetings: https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_dk&tab=basic&filter=containing&onecolumn=1&wlattr=speech.meeting&wlminfreq=1&include_nonwords=1&itemsPerPage=50&showresults=1&cols=%5B%22frq%22%5D&wlsort=frq

sitting: https://www.clarin.si/ske/#text-type-analysis?corpname=parlamint30_dk&tab=basic&filter=containing&onecolumn=1&wlattr=speech.sitting&wlminfreq=1&include_nonwords=1&itemsPerPage=50&showresults=1&cols=%5B%22frq%22%5D&wlsort=frq

jonatankrause commented 1 year ago

Hi @matyaskopp,

Thank you. Yes, I saw that the terms issue had been reported.

Thanks for the links. I'm not familiar with the NoSketch engine ... does this allow me to generate a dataset that includes the missing values?

If not, Tomaž Erjavec said that you were currently working on a new release. Do you know approximately when it will come out, and whether it will include the missing variables (am just interested in the timeframe, as I'm currently working on a project).

Hope you can help - thanks in advance,

jonatankrause commented 1 year ago

Hi again @matyaskopp

Just to let you know that I encountered a similar issue with Terms as well as Session, Meeting, and (not least) Agenda variables missing - this time in the Swedish parlamint corpus (the english-language version). Looks like it's missing throughout the whole corpus. I don't know if this is just a general bug, but just wanted to point it out in case you hadn't seen.

Thank you - I look forward to the next release :).

matyaskopp commented 1 year ago

> If not, Tomaž Erjavec said that you were currently working on a new release. Do you know approximately when it will come out, and whether it will include the missing variables (am just interested in the timeframe, as I'm currently working on a project).

The next release will be out soon (sorry for not specifying what "soon" means). The release will be announced at https://www.clarin.eu/parlamint.

> Just to let you know that I encountered a similar issue with Terms as well as Session, Meeting, and (not least) Agenda variables missing - this time in the Swedish parlamint corpus (the english-language version).

Agenda is available only in the Czech corpus, because that corpus splits its files by topic, so anyone can follow the discussion of a single topic across the whole corpus.

> Looks like it's missing throughout the whole corpus. I don't know if this is just a general bug, but just wanted to point it out in case you hadn't seen.

ParlaMint-SE contains only sittings (894 different values) and terms (7 different values). I will check this more closely and report it in a separate issue.

jonatankrause commented 1 year ago

Thanks @matyaskopp

> Agenda is available only in the Czech corpus, because it splits files by the topic, so everyone can follow the discussion of one topic over the whole corpus.

Ah, okay. So future parlamint releases will include information on agenda in only the Czech corpus as well?

I'm just asking because the "The Danish Parliament Corpus 2009 - 2017, v1" on Clarin (EDIT: version 2: https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/44) contains the variable "Agenda title" that allows you to see the formal agenda under discussion for each speech (e.g. "Negotiation of F 14: On Greenland's Economy", "Negotiation of F 7: About the future of municipalities and regions"), which can be very useful.

I thought that if this dataset and the parlamint datasets were built from the same data source, maybe it would be possible to include the Agenda title variables?

matyaskopp commented 1 year ago

> Ah, okay. So future parlamint releases will include information on agenda in only the Czech corpus as well?

Not in 3.1, and I don't expect it in future releases either. If this information is stored in the stenographic notes, it may be hard to parse because it can also contain typos, so it is too much to ask every partner to split their XML files by topic.

> I'm just asking because the "The Danish Parliament Corpus 2009 - 2017, v1" on Clarin (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/8) contains the variable "Agenda title" that allows you to see the formal agenda under discussion for each speech (e.g. "Negotiation of F 14: On Greenland's Economy", "Negotiation of F 7: About the future of municipalities and regions"), which can be very useful.

ParlaMint-DK contains some of this information in the text:

         <div type="debateSection">
            <head>1. behandling af B 30: Om grænsekontrol ved indrejse fra Sverige til Danmark.</head>
            <note type="agendaItem">2020-01-07-2</note>

The note is not very useful here because its value is almost unique across the corpus. Its prefix is a date, so the same topic gets a different value when it is discussed on a different date. You can only link occurrences of the same topic discussed on the same day.
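To make the limitation concrete, here is a minimal Python sketch that splits such a note value into its date prefix and the trailing number. It assumes the value always has the shape `YYYY-MM-DD-N` seen in the excerpt above and that the trailing number is an item counter within that day; the function name `parse_agenda_note` is made up for illustration and is not part of any ParlaMint tooling.

```python
from datetime import date

def parse_agenda_note(value):
    """Split e.g. '2020-01-07-2' into (date(2020, 1, 7), 2).

    Assumption: everything before the last hyphen is an ISO date and
    the part after it is an integer counter for that day.
    """
    day, _, item = value.strip().rpartition("-")
    return date.fromisoformat(day), int(item)

# Two notes can only be matched when both the date and the counter agree,
# so the same bill debated on two different days never gets the same key.
first = parse_agenda_note("2020-01-07-2")
second = parse_agenda_note("2020-01-14-2")
```

Here `first` and `second` share the counter but not the date, so nothing in the note itself says whether they belong to the same topic.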

The head contains various values; these are the most frequent ones, with their number of occurrences:

   2026 <head>Punkt 0</head>
    186 <head>Besvarelse af oversendte spørgsmål til ministrene (spørgetid).</head>
     81 <head>Indstilling fra Udvalget til Valgs Prøvelse</head>
     36 <head>Spørgetime med statsministeren.</head>
     35 <head>Spørgsmål om meddelelse af orlov til og indkaldelse af stedfortræder for</head>
     20 <head>Udvidet spørgetime med statsministeren.</head>
     13 <head>Meddelelser fra formanden</head>
     11 <head>Forhandling af R 1: Om statsministerens åbningsredegørelse.</head>
     10 <head>Valg af stående udvalg m.v.</head>
     10 <head>Valg af formand.</head>

If I search for CAPITAL_LETTER SPACE NUMBER COLON, I get more helpful(?) information, but I am still not sure whether each match can be correctly assigned to a unique discussed topic.

# occurrences topic_identification
cat 20*/*|grep -o '<head.*>'| grep -Po '[A-Z] [0-9]+:'|sort|uniq -c |sort -nr|head
     42 L 1:
     27 L 6:
     27 L 5:
     27 L 4:
     25 L 41:
     25 L 155:
     25 L 134:
     24 L 9:
     24 L 99:
     24 L 97:
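For anyone more at home with dataframes than with grep, the same search can be sketched in Python. This is only a rough sketch under several assumptions: the files use the TEI namespace `http://www.tei-c.org/ns/1.0`, each `debateSection` div carries its `head` and `agendaItem` note as direct children (as in the excerpt above), and the names `debate_sections`, `local`, and `sample` are made up for illustration. The sample document is built from that excerpt and is not a real corpus file.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Same pattern as the grep above: capital letter, space, number, colon.
ITEM_RE = re.compile(r"([A-Z]) ([0-9]+):")

def local(tag):
    """Drop a '{namespace}' prefix so tags match with or without a namespace."""
    return tag.rsplit("}", 1)[-1]

def debate_sections(xml_text):
    """Yield (head text, item id or None, agendaItem note or None) per debateSection."""
    root = ET.fromstring(xml_text)
    for div in root.iter():
        if local(div.tag) != "div" or div.get("type") != "debateSection":
            continue
        head = note = None
        for child in div:
            if local(child.tag) == "head" and head is None:
                head = (child.text or "").strip()
            elif local(child.tag) == "note" and child.get("type") == "agendaItem":
                note = (child.text or "").strip()
        match = ITEM_RE.search(head or "")
        item = f"{match.group(1)} {match.group(2)}" if match else None
        yield head, item, note

# Hypothetical sample assembled from the excerpt above, not a real file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="debateSection">
<head>1. behandling af B 30: Om grænsekontrol ved indrejse fra Sverige til Danmark.</head>
<note type="agendaItem">2020-01-07-2</note>
</div>
<div type="debateSection"><head>Punkt 0</head></div>
</body></text></TEI>"""

rows = list(debate_sections(sample))
counts = Counter(item for _, item, _ in rows if item)
```

The `rows` list can be handed to `pandas.DataFrame(rows, columns=["head", "item", "note"])` for further analysis. Whether an identifier such as `B 30` really pins down one topic across parliamentary years is exactly the open question raised above, so the counts should be treated as exploratory.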

You can try to do some analysis, but this information will probably never be correctly encoded in the corpus. So it is better not to have it at all than to introduce confusion. You can try to contact the authors of the corpus (@BartJongejan, @constanza1) and motivate or help them to include this information in subsequent releases (ParlaMint 4.0 ??).

jonatankrause commented 1 year ago

Thank you @matyaskopp .

Ah, okay - I didn't realise that this information was present in the XML files in the ParlaMint-DK (am familiar with only a narrow range of formats).

I don't know - it seems like the pattern search you jotted down there did a pretty good job, so a very effective one could probably be generated quite easily (which I guess they must have done with "The Danish Parliament Corpus 2009 - 2017, v1"). I'd be happy to help @BartJongejan or @constanza1 with anything falling within my area of competence (which, as alluded to, is limited on the programming side of things - I have mostly worked with pretty clean dataframes in R & python).

matyaskopp commented 1 year ago

Dorte is also on the DK team, but I do not know her GitHub nick, so I did not mention her; see https://github.com/clarin-eric/ParlaMint/blob/643f902481a47e942b713febe9613c9f5472ea82/Samples/ParlaMint-DK/ParlaMint-DK.xml#L50-L55. Probably some information got lost during the conversion.

jonatankrause commented 1 year ago

Ah alright. Thanks for your help.

Just a correction: I mistakenly referenced version 1 of The Danish Parliament Corpus 2009 - 2017 above. It's version 2 (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/44) that has the .txt files with extracted agenda metadata.

TomazErjavec commented 10 months ago

Yup, this bug (i.e. missing DK terms) still exists in 4.0; I just noticed it when trying to generate a table with overview info on the corpora. It seems like DK terms will be added in the Future (milestone).

TomazErjavec commented 7 months ago

This issue was solved some time ago (cf. #711), so I am closing this one too.