Closed jonatankrause closed 7 months ago
Specifically, I am talking about the ParlaMint-DK.txt directory (I am not so tech savvy, so wasn't sure how to work with the .xml files).
Haha, I guess there are actually five variables with missing values, but just wasn't sure how much of it was a bug and how much of it had to do with possibly missing source data (since the agenda variable was present in the other Danish dataset, I thought it might be a bug)
You are right. There are no terms in the DK corpus; this has already been reported here: #711
You can explore the corpus here:
Hi @matyaskopp,
Thank you. Yes, I saw that the terms issue had been reported.
Thanks for the links. I'm not familiar with the NoSketch engine ... does this allow me to generate a dataset that includes the missing values?
If not, Tomaž Erjavec said that you were currently working on a new release. Do you know approximately when it will come out, and whether it will include the missing variables (am just interested in the timeframe, as I'm currently working on a project).
Hope you can help - thanks in advance,
Hi again @matyaskopp
Just to let you know that I encountered a similar issue with Terms as well as Session, Meeting, and (not least) Agenda variables missing - this time in the Swedish parlamint corpus (the english-language version). Looks like it's missing throughout the whole corpus. I don't know if this is just a general bug, but just wanted to point it out in case you hadn't seen.
Thank you - I look forward to the next release :).
If not, Tomaž Erjavec said that you were currently working on a new release. Do you know approximately when it will come out, and whether it will include the missing variables (am just interested in the timeframe, as I'm currently working on a project).
The next release will be soon (sorry not specifying what soon means). The release will be announced at https://www.clarin.eu/parlamint.
Just to let you know that I encountered a similar issue with Terms as well as Session, Meeting, and (not least) Agenda variables missing - this time in the Swedish parlamint corpus (the english-language version).
Agenda is available only in the Czech corpus, because it splits files by the topic, so everyone can follow the discussion of one topic over the whole corpus.
Looks like it's missing throughout the whole corpus. I don't know if this is just a general bug, but just wanted to point it out in case you hadn't seen.
ParlaMint-SE contains only sittings (894 different values) and terms (7 different values). I will test it more precisely and report it in a separate issue.
Thanks @matyaskopp
Agenda is available only in the Czech corpus, because it splits files by the topic, so everyone can follow the discussion of one topic over the whole corpus.
Ah, okay. So future parlamint releases will include information on agenda in only the Czech corpus as well?
I'm just asking because the "The Danish Parliament Corpus 2009 - 2017, v1" on Clarin (EDIT: version 2: https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/44) contains the variable "Agenda title" that allows you to see the formal agenda under discussion for each speech (e.g. "Negotiation of F 14: On Greenland's Economy", "Negotiation of F 7: About the future of municipalities and regions"), which can be very useful.
I thought that if this dataset and the parlamint datasets were built from the same data source, maybe it would be possible to include the Agenda title variables?
Ah, okay. So future parlamint releases will include information on agenda in only the Czech corpus as well?
Not in 3.1, and I don't expect it in future releases either. If this information is stored in stenographic notes, it may not be easy to parse, because the notes can also contain typos, so it is too much to ask every partner to split their XML files by topic.
I'm just asking because the "The Danish Parliament Corpus 2009 - 2017, v1" on Clarin (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/8) contains the variable "Agenda title" that allows you to see the formal agenda under discussion for each speech (e.g. "Negotiation of F 14: On Greenland's Economy", "Negotiation of F 7: About the future of municipalities and regions"), which can be very useful.
ParlaMint-DK contains some info in the text:
<div type="debateSection">
<head>1. behandling af B 30: Om grænsekontrol ved indrejse fra Sverige til Danmark.</head>
<note type="agendaItem">2020-01-07-2</note>
The note here is a bit useless because it is almost unique across the corpus. The prefix is a date, so the same topic gets a different value when the date is different. You can only link the same topic discussed on the same day.
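For anyone who prefers working with data frames, here is a minimal Python sketch of how the head and the agendaItem note could be pulled out of a debateSection div. It assumes the files use the TEI namespace and that head and note are direct children of the div, as in the fragment above. The SAMPLE wrapper (TEI, text, body) and the function name agenda_items are made up for illustration and are not taken from the corpus tooling.

```python
import xml.etree.ElementTree as ET

# TEI namespace; assumed to be the default namespace of the corpus files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical minimal document wrapped around the fragment quoted above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="debateSection">
<head>1. behandling af B 30: Om grænsekontrol ved indrejse fra Sverige til Danmark.</head>
<note type="agendaItem">2020-01-07-2</note>
</div>
</body></text></TEI>"""


def agenda_items(xml_text):
    """Return one dict per debateSection with its head, date and item number."""
    root = ET.fromstring(xml_text)
    items = []
    for div in root.iterfind(".//tei:div[@type='debateSection']", TEI_NS):
        head = div.findtext("tei:head", default="", namespaces=TEI_NS)
        note = div.findtext(
            "tei:note[@type='agendaItem']", default="", namespaces=TEI_NS
        )
        # The note looks like DATE-NUMBER, so split on the last hyphen.
        date, _, number = note.rpartition("-")
        items.append({"head": head.strip(), "date": date, "item": number})
    return items


for row in agenda_items(SAMPLE):
    print(row)
```

The resulting list of dicts can be passed straight to pandas.DataFrame if a table is more convenient. Because the note combines a date with a running number, it only identifies a topic within one day, as described above.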
The head contains various values; the most frequent ones, with number of occurrences:
2026 <head>Punkt 0</head>
186 <head>Besvarelse af oversendte spørgsmål til ministrene (spørgetid).</head>
81 <head>Indstilling fra Udvalget til Valgs Prøvelse</head>
36 <head>Spørgetime med statsministeren.</head>
35 <head>Spørgsmål om meddelelse af orlov til og indkaldelse af stedfortræder for</head>
20 <head>Udvidet spørgetime med statsministeren.</head>
13 <head>Meddelelser fra formanden</head>
11 <head>Forhandling af R 1: Om statsministerens åbningsredegørelse.</head>
10 <head>Valg af stående udvalg m.v.</head>
10 <head>Valg af formand.</head>
If I search for CAPITAL_LETTER SPACE NUMBER COLON, then I get more helpful(?) information, but I am still not sure if it is correctly assigned to the unique discussed topic.
# occurrences topic_identification
cat 20*/*|grep -o '<head.*>'| grep -Po '[A-Z] [0-9]+:'|sort|uniq -c |sort -nr|head
42 L 1:
27 L 6:
27 L 5:
27 L 4:
25 L 41:
25 L 155:
25 L 134:
24 L 9:
24 L 99:
24 L 97:
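The same CAPITAL_LETTER SPACE NUMBER COLON search can be sketched in Python, for those who would rather not use grep. This is only an illustration: the function name count_topic_ids is hypothetical, and the three heads in HEADS are copied from this thread rather than read from the corpus.

```python
import re
from collections import Counter

# Same pattern as in the grep call: capital letter, space, digits, colon.
TOPIC_ID = re.compile(r"[A-Z] [0-9]+:")

# Head texts quoted earlier in this thread, used here as sample input.
HEADS = [
    "1. behandling af B 30: Om grænsekontrol ved indrejse fra Sverige til Danmark.",
    "Forhandling af R 1: Om statsministerens åbningsredegørelse.",
    "Punkt 0",
]


def count_topic_ids(heads):
    """Count every topic-like identifier found in the given head texts."""
    counts = Counter()
    for head in heads:
        counts.update(TOPIC_ID.findall(head))
    # Most frequent first, like `sort | uniq -c | sort -nr`.
    return counts.most_common()


for topic_id, n in count_topic_ids(HEADS):
    print(n, topic_id)
```

As with the grep version, a match only shows that the head mentions something shaped like a bill or motion number. It does not guarantee that two heads with the same identifier refer to the same topic.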
You can try to do some analysis, but this information will probably never be correctly encoded in the corpus. So, it is better to not have it at all instead of introducing confusion. You can try to contact the authors (@BartJongejan, @constanza1) of the corpus and try to motivate them or help them to have this information in the subsequent releases (ParlaMint 4.0 ??)
Thank you @matyaskopp .
Ah, okay - I didn't realise that this information was present in the XML files in the ParlaMint-DK (am familiar with only a narrow range of formats).
I don't know - it seems like the pattern search you jotted down there did a pretty good job, so a very effective one could probably be generated quite easily (which I guess they must have done with the "The Danish Parliament Corpus 2009 - 2017, v1"). I'd be happy to help @BartJongejan or @CONSTANZA1 with anything falling within my area of competence (which, as alluded to, is limited on the programming side of things - I have mostly worked with pretty clean dataframes in R & python).
- But maybe it would be more useful to ask the authors of The Danish Parliament Corpus 2009 - 2017, v1 (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/8) how they extracted the information ...
Dorte is also in the DK team, but I do not know her GitHub nick, so I did not mention her here: https://github.com/clarin-eric/ParlaMint/blob/643f902481a47e942b713febe9613c9f5472ea82/Samples/ParlaMint-DK/ParlaMint-DK.xml#L50-L55 Probably some information gets lost during the conversion.
Ah alright. Thanks for your help.
Just a correction: I mistakenly referenced version 1 of The Danish Parliament Corpus 2009 - 2017 above. It's version 2 (https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/44) that has the .txt files with extracted agenda metadata.
Yup, this bug (i.e. missing DK terms) still exists in 4.0; I just noticed it when trying to generate a table with overview info on the corpora. It seems like DK terms will be added in the Future (milestone).
This issue has been solved some time ago (cf. #711), so, closing this one too.
Hi,
I noticed some missing values when I was playing around with the Danish corpus. I wrote to the email on the page, but got redirected here. Specifically, I found the following to be missing:
I noticed that the agenda information is present in the 2009-2017 Danish dataset, but I am interested in the more up-to-date parlamint corpora.
Thank you! Please ask if I can clarify anything. I added a screenshot of the issue below: