PolMine / GermaParl2

GermaParl corpus of plenary protocols (v2)

The end of debates is missed in a number of protocols #1

Closed ChristophLeonhardt closed 6 months ago

ChristophLeonhardt commented 1 year ago

Issue

Occasionally, the end of a debate is not recognized. As a consequence, appendices and speeches that were not held during the debate but entered into the minutes later are accidentally included in the final TEI and thus in the CWB corpus. This deviates from the regular approach of including only speeches held during the debate. The extent of this extra content varies widely between protocols, from single lines to multiple additional speeches.

Example of the Issue

A preliminary analysis suggests that this affects quite a few sessions in the 2nd, the 17th and, in particular, the 18th legislative period. It also occurs in other legislative periods, albeit to a lesser extent.

As an example, see protocol 18/200.

Discussion

At first glance, there seem to be multiple reasons for this:

ChristophLeonhardt commented 6 months ago

GermaParl v2.0.1 addresses this, in particular by adding further regular expressions to detect the end of debates.

While this does not necessarily mean that all appendices are removed, it resolves the issue as initially reported. I will close this for now.

If this issue is observed again, a new issue should be opened.
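
For readers unfamiliar with the approach, here is a minimal sketch of what regex-based end-of-debate detection could look like. It is not the code used in GermaParl: the pattern, the marker wording it assumes (a closing stage direction such as "(Schluss: 16.48 Uhr)" or "(Schluß der Sitzung: 13.05 Uhr.)") and the function name `truncate_at_end_of_debate` are all hypothetical and only illustrate the idea of cutting a protocol at the first matching line.

```python
import re

# Hypothetical pattern: assumes the debate ends with a parenthesised closing
# note such as "(Schluss: 16.48 Uhr)" or "(Schluß der Sitzung: 13.05 Uhr.)".
# The expressions actually used in GermaParl are not shown in this issue.
END_OF_DEBATE = re.compile(
    r"^\s*\(\s*Schlu(?:ss|ß)(?:\s+der\s+Sitzung)?\s*:?\s*"
    r"\d{1,2}[.:]\d{2}\s*Uhr\s*\.?\s*\)\s*\.?\s*$"
)


def truncate_at_end_of_debate(lines):
    """Keep everything up to and including the first end-of-debate marker.

    Returns a tuple (kept_lines, found). If no marker is found, all lines
    are returned unchanged and found is False, so that the protocol can be
    flagged for manual inspection instead of being silently accepted.
    """
    for i, line in enumerate(lines):
        if END_OF_DEBATE.match(line):
            return list(lines[: i + 1]), True
    return list(lines), False


if __name__ == "__main__":
    protocol = [
        "Die Sitzung ist geschlossen.",
        "(Schluss: 16.48 Uhr)",
        "Anlage 1",
        "Zu Protokoll gegebene Reden",
    ]
    kept, found = truncate_at_end_of_debate(protocol)
    print(found, kept)
```

Returning a `found` flag alongside the truncated text makes protocols without a recognized marker easy to list, which is where appendices would otherwise slip into the corpus.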