Closed ChristophLeonhardt closed 6 months ago
GermaParl v2.0.1 addresses this, in particular by adding further regular expressions to detect the end of debates.
While this does not necessarily mean that all appendices are removed, it resolves the initial issue. I will close this for now.
If this issue is observed again, a new issue should be opened.
Issue
Occasionally, the end of a debate is not recognized. As a consequence, appendices or speeches which were not held during the debate but entered into the minutes later are accidentally included in the final TEI and thus in the CWB corpus. This departs from the regular approach of including only speeches held during the debate. The extent of this extra content varies widely between protocols, from single lines to multiple additional speeches.
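To illustrate the kind of check involved, here is a minimal sketch of end-of-debate detection with a regular expression. It is not the GermaParl implementation. It assumes that a protocol closes with a parenthesised line such as "(Schluss: 17.32 Uhr)" or "(Schluss der Sitzung: 17.32 Uhr)"; the pattern `END_OF_DEBATE` and the helper `truncate_at_end_of_debate` are hypothetical names introduced for this example.

```python
import re

# Assumed closing marker, e.g. "(Schluss: 17.32 Uhr)" or
# "(Schluß der Sitzung: 9.05 Uhr)". Illustrative only; this pattern is
# not taken from the GermaParl code base.
END_OF_DEBATE = re.compile(
    r"^\s*\(\s*Schlu(?:ss|ß)(?:\s+der\s+Sitzung)?\s*:?\s*"
    r"\d{1,2}[.:]\d{2}\s*Uhr\s*\)?\s*\.?\s*$"
)


def truncate_at_end_of_debate(lines):
    """Return the lines up to and including the first closing marker.

    If no marker is found, all lines are returned unchanged, which is
    the failure mode described in this issue: everything after the
    debate ends up in the output.
    """
    for i, line in enumerate(lines):
        if END_OF_DEBATE.match(line):
            return lines[: i + 1]
    return lines


# Hypothetical protocol fragment: the last two lines belong to an appendix.
sample = [
    "Die Sitzung ist geschlossen.",
    "(Schluss: 17.32 Uhr)",
    "Anlage 1",
    "Zu Protokoll gegebene Rede",
]
debate_only = truncate_at_end_of_debate(sample)
```

Any protocol whose closing line deviates from the assumed pattern (different wording, OCR errors, missing time) passes through untruncated, so a sketch like this would need one pattern per observed variant.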
Example of the Issue
A preliminary analysis suggests that this is the case in quite a few sessions in the 2nd, the 17th and in particular the 18th legislative period. It also occurs in other legislative periods, albeit to a lesser extent.
As an example, see protocol 18/200.
Discussion
At first glance, there seem to be multiple reasons for this: