RyokoAI / BigKnow2022

BigKnow2022: Bringing Language Models Up to Speed

Add speech protocols from national parliaments #5

Open michaelbogdan opened 1 year ago

michaelbogdan commented 1 year ago

Most, if not all, national parliaments publish official transcripts of all speeches held in parliament, and these are free to use. One dataset built on this source is ParlSpeech v2, which contains speeches from several democratic national parliaments.

I found CPP-BT, which contains all transcripts of the German Bundestag and weighs in at just under 500 MB compressed, to be a rich source of German speech. It could be extended with the transcripts of the Bundesrat and of the parliaments of the 16 constituent states. The same author has published several other corpora covering decisions of German federal courts and international courts, as well as German federal law and international law.

Overall, the almost fifty European countries should be producing literal gigabytes of speech data in the various languages of Europe. Adding the transcripts of the European Parliament might be superfluous, as they are already contained in The Pile as EuroParl. That corpus could, however, be extended with parallel corpora of the European Council and the Council of Europe, and with more current data.