This is a complete pipeline to create a comparable/parallel corpus made of the European Parliament's proceedings enriched with speakers' metadata.

This pipeline has been tested on macOS Sierra and should work on other UNIX-like systems too. Python 3 is required for almost every script, and some scripts need additional Python modules and/or external tools. Check the specific requirements of each script.
Related projects:

- `localization`, resources for the adaptation of the scripts to different languages.

Repo contents:

- `LICENSE`, MIT license.
- `README.md`, this file.
- `add_metadata.py`, script to add MEPs' metadata (CSV) to proceedings (XML).
- `add_sentences.py`, script to split text into sentences with NLTK's Punkt tokenizer.
- `compile.sh`, script to run the whole pipeline to compile the EuroParl corpus.
- `dates.txt`, one date per line in format YYYY-MM-DD.
- `get_meps.py`, script to scrape MEPs' information.
- `get_proceedings.py`, script to scrape the proceedings of the European Parliament.
- `langid_filter.py`, script to filter out paragraphs whose actual language is not the expected one (that of the proceedings).
- `meps_ie.py`, script to extract MEPs' metadata from HTML to CSV.
- `proceedings_txt.py`, script to extract text from HTML proceedings.
- `proceedings_xml.py`, script to model text and metadata from HTML proceedings as XML.
- `translationese_filter.py`, script to classify utterances as originals, translations, and even by native speaker.
- `treetagger.py`, script to tokenize, lemmatize and PoS-tag with TreeTagger, producing well-formed XML.

You can find the complete pipeline to compile the EuroParl corpus in `compile.sh`. It runs the scripts in the following order:
1. `get_proceedings.py`
2. `get_meps.py`
3. `meps_ie.py`
4. `proceedings_xml.py`
5. `langid_filter.py`
6. `add_metadata.py`
7. `add_sentences.py`
8. `treetagger.py`
9. `translationese_filter.py`
## add_metadata.py

It adds MEPs' metadata to interventions in the proceedings. Be aware that not all speakers addressing the European Parliament are MEPs: members of other European institutions, representatives of national institutions, guests, etc. speak there too. For those speakers there is currently no metadata beyond the information extracted from the proceedings themselves.
It reads 3 files containing the metadata:

- `meps.csv`, which contains basic information about each MEP: id, name, nationality, birth date, birth place, death date, death place.
- `national_parties.csv`, which contains the political affiliation of the MEP in his/her country: id, start date, end date, and name of the party.
- `political_groups.csv`, which contains the political affiliation at the European Parliament: id, Member State, start date, end date, name of the group, role within the group.

For each proceeding in XML, it retrieves all interventions whose speaker is an MEP and adds the relevant speaker's metadata to the intervention. By relevant we mean the information valid on the day of the session.
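A minimal sketch of that date-validity lookup, assuming hypothetical column names (`id`, `start_date`, `end_date`) in `national_parties.csv`:

```python
import csv
from datetime import date

def affiliation_on(csv_path, mep_id, session_date):
    """Return the national party row valid for a MEP on a session date.

    Column names are assumptions for illustration; the real headers may differ.
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            if row['id'] != mep_id:
                continue
            start = date.fromisoformat(row['start_date'])
            # an empty end date means the affiliation is still current
            end = date.fromisoformat(row['end_date']) if row['end_date'] else date.max
            if start <= session_date <= end:
                return row
    return None
```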
## add_sentences.py

It splits the text contained in a given XML element into sentences, using the NLTK Punkt tokenizer.

For each XML file, it extracts all elements containing text. Each unit is passed to the tokenizer, which returns a list of sentences; these are then converted into subelements of the element that contained the text.
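A minimal sketch of this step with `lxml` and NLTK's pre-trained English Punkt model (element and tag names are illustrative, not the actual schema; nested markup and tail text are ignored):

```python
import nltk.data  # requires: nltk.download('punkt')
from lxml import etree

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def add_sentences(tree, xpath='//p'):
    """Wrap the text of every matched element in <s> subelements."""
    for elem in tree.xpath(xpath):
        text, elem.text = elem.text or '', None
        for sentence in tokenizer.tokenize(text):
            s = etree.SubElement(elem, 's')
            s.text = sentence
    return tree
```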
## compile.sh

It runs the whole pipeline in one shot.

It is a shell script running a list of commands sequentially. If no arguments are provided, it runs the full pipeline for all the supported languages. One can provide the `-l` language argument to run the pipeline only for a particular language, and the `-p` pattern argument to restrict the processing to a year, a month, a day... (see the example invocations below).

Requirements: all the requirements listed in this section, plus a bash shell.
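Hypothetical invocations, assuming the flags described above (the exact pattern syntax may differ):

```sh
# full pipeline, all supported languages
bash compile.sh
# English only
bash compile.sh -l EN
# English, restricted to sessions from 2009
bash compile.sh -l EN -p "2009*"
```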
## get_meps.py

It downloads the MEPs' information available on the website of the European Parliament in HTML format.

First, it gets a list of all MEPs (past and present). For each item in this list, it generates a URL and downloads the page containing the basic information and the history record of the speaker.
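In outline, the download loop looks like this (the URL template is a guess for illustration; the script derives the real URLs from the MEPs list):

```python
from urllib.request import urlopen

# hypothetical URL template, for illustration only
MEP_URL = 'http://www.europarl.europa.eu/meps/en/{id}'

def download_mep_pages(mep_ids, out_dir):
    """Fetch the HTML page of every MEP and save it to disk."""
    for mep_id in mep_ids:
        html = urlopen(MEP_URL.format(id=mep_id)).read()
        with open('{}/{}.html'.format(out_dir, mep_id), 'wb') as f:
            f.write(html)
```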
## get_proceedings.py

It downloads all the proceedings in a particular language version within a range of dates.

If a file with dates is given, it generates a URL for each date and downloads the proceedings in HTML format. If no file with dates is provided, it generates all possible dates within a range and tries to download only those URLs returning a successful response.
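A minimal sketch of the fall-back behaviour (try every date, keep only successful downloads); plenary sessions take place only on some days, so most candidate dates simply yield no document:

```python
from datetime import timedelta
from urllib.error import HTTPError
from urllib.request import urlopen

def daterange(start, end):
    """Yield every date between start and end (inclusive)."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

def fetch(url):
    """Return the page body, or None if the server has no document there."""
    try:
        return urlopen(url).read()
    except HTTPError:
        return None
```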
## langid_filter.py

Sometimes interventions remain untranslated, and thus their text appears in the original language. In order to avoid this noise, `langid_filter.py` identifies the most probable language of each text unit (namely, paragraphs) and removes those paragraphs which are not in the expected language (e.g. Bulgarian fragments found in the English version).

All paragraphs (or whichever units contain the text to be analyzed) are retrieved. Each unit is analyzed with two language identifiers available for Python: `langdetect` and `langid`. A series of heuristics are then used to exploit the output of the two analyzers (see the sketch after the list):

- If both tools identify the expected language, the text is in the language of the version at stake and is kept.
- If both tools agree on a language different from the expected one, the text is in a different language and is thus removed.
- For cases without perfect agreement, a few additional rules are formalized, which work fairly well in practice.
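A minimal sketch of the agreement heuristics, using the actual APIs of both packages; the tie-breaking rule at the end is a simplified placeholder for the real script's rules:

```python
from langdetect import detect  # pip install langdetect
import langid                  # pip install langid

def keep_paragraph(text, expected):
    """Decide whether a paragraph is in the expected language."""
    guess_a = detect(text)                 # e.g. 'en'
    guess_b, _score = langid.classify(text)
    if guess_a == guess_b:
        # perfect agreement: trust the identifiers
        return guess_a == expected
    # no perfect agreement: keep the paragraph if either tool
    # sees the expected language (conservative placeholder rule)
    return expected in (guess_a, guess_b)
```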
## meps_ie.py

It extracts MEPs' information from semi-structured HTML and yields the information in tabular format.

It reads each HTML instance and, using XPath and regular expressions, finds the relevant information, which is finally serialized as three CSV files (an extraction sketch follows the list):

- `meps.csv`
- `national_parties.csv`
- `political_groups.csv`
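A sketch of the XPath-plus-regex pattern, with made-up selectors and columns purely for illustration:

```python
import csv
import re
from lxml import html

def extract_basic_info(html_path, mep_id):
    """Pull name and birth date out of one MEP page (illustrative selectors)."""
    tree = html.parse(html_path)
    name = tree.findtext('.//title', default='').strip()
    # a regular expression picks the date out of free-form text
    match = re.search(r'\d{1,2} [A-Z][a-z]+ \d{4}', tree.xpath('string(//body)'))
    birth_date = match.group(0) if match else ''
    return [mep_id, name, birth_date]

def write_csv(rows, csv_path, header):
    """Serialize the extracted rows as one CSV file."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```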
## proceedings_xml.py

It extracts basic metadata about the parliamentary session, the structure of the text, metadata about the speakers and the source language (SL) of the utterances, and the actual text of the proceedings.

It reads each HTML file and, using XPath and regular expressions, maintains the structure of the debates and extracts metatextual information about the SL, the speaker, etc.
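In broad strokes, such an HTML-to-XML transformation looks as follows with `lxml` (tag and attribute names are illustrative, not the script's actual schema):

```python
from lxml import etree, html

def to_xml(html_path, lang):
    """Rebuild a debate as XML, keeping its paragraph structure."""
    source = html.parse(html_path)
    root = etree.Element('session', lang=lang)
    for par in source.xpath('//p'):
        text = par.text_content().strip()
        if text:
            etree.SubElement(root, 'p').text = text
    return etree.ElementTree(root)
```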
## proceedings_txt.py

It extracts all the text from the HTML proceedings.

It parses the HTML, extracts only the text, and cleans up the output a bit.
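In essence (a sketch; the script's actual clean-up rules are more involved):

```python
import re
from lxml import html

def html_to_text(path):
    """Strip all markup and collapse the leftover whitespace."""
    text = html.parse(path).getroot().text_content()
    return re.sub(r'\s+', ' ', text).strip()
```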
## translationese_filter.py

It filters interventions to get originals in a given language, translations (all of them, or only those from a given source language), and, optionally, only utterances by native speakers.

It reads proceedings in XML and outputs XML with only the relevant paragraphs and their corresponding ancestors. If filtering by native speakers (defined here as someone holding the nationality of a country which has the SL as an official language), XML enriched with MEPs' metadata is required.
It retrieves all interventions and keeps only:

- originals: interventions where `sl` == `lang`;
- translations: interventions where `sl` != `lang` and `sl` != unknown.
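A minimal sketch of the core condition, where `sl` is the source language recorded for an intervention and `lang` is the language of the proceedings version (names are illustrative):

```python
def wanted(sl, lang, mode):
    """mode is 'original' or 'translation' (simplified)."""
    if mode == 'original':
        return sl == lang
    # translations: a different, known source language
    return sl != lang and sl != 'unknown'
```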
## treetagger.py

It tokenizes, lemmatizes, PoS-tags, and splits a text into sentences.

It reads an XML file and annotates the text contained in a given element, taking care to produce well-formed XML as output.
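For a flavour of the annotation, here is a sketch using the third-party `treetaggerwrapper` package (not necessarily what `treetagger.py` itself uses; TreeTagger must be installed separately):

```python
import treetaggerwrapper  # pip install treetaggerwrapper

# assumes TreeTagger is installed and findable (e.g. via TAGDIR)
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
tags = tagger.tag_text('The European Parliament meets in Strasbourg.')
for tag in treetaggerwrapper.make_tags(tags):
    print(tag.word, tag.pos, tag.lemma)  # token, PoS tag, lemma
```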
We use the script `get_proceedings.py` to download the proceedings for each language version and date.

This is the typical URL for the proceedings of a given day (namely, May 5 2009): http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20090505+ITEMS+DOC+XML+V0//EN&language=EN
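The date and the language version are the only variable parts, so the URL can be derived from a date along these lines:

```python
def proceedings_url(day, lang='EN'):
    """Build the proceedings URL for a datetime.date and a language code."""
    return ('http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT'
            '+CRE+{:%Y%m%d}+ITEMS+DOC+XML+V0//{lang}&language={lang}'
            .format(day, lang=lang))
```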
```sh
# get proceedings for English with defaults
python get_proceedings.py -o /path/to/output/dir -l EN
# get proceedings for Spanish using a list of dates
python get_proceedings.py -o /path/to/output/dir -l ES -d dates.txt
# get proceedings for German using a range of dates between two values
python get_proceedings.py -o /path/to/output/dir -l DE -s 2000-01-01 -e 2004-07-01
```
The European Parliament website maintains a database with all Members of the European Parliament. We use the script `get_meps.py` to download the metadata of all MEPs as HTML.

```sh
python get_meps.py -o /path/to/output/dir
```
The URL serving the list of MEPs accepts a few parameters:

- `query=full`: all available data. http://www.europarl.europa.eu/meps/en/xml.html?query=full
- `filter=all`: all MEPs; alternatively, one can choose a letter of the alphabet (A, B, ...) to keep only the speakers whose family name starts with that letter. http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all or http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=C
- `leg=0`: all legislatures; an integer selects a past legislature; if no value is provided, only the current legislature is returned. http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0

Further reading on scraping: http://docs.python-guide.org/en/latest/scenarios/scrape/
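These parameters can also be combined programmatically, e.g.:

```python
from urllib.parse import urlencode

BASE = 'http://www.europarl.europa.eu/meps/en/xml.html'

def meps_list_url(query='full', letter='all', leg=0):
    """Build the MEPs list URL from the parameters described above."""
    return BASE + '?' + urlencode({'query': query, 'filter': letter, 'leg': leg})

print(meps_list_url())  # ...xml.html?query=full&filter=all&leg=0
```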
```sh
# extract plain text from the HTML proceedings
python proceedings_txt.py -i /path/to/html -o /path/to/output
# model the HTML proceedings as XML
python proceedings_xml.py -i /path/to/html -o /path/to/xml -l EN
# remove paragraphs which are not in the expected language
python langid_filter.py -i /path/to/xml -o /path/to/xml
# extract MEPs' metadata from HTML to CSV
python meps_ie.py -i /path/to/metadata/dir -o /path/to/output/dir
# add MEPs' metadata to the proceedings
python add_metadata.py -m /path/to/meps.csv -n /path/to/national_parties.csv -g /path/to/political_groups.csv -x /path/to/source/xml/dir -p "*.xml" -o /path/to/output/xml/dir
```
```sh
# English originals in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en
# all translations in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l all
# translations from English in Spanish proceedings
python translationese_filter.py -i /path/to/spanish/proceedings -o /path/to/output/dir -l en
# English originals in English proceedings by native speakers
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en -n
```