
EuroParl

This is a complete pipeline to create a comparable/parallel corpus made of European Parliament's proceedings enriched with speakers' metadata.

This pipeline has been tested on macOS Sierra; it should also work on other UNIX-like systems. Python 3 is required for almost every script, and some scripts need additional Python modules and/or external tools. Check the specific requirements for each script.

Related projects:

Contents

The pipeline

You can find the complete pipeline to compile the EuroParl corpus in compile.sh.

  1. Download proceedings in HTML with get_proceedings.py
  2. Download MEPs metadata in HTML with get_meps.py
  3. Extract MEPs' information in a CSV file with meps_ie.py
  4. Model proceedings as XML with proceedings_xml.py
  5. Filter out text units not in the expected language (e.g. Bulgarian text in the English version) with langid_filter.py
  6. Add MEPs metadata to proceedings with add_metadata.py
  7. Add sentence boundaries (if needed) with add_sentences.py
  8. Annotate token, lemma, PoS with TreeTagger with treetagger.py
  9. Separate originals from translations and even filter by native speakers with translationese_filter.py

add_metadata.py

What

It adds MEPs' metadata to interventions in the proceedings. Be aware that not all speakers addressing the European Parliament are MEPs: there are also members of other European institutions, representatives of national institutions, guests, etc. For those speakers there is currently no metadata beyond the information extracted from the proceedings themselves.

How

It reads three CSV files containing the metadata: meps.csv, national_parties.csv, and political_groups.csv (the files produced by meps_ie.py).

For each proceeding in XML it retrieves all interventions whose speaker is an MEP. Then it adds the relevant speaker's metadata to the intervention. By relevant we mean the information that was valid on the day of the session.
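A minimal sketch of the lookup, assuming a meps.csv keyed by a speaker id and an intervention element carrying that id (element, attribute, and column names are illustrative, not necessarily those used by add_metadata.py):

import csv
from lxml import etree

def load_meps(path):
    # read meps.csv into a dict keyed by MEP id (column names are assumptions)
    with open(path, newline='', encoding='utf-8') as f:
        return {row['id']: row for row in csv.DictReader(f)}

def add_metadata(xml_path, meps, out_path):
    tree = etree.parse(xml_path)
    date = tree.getroot().get('date')  # session date, assumed to be stored on the root
    for intervention in tree.iter('intervention'):  # element name is an assumption
        mep = meps.get(intervention.get('speaker_id'))
        if mep is None:
            continue  # not an MEP: only the information from the proceedings is available
        # in the real script, party and group affiliations are also matched against
        # the session date so that only information valid on that day is added
        intervention.set('name', mep['name'])
        intervention.set('nationality', mep['nationality'])
    tree.write(out_path, encoding='utf-8', xml_declaration=True)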

Requirements

add_sentences.py

What

It splits text contained in a given XML element into sentences.

How

Using the NLTK Punkt sentence tokenizer.

For each XML file, it extracts all elements containing text. Each unit is passed to the tokenizer, which returns a list of sentences; these are converted into subelements of the element that contained the text.
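A minimal sketch of the idea, using NLTK's Punkt-based sent_tokenize together with lxml (element names are illustrative):

import nltk
from lxml import etree

def add_sentences(xml_path, out_path, tag='p'):
    # requires the Punkt model: nltk.download('punkt')
    tree = etree.parse(xml_path)
    for elem in tree.iter(tag):                    # elements assumed to contain running text
        text, elem.text = elem.text or '', None
        for sentence in nltk.sent_tokenize(text):  # Punkt sentence tokenizer under the hood
            s = etree.SubElement(elem, 's')        # each sentence becomes a subelement
            s.text = sentence
    tree.write(out_path, encoding='utf-8', xml_declaration=True)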

Requirements

compile.sh

What

It runs the whole pipeline in one shot.

How

It is a shell script running a list of commands sequentially. If no arguments are provided, it runs the full pipeline for all the supported languages.

One can provide the -l language argument to run the pipeline only for a particular language, and the -p pattern argument to restrict processing to a given year, month, day...
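For example (the exact values for -l and -p depend on the supported language codes and on how compile.sh interprets the pattern; treat them as illustrative):

# run the full pipeline for all supported languages
bash compile.sh
# run the pipeline only for English, restricted to sessions from 2009
bash compile.sh -l EN -p "2009*"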

Requirements

All the requirements of the individual scripts listed in this document, and a bash shell.

get_meps.py

What

It downloads the MEPs' information available on the European Parliament website in HTML format.

How

First, it gets a list of all MEPs (past and present).

For each item in this list, it generates a URL and downloads the page, which contains the speaker's basic information and history record.

Requirements

get_proceedings.py

What

It downloads all the proceedings in a particular language version within a range of dates.

How

If a file with dates is given, it generates a URL for each date and downloads the proceedings in HTML format. If no file with dates is provided, it generates all possible dates within a range and tries to download only those URLs returning a successful response.

Requirements

langid_filter.py

What

Sometimes interventions remain untranslated and thus their text appears in the original language. In order to avoid this noise, langid_filter.py identifies the most probable language of each text unit (namely paragraphs) and removes those paragraphs which are not in the expected language (e.g. Bulgarian fragments found in the English version).

How

All paragraphs (or units containing the text to be analyzed) are retrieved.

Each unit is analyzed with two language identifiers available for Python: langdetect and langid. A series of heuristics is then used to exploit the output of the two analyzers:

If both tools identify the expected language, the text is assumed to be in the language of the version at stake.

If both tools agree on a language different from the expected one, the text is assumed to be in a different language and is therefore removed.

For cases where the two tools do not fully agree, a few additional rules are applied, which work fairly well in practice.
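A minimal sketch of the agreement check, assuming the langdetect and langid packages; the fallback rules in the actual script are more elaborate:

import langid
from langdetect import detect

def keep_paragraph(text, expected):
    # return True if the paragraph is most likely in the expected language
    guess_a = detect(text)                    # langdetect, e.g. 'en'
    guess_b, _score = langid.classify(text)   # langid, e.g. ('en', -54.3)
    if guess_a == guess_b == expected:
        return True                           # both agree on the expected language
    if guess_a == guess_b != expected:
        return False                          # both agree on another language: drop it
    # no perfect agreement: placeholder for the script's additional rules
    return expected in (guess_a, guess_b)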

Requirements

meps_ie.py

What

It extracts MEPs' information from semi-structured HTML and outputs it in tabular format.

How

It reads each HTML instance and, using XPath and regular expressions, finds the relevant information, which is finally serialized as three CSV files: meps.csv, national_parties.csv, and political_groups.csv.
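A minimal sketch of the extraction step, with purely illustrative XPath expressions and field names (the real ones depend on the markup of the MEP pages):

import csv
from lxml import html

def extract_mep(path):
    page = html.parse(path)
    # placeholder XPath expressions, not the ones used by meps_ie.py
    name = page.xpath('//h1/text()')[0].strip()
    nationality = page.xpath('//span[@class="nationality"]/text()')
    return {'name': name, 'nationality': nationality[0].strip() if nationality else ''}

with open('meps.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'nationality'])
    writer.writeheader()
    writer.writerow(extract_mep('33569_history.html'))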

Requirements

proceedings_xml.py

What

It extracts basic metadata about the parliamentary session, the structure of the text, information about the speakers and the source language of the utterances, and the actual text of the proceedings.

How

It reads each HTML file and, using XPath and regular expressions, preserves the structure of the debates and extracts metatextual information about the source language (SL), the speaker, etc.

Requirements

proceedings_txt.py

What

It extracts all the text in the HTML proceedings.

How

It parses the HTML, extracts only the text, and lightly cleans the output.
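A minimal sketch of the approach, using lxml's text extraction; the actual cleanup performed by the script may differ:

from lxml import html

def html_to_text(path):
    page = html.parse(path)
    text = page.getroot().text_content()               # strip all markup, keep only the text
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)   # drop empty lines as light cleanup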

Requirements

translationese_filter.py

What

It filters interventions to get: only originals, only translations, or only translations from a particular source language; optionally, the output can also be restricted to native speakers.

How

It reads proceedings in XML and outputs XML with only the relevant paragraphs and their corresponding ancestors.

If filtering by native speakers (defined here as speakers holding the nationality of a country where the SL is an official language), XML enriched with MEPs' metadata is required.

It retrieves all interventions and keeps only the following (see the sketch after the list):

  1. originals: if sl == lang
  2. translations: if sl != lang and sl != unknown
  3. translations from language xx
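A minimal sketch of the decision logic, where sl is the source language of an intervention, lang the language of the proceedings version, and xx the requested source language; attribute handling and the exact reading of rule 3 are assumptions:

def keep(sl, lang, mode, xx=None):
    # decide whether an intervention passes the selected filter
    if mode == 'originals':
        return sl == lang
    if mode == 'translations':
        return sl != lang and sl != 'unknown'
    if mode == 'translations_from':
        return sl == xx and sl != lang
    return False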

Requirements

treetagger.py

What

It tokenizes, lemmatizes, PoS-tags, and splits a text into sentences.

How

It reads an XML file and annotates the text contained in a given element. It takes care to produce well-formed XML as output.
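A minimal sketch of the annotation step, assuming the treetaggerwrapper package and a local TreeTagger installation; treetagger.py may call TreeTagger in a different way:

import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')            # needs TreeTagger installed locally
tags = treetaggerwrapper.make_tags(tagger.tag_text('This is a test.'))
for tag in tags:
    print(tag.word, tag.pos, tag.lemma)                        # token, PoS tag, lemma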

Requirements

Scraping European Parliament's proceedings

Get proceedings in HTML

We use the script get_proceedings.py to:

  1. generate a range of dates, or read them from a file,
  2. for each date,
    1. generate a URL,
    2. request it,
    3. if it exists, download the document,
    4. else, proceed with the next date.

This is the typical URL for the proceedings of a given day (namely, May 5 2009): http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20090505+ITEMS+DOC+XML+V0//EN&language=EN
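A minimal sketch of how such a URL can be built and checked for a given date and language version, using the requests package; argument handling in get_proceedings.py may differ:

import requests

URL = ('http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT'
       '+CRE+{date}+ITEMS+DOC+XML+V0//{lang}&language={lang}')

def fetch(date, lang):                            # date as YYYYMMDD, lang as e.g. 'EN'
    response = requests.get(URL.format(date=date, lang=lang))
    if response.ok:                               # only keep dates with actual proceedings
        return response.text
    return None

html = fetch('20090505', 'EN')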

Usage

# get proceedings for English with defaults
python get_proceedings.py -o /path/to/output/dir -l EN
# get proceedings for Spanish using a list of dates
python get_proceedings.py -o /path/to/output/dir -l ES -d dates.txt
# get proceedings for German using a range of dates between two values
python get_proceedings.py -o /path/to/output/dir -l DE -s 2000-01-01 -e 2004-07-01

Scraping MEPs' information

The European Parliament website maintains a database with all Members of the European Parliament.

Get the metadata in HTML

We use the script get_meps.py to download as HTML the metadata of all MEPs.

  1. The script retrieves an XML file containing a list of all MEPs' full names and unique IDs: http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0;
  2. for each MEP, it generates a URL to the actual HTML file containing the metadata: http://www.europarl.europa.eu/meps/en/33569/33569_history.html;
  3. requests it;
  4. downloads it and proceeds with the next.
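A minimal sketch of steps 1 to 4, using requests and lxml; the element names in the MEP list and the output file naming are assumptions:

import requests
from lxml import etree

LIST_URL = 'http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0'
MEP_URL = 'http://www.europarl.europa.eu/meps/en/{mep_id}/{mep_id}_history.html'

root = etree.fromstring(requests.get(LIST_URL).content)     # XML list of all MEPs
for mep in root.iter('mep'):                                 # element name is an assumption
    mep_id = mep.findtext('id')
    page = requests.get(MEP_URL.format(mep_id=mep_id))       # per-MEP metadata page
    with open('{}_history.html'.format(mep_id), 'wb') as f:
        f.write(page.content)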

Usage

python get_meps.py -o /path/to/output/dir

Some notes on querying the database

On web scraping with Python

http://docs.python-guide.org/en/latest/scenarios/scrape/

Extracting text from HTML proceedings

Usage

python proceedings_txt.py -i /path/to/html -o /path/to/output

Transforming HTML proceedings into XML

Usage

python proceedings_xml.py -i /path/to/html -o /path/to/xml -l EN

Filtering out text not in the expected language

Usage

python langid_filter.py -i /path/to/xml -o /path/to/xml

Extracting MEPs information from HTML

Usage

python meps_ie.py -i /path/to/metadata/dir -o /path/to/output/dir

Adding MEPs' metadata to XML proceedings

Usage

python add_metadata.py -m /path/to/meps.csv -n /path/to/national_parties.csv -g /path/to/political_groups.csv -x /path/to/source/xml/dir -p "*.xml" -o /path/to/output/xml/dir

Filtering text according to translationese criteria: originals, translations, and restriction to native speakers

Usage

# English originals in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en

# All translations in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l all

# Translations from English in Spanish proceedings
python translationese_filter.py -i /path/to/spanish/proceedings -o /path/to/output/dir -l en

# English originals in English proceedings by native speakers
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en -n