The Open Parallel Corpus
This repository contains information about the parallel corpora and derived data
sets released in OPUS, the open collection of parallel corpora. Each sub-directory in corpus/
corresponds to one specific resource, with its released versions and data sets
organized according to the format corpus/name/version.
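The corpus/name/version layout described above can be walked with a few lines of Python. The following is only a minimal sketch, assuming a local checkout of this repository: the function name `list_releases` and its `root` parameter are hypothetical and not part of any OPUS tool, and the sketch treats every second-level directory as a released version without inspecting its contents.

```python
from pathlib import Path


def list_releases(root):
    """Return sorted (name, version) pairs found under root/corpus/name/version.

    Hypothetical helper for illustration; not part of the OPUS tooling.
    """
    corpus_dir = Path(root) / "corpus"
    releases = []
    if not corpus_dir.is_dir():
        # No corpus/ directory at this location: nothing to report.
        return releases
    for name_dir in corpus_dir.iterdir():
        if not name_dir.is_dir():
            continue  # skip plain files that sit next to the resources
        for version_dir in name_dir.iterdir():
            if version_dir.is_dir():
                releases.append((name_dir.name, version_dir.name))
    return sorted(releases)
```

Calling `list_releases(".")` from the top of a checkout would return one tuple per resource and version directory that exists there.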
Tools for finding and processing OPUS data sets:
Managing OPUS:
Machine translation with OPUS-MT:
Please cite the following LREC 2012 paper when using OPUS, and also acknowledge the corpus-specific references listed in the information and documentation for each resource:
@InProceedings{TIEDEMANN12.463,
author = {Jörg Tiedemann},
title = {Parallel Data, Tools and Interfaces in {OPUS}},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
}
OPUS and related resources and tools have been partially supported by various projects such as
OPUS is hosted by CSC, the IT Center for Science in Finland, and draws heavily on the HPC resources provided by CSC. OPUS is also part of NLPL, the Nordic Language Processing Laboratory. Last but not least, OPUS would not be possible without the many contributions from the community, including aligned data sets and tools for creating and processing parallel corpora.