The Open Parallel Corpus

website: http://opus.nlpl.eu
github: https://github.com/Helsinki-NLP/OPUS
contact: opus-project AT helsinki DOT fi

This repository contains information about the released parallel corpora and derived data sets in OPUS, the open collection of parallel corpora. Each sub-directory in corpus/ corresponds to one specific resource, and its released versions and data sets are organized as corpus/name/version.
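The corpus/name/version layout described above can be walked with a few lines of Python. This is a minimal sketch, assuming a local clone of the repository; the function name list_releases is hypothetical and not part of any OPUS tool, and versions are sorted as plain strings, which need not match release order.

```python
from pathlib import Path
from typing import Dict, List


def list_releases(repo_root: Path) -> Dict[str, List[str]]:
    """Map each resource under corpus/ to the names of its version directories.

    Hypothetical helper: it only assumes the corpus/name/version layout
    described in this README and ignores plain files at either level.
    """
    releases: Dict[str, List[str]] = {}
    corpus_dir = repo_root / "corpus"
    if not corpus_dir.is_dir():
        # Not a clone of the repository (or corpus/ is missing): nothing to list.
        return releases
    for resource in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
        # Lexicographic sort of directory names, not chronological release order.
        releases[resource.name] = sorted(
            v.name for v in resource.iterdir() if v.is_dir()
        )
    return releases


if __name__ == "__main__":
    # Run from the root of a local clone to print one line per resource.
    for name, versions in list_releases(Path(".")).items():
        print(name, ", ".join(versions))
```

Run from the root of a clone, the script prints each resource name followed by its version directories; outside a clone it prints nothing.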

The OPUS ecosystem

Tools for finding and processing OPUS data sets:

OpusTools - Python library and tools for accessing and processing OPUS data [pip]
OpusTools-perl - Perl scripts for processing OPUS data
OPUS-API - API for searching OPUS resources [live API]
OpusFilter - a toolbox for filtering and compiling parallel corpora [doc] [pip]
OPUS-search - online search in OPUS data [Europarl v7] [Europarl v3] [OpenSubtitles v1] [OpenSubtitles v2018] [EUconst]
OPUS-dic - online dictionary based on word alignments

Managing OPUS:

OPUS-ingest - recipes for ingesting/importing data to OPUS
OPUS-website - OPUS website and corpus sample files
OPUS-admin - scripts and recipes for admin tasks (restricted access)
OPUS-repository - parallel data management system [frontend] [backend] [live demo]
OPUS-ISA - experimental sentence alignment interface [live demo]

Machine translation with OPUS-MT:

Opus-MT - OPUS-MT web service setup
OPUS-MT-train - scripts and recipes for training OPUS-MT models
OPUS-translator - OPUS-MT web interface [live demo]
OPUS-MT-testsets - a collection of MT benchmarks
OPUS-MT-leaderboard - OPUS-MT evaluation scores and leaderboards [live demo]
OPUS-MT-map - interactive map of OPUS-MT language coverage [live demo]
OPUS-MT-app - desktop app for local translation with OPUS-MT (fork of translateLocally)
OPUS-CAT - OPUS-MT integration in CAT tools

Citing

Please cite the following LREC 2012 paper when using OPUS, and also acknowledge the corpus-specific references given in each resource's information and documentation.

@InProceedings{TIEDEMANN12.463,
  author = {Jörg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in {OPUS}},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
}

Links to other resources

mtdata - a library for retrieving MT datasets
LanguageCodes - Perl modules for managing language codes
eflomal - a tool for efficient word alignment with pre-trained priors from OPUS
the Tatoeba translation challenge - a comprehensive MT dataset compiled from OPUS and Tatoeba
wiki back-translations - over a billion automatically translated sentences
OPUS-SPM - pre-trained SentencePiece models from OPUS data

Acknowledgements

OPUS and related resources and tools have been partially supported by various projects such as

LetsMT! - A Platform for Online Sharing of Training Data and Building User Tailored Machine Translation (EU ICT PSP)
MeMAD - Methods for Managing Audiovisual Data (EU Horizon 2020)
NLPL - the Nordic Language Processing Laboratory (NeIC)
EOSC-nordic - the European Open Science Cloud within the Nordic and Baltic countries (EU Horizon 2020)
ELG - the European Language Grid (EU Horizon 2020)
FoTran - Found in Translation (EU ERC)
HPLT - High-Performance Language Technologies (EU Horizon)

OPUS is hosted by CSC, the IT Center for Science in Finland, and draws heavily on the HPC resources provided by CSC. OPUS is also part of NLPL, the Nordic Language Processing Laboratory. Last but not least, OPUS would not be possible without the many contributions from the community, including aligned data sets and tools for creating and processing parallel corpora.