Icelandic NLP resources
This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers an at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.
This list is categorized by task for clarity. As a result, some multi-functional tools and toolkits may appear more than once.
If you notice that a category or resource is missing, or have suggestions on how to improve this list, please open a GitHub pull request. If you do not know how to make pull requests, you can instead create an issue with your GitHub account.
Contents
Notable papers and reports
Other resource collections
- CLARIN-IS
- The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
- SÍM homepage
- Overview page for SÍM (the Icelandic Language Technology Consortium), which contains mirrors and descriptions for all Language Technology Programme projects.
- malfong.is
- List of language technology resources, maintained by Árnastofnun.
- Comprehensive list of language resources
- This list of over 100 Icelandic language technology resources was compiled by @bjarnigithub in the summer of 2021.
Corpora
- Talrómur
- A large public domain TTS corpus designed for research and development. Contains over 160 hours of studio-recorded prompted speech, divided between 8 speakers.
- Samrómur
- An open and accessible speech recognition dataset with FLAC audio files, corresponding text and metadata.
- Icelandic broadcast speech
- 193 hours of radio and TV data from the Icelandic National Broadcasting Service (RÚV).
- Spjallromur
- Icelandic Conversational Speech
- Kennslurómur
- Icelandic lectures with audio and corresponding text.
- GreynirCorpus
- A large, parsed treebank of modern Icelandic text.
European Language Grid Services
Toolkits
- Java toolkit that performs tokenization, POS tagging, lemmatization, parsing, and NER
- Developed by Hrafn Loftsson
- TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
- It contains a pronunciation dictionary, a Sequitur G2P model, a stress analysis component, and more. Unfortunately, it does not include any documentation.
- Developed by Anna Björk Nikulásdóttir at LVL
Tokenization and text normalization
POS tagging
Syntactic parsing
Grapheme-to-phoneme
Stress analysis
Speech synthesis
Speech recognition
Our CADIA-LVL works in progress
You can also follow our many works in progress at LVL on our GitHub: https://github.com/cadia-lvl \
Facebook page: https://www.facebook.com/languageandvoice/