Icelandic NLP resources

This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers a clear, at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.

This list is organized by task, so some multi-functional tools and toolkits appear more than once. If you notice that a category or resource is missing, or have suggestions for improving the list, please open a GitHub pull request. If you are not familiar with pull requests, you can instead create an issue with your GitHub account.

Notable papers
Other resource collections
Corpora
European Language Grid Services
Toolkits
Tokenization and text normalization
POS tagging
Syntactic parsing
Grapheme-to-phoneme
Stress analysis
Speech synthesis (TTS)
Speech recognition (ASR)

Notable papers and reports ↑

Máltækniáætlun fyrir íslensku 2018-2022 (English version)
- The project plan for an ongoing language technology programme funded by the Icelandic Ministry of Education.
- Short paper describing the programme. Note that the programme has been postponed by a year compared to the original plan.
Risamálheild: A Very Large Icelandic Text Corpus
- Paper describing the Icelandic Gigaword Corpus, a tagged and lemmatized corpus containing over 10^9 tokens.
A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
To suggest additions to this list, please send a pull request. Alternatively, create a GitHub issue with the paper's title, a link to the PDF or book, and a short description, and we can add it to the website/markdown file.

Other resource collections ↑

CLARIN-IS
- The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
SÍM homepage
- Overview page for SÍM (the Icelandic Language Technology Consortium), which contains mirrors and descriptions for all Language Technology Programme projects.
malfong.is
- List of language technology resources, maintained by Árnastofnun.
Comprehensive list of language resources
- This list of over 100 Icelandic language technology resources was compiled by @bjarnigithub in the summer of 2021.

Corpora ↑

Talrómur
- A large public domain TTS corpus designed for research and development. Contains over 160 hours of studio-recorded prompted speech, divided between 8 speakers.
Samrómur
- An open and accessible speech recognition dataset with FLAC audio files, corresponding text and metadata.
Icelandic broadcast speech
- 193 hours of radio and TV data from the Icelandic National Broadcasting Service (RÚV).
Spjallromur
- Icelandic Conversational Speech
Kennslurómur
- Icelandic lectures with audio and corresponding text.
GreynirCorpus
- A large, parsed treebank of modern Icelandic text

European Language Grid Services ↑

Toolkits ↑

Greynir

Python 3 package capable of syntactic parsing, lemmatization, POS tagging, noun phrase inflection and more.
The GitHub repo for this project
Developed by Miðeind ehf.

IceNLP

Java toolkit which does tokenization, POS tagging, lemmatization, parsing and NER
Developed by Hrafn Loftsson

LVL-tts-frontend

TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
It contains a pronunciation dictionary, a Sequitur G2P model, a stress analysis component and more. Unfortunately, it does not include any documentation.
- Developed by Anna Björk Nikulásdóttir at LVL

Tokenization and text normalization ↑

Icelandic tokenizer
Textahaukur - text normalization toolkit
- Development appears to be suspended, and the project describes itself as not yet functional.
Regína normalizer
- Regex-based text normalization in Python. Currently in the early stages of development.
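
For readers unfamiliar with regex-based text normalization, the general idea can be sketched as below. This is a minimal illustration using only the Python standard library; it is not code from Regína or any other tool listed here. The abbreviation table and the `normalize` function are hypothetical names chosen for this example.

```python
import re

# Hypothetical abbreviation table, for illustration only.
# It is NOT taken from Regína or any other listed tool.
ABBREVIATIONS = {
    "t.d.": "til dæmis",
    "o.s.frv.": "og svo framvegis",
    "þ.e.": "það er",
}

# One alternation over all abbreviations, longest first so that a longer
# abbreviation is preferred over a shorter one that shares a prefix.
# The lookbehind stops a match from starting in the middle of a word.
_ABBR_RE = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)


def normalize(text: str) -> str:
    """Expand known abbreviations, then collapse runs of whitespace."""
    expanded = _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    return re.sub(r"\s+", " ", expanded).strip()


if __name__ == "__main__":
    print(normalize("Ávextir,  t.d. epli og perur o.s.frv."))
```

A real normalizer for TTS would also need to handle numbers, dates, case-sensitive abbreviations and context-dependent inflection, which a lookup table like this does not attempt.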

POS tagging ↑

Syntactic parsing ↑

Neural parsing pipeline for Icelandic
- The GitHub repo for this project
Greynir, see above
IceNLP, see above

Grapheme-to-phoneme ↑

Stress analysis ↑

LVL-tts-frontend performs stress analysis

Speech synthesis ↑

Speech recognition ↑

Ice-ASR
Alþingi
- Just the recipe
Samromur ASR
- Contains a vanilla (base) recipe, subword modelling, and specialized recipes for children and adolescents.
alignment and segmentation
- Scripts to prepare RÚV TV data for alignment and segmentation to make an ASR dataset
Tiro Speech Core
Tal

Our CADIA-LVL works in progress

You can also follow our many works in progress at LVL on GitHub (https://github.com/cadia-lvl) and on our Facebook page (https://www.facebook.com/languageandvoice/).
