The project "NoTraM - Norwegian Transformer Model" is owned by the National Library of Norway.
The Norwegian Colossal Corpus is an open text corpus comparable in size and quality with available datasets for English.
The core of the corpus comes from a unique digitisation project started in 2006, whose goal has been to digitise and store all content ever published in Norwegian. In addition, we have added several other public sources of Norwegian text. Details about the sources, and about how they were built, are available in the Colossal Norwegian Corpus Description.
Corpus | License | Size | Words | Documents | Avg words per doc |
---|---|---|---|---|---|
Library Newspapers | CC0 1.0 | 14.0 GB | 2,019,172,625 | 10,096,424 | 199 |
Library Books | CC0 1.0 | 6.2 GB | 861,465,907 | 24,253 | 35,519 |
LovData CD | NLOD 2.0 | 0.4 GB | 54,923,432 | 51,920 | 1,057 |
Government Reports | NLOD 2.0 | 1.1 GB | 155,318,754 | 4,648 | 33,416 |
Parliament Collections | NLOD 2.0 | 8.0 GB | 1,301,766,124 | 9,528 | 136,625 |
Public Reports | NLOD 2.0 | 0.5 GB | 80,064,396 | 3,365 | 23,793 |
Målfrid Collection | NLOD 2.0 | 14.0 GB | 1,905,481,776 | 6,735,367 | 282 |
Newspapers Online | CC BY-NC 2.0 | 3.7 GB | 541,481,947 | 3,695,943 | 146 |
Wikipedia | CC BY-SA 3.0 | 1.0 GB | 140,992,663 | 681,973 | 206 |
The easiest way to access the corpus is to download it from HuggingFace. The dataset page explains in detail how the corpus can be used. It also gives extensive information about the content of the corpus, how to filter out certain parts of it, and how it can be combined with other Norwegian datasets such as MC4 and OSCAR.
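As a minimal sketch of the download-and-filter workflow described above, the snippet below streams the corpus with the `datasets` library and keeps only selected document types. The dataset ID `NbAiLab/NCC`, the `doc_type` field, and the prefix values used in the usage note are assumptions that do not come from this README; check the dataset page for the actual names. Depending on your `datasets` version, the `load_dataset` call may need adjusting.

```python
from typing import Iterable, Iterator

# Assumption: Hub ID of the corpus; verify it on the dataset page.
DATASET_ID = "NbAiLab/NCC"


def filter_by_doc_type(records: Iterable[dict], prefixes: Iterable[str]) -> Iterator[dict]:
    """Yield records whose (assumed) 'doc_type' field starts with any prefix."""
    wanted = tuple(prefixes)
    for record in records:
        # Records without a doc_type field are skipped.
        if record.get("doc_type", "").startswith(wanted):
            yield record


def stream_ncc(prefixes: Iterable[str]) -> Iterator[dict]:
    """Stream the training split and keep only the wanted document types."""
    # Imported lazily so the filter can be used without the library installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole corpus up front.
    dataset = load_dataset(DATASET_ID, split="train", streaming=True)
    return filter_by_doc_type(dataset, prefixes)
```

A call such as `next(stream_ncc(["wikipedia"]))` would then return the first matching document, assuming a document type with that prefix exists in the corpus.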
In addition to the corpus itself, we provide a set of scripts for creating and cleaning corpus files. We also provide a Step-by-Step Guide on how to create a corpus file from your own data sources, and a description of how to create and upload a HuggingFace dataset. Other tools and guides can be found on our Guides Page. We have made all our software available for anyone to use. Most of it is written in Python 3.
The following pretrained models are available. These models have to be finetuned on a specific task. The finetuning is straightforward if you have a dataset available. Please take a look at the Colabs below for sample code. Often you will only need to change a couple of lines of code to adapt it to your task.

Name | Description | Model |
---|---|---|
nb‑bert‑base | The original model, based on the same structure as the BERT Cased multilingual model. Although it is trained mainly on Norwegian text, it retains some of the multilingual capabilities. In particular, it scores well on Swedish, Danish and English. | 🤗 Model |
nb‑bert‑large | The model is based on the BERT-large-uncased architecture. For classification tasks, this model will give the best results. Since it is uncased, it might not give as good results on NER tasks. It might also require more processing power, both for finetuning and for inference. | 🤗 Model |
These models are finetuned on a specific task, and can be used directly.
Name | Description | Model |
---|---|---|
nb‑bert‑base‑mnli | The nb-bert-base-model finetuned on the mnli task. See model page for more details. | 🤗 Model |
saattrupdan/nbailab‑base‑ner‑scandi | This NER model is trained by Dan Saattrup on top of our nb-bert-base. It has been fine-tuned on the concatenation of DaNE, NorNE, SUC 3.0 and the Icelandic and Faroese parts of the WikiANN dataset. The model yields better results on Norwegian NER tasks than the models finetuned only on Norwegian. See model page for more details. | 🤗 Model |
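
Because these models are already finetuned, they can be called through a `transformers` pipeline without further training. The sketch below uses the MNLI model for zero-shot classification. The Hub ID `NbAiLab/nb-bert-base-mnli` is inferred from the model name in the table, and the Norwegian hypothesis template is an assumption; consult the model page for the recommended settings.

```python
from typing import List

# Assumption: Hub ID inferred from the model name in the table above.
MNLI_MODEL_ID = "NbAiLab/nb-bert-base-mnli"

# Assumption: a Norwegian template ("This example is {}."); the pipeline
# fills in each candidate label in place of the braces.
HYPOTHESIS_TEMPLATE = "Dette eksempelet er {}."


def classify_zero_shot(text: str, candidate_labels: List[str]) -> str:
    """Return the candidate label the MNLI model ranks highest for the text."""
    # Imported lazily; requires the transformers library and a model download.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=MNLI_MODEL_ID)
    result = classifier(
        text,
        candidate_labels=candidate_labels,
        hypothesis_template=HYPOTHESIS_TEMPLATE,
    )
    # The pipeline returns labels sorted by descending score.
    return result["labels"][0]
```

For example, `classify_zero_shot("Folkehelseinstituttet mener at barn bør vaksineres.", ["helse", "politikk", "sport"])` returns whichever of the three labels the model scores highest.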
The NB-BERT-Base model is thoroughly tested in the article cited below. Here are some of our results:

Task | mBERT-base | NB-BERT-base |
---|---|---|
POS - NorNE - Bokmål | 98.32 | 98.86 |
POS - NorNE - Nynorsk | 98.08 | 98.77 |
NER - NorNE - Bokmål | 81.75 | 90.03 |
NER - NorNE - Nynorsk | 84.69 | 87.67 |
Classification - ToN - Frp/SV | 73.75 | 77.49 |
Sentence-level binary sentiment classification | 73.27 | 84.04 |
The original models need to be fine-tuned for the target task. A typical task is classification, and it is then recommended that you train a fully connected top layer for this specific task. The following notebooks allow you both to test the model and to train your own specialised model on top of ours. In particular, the notebook that trains a sentiment classification model can easily be adapted to any NLP classification task.
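
The idea of training a fully connected top layer can be sketched as follows. This is not the code from the notebooks, only an illustration under stated assumptions: the Hub ID `NbAiLab/nb-bert-base` is inferred from the model name, and freezing the encoder is an optional choice exposed through the hypothetical `freeze_encoder` flag. `AutoModelForSequenceClassification` adds a randomly initialised classification layer on top of the pretrained encoder.

```python
from typing import List

# Assumption: Hub ID inferred from the model name used in this README.
MODEL_ID = "NbAiLab/nb-bert-base"


def build_classifier(num_labels: int, freeze_encoder: bool = True):
    """Load the encoder with a new classification layer on top."""
    # Imported lazily; requires transformers and torch.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    if freeze_encoder:
        # Keep the pretrained weights fixed so only the top layer is trained.
        for parameter in model.base_model.parameters():
            parameter.requires_grad = False
    return tokenizer, model


def trainable_parameter_names(model) -> List[str]:
    """List the parameters that will still be updated during training."""
    return [
        name
        for name, parameter in model.named_parameters()
        if parameter.requires_grad
    ]
```

After `tokenizer, model = build_classifier(num_labels=2)`, calling `trainable_parameter_names(model)` is a quick way to confirm that only the classification layer remains trainable before passing the model to a training loop. Setting `freeze_encoder=False` fine-tunes the whole network instead, which is slower but often more accurate.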
The models published in this repository are intended for general-purpose use and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (The National Library of Norway) be liable for any results arising from the use made by third parties of these models.
If you use our models or our corpus, please cite our article:
@inproceedings{kummervold-etal-2021-operationalizing,
title = {Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model},
author = {Kummervold, Per E and
De la Rosa, Javier and
Wetjen, Freddy and
Brygfjeld, Svein Arne},
booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
year = {2021},
address = {Reykjavik, Iceland (Online)},
publisher = {Link{\"o}ping University Electronic Press, Sweden},
url = {https://aclanthology.org/2021.nodalida-main.3},
pages = {20--29},
abstract = {In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokm{\aa}l and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.},
}