jankounchained closed this pull request 5 months ago
Looks good, Jan - feel free to merge this in. Will you run it on the whole NCC as well?
Already did :)
If you mean the train split of the dataset on Hugging Face:
`dfm-data/pre-training/ncc/documents/ncc.jsonlgz`
Wonderful!
Added a minimally cleaned Norwegian Colossal Corpus.
The processed files are in `dfm-data/pre-training/ncc`.
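For reference, a shard in that folder can be streamed like this (a minimal sketch: the local file path and the `text` field name are assumptions, not the actual schema):

```python
import gzip
import json

def iter_docs(path):
    """Stream documents from a .jsonl.gz shard, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# hypothetical usage against a local copy of one shard:
# for doc in iter_docs("ncc.jsonl.gz"):
#     process(doc["text"])  # assumes a "text" field
```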
**duplicates**
The data already comes with paragraph-level deduplication, according to the paper. I'm still running our deduplication as a formality, so that the `bff_duplicate_paragraph_spans` folder isn't empty.

**language filtering**
A language tag from fastText exists in the metadata. I decided not to use it in the end, because the share of tokens in non-Nordic languages seems quite small to me. @KennethEnevoldsen do you agree?

- Non-Germanic languages make up about 1.3% of the corpus.
- Non-North-Germanic languages make up 8.6% of tokens.
- The difference is mostly English, at 6.7% of tokens. For comparison, Danish accounts for 13.7% of tokens.
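The shares above were computed along these lines (a hedged sketch: the `lang_fasttext` field name, the Nordic tag set, and whitespace tokenization are assumptions, not the exact script):

```python
from collections import Counter

NORDIC = {"no", "nn", "da", "sv", "is"}  # assumed fastText tags for North Germanic

def token_shares(docs, lang_key="lang_fasttext"):
    """Tally whitespace-token counts per language tag and return shares."""
    counts = Counter()
    for doc in docs:
        counts[doc.get(lang_key, "unk")] += len(doc["text"].split())
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# toy example with two documents
docs = [
    {"text": "dette er norsk tekst", "lang_fasttext": "no"},
    {"text": "this is english", "lang_fasttext": "en"},
]
shares = token_shares(docs)
non_nordic = sum(s for lang, s in shares.items() if lang not in NORDIC)
```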
**ocr**
A big part of the corpus comes from OCR, and the authors haven't done much about OCR errors. Quote: "We have not seen any indication that the OCR errors negatively impacted the performance." However, they also note that OCR quality is time-dependent: better or worse depending on a document's age.
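If OCR noise ever does become a problem, a crude character-level heuristic could flag the worst documents. This is purely illustrative and not part of this PR's pipeline; the punctuation whitelist is an arbitrary choice:

```python
def ocr_noise_score(text):
    """Crude proxy for OCR damage: share of characters that are neither
    letters, whitespace, nor common punctuation. Higher = noisier."""
    if not text:
        return 1.0
    ok = sum(ch.isalpha() or ch.isspace() or ch in ".,;:!?()-'\"" for ch in text)
    return 1 - ok / len(text)

clean = "Dette er en vanlig setning."
noisy = "D3tte 3r 3n v@nl1g s3tn1ng ### ~~~"
```

Documents above some score threshold could then be dropped or routed to heavier cleaning.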
**licensing**
Online newspapers (7.6% of tokens) have a non-commercial clause (CC BY-NC 2.0). The Wikipedia dump (2% of tokens) has a share-alike clause (CC BY-SA 3.0). Depending on where the project goes, these may need to be dropped? The rest of the data is released under very permissive licenses.
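If the NC/SA subsets do need to be dropped, filtering on a license tag would be straightforward (a sketch; the `license` field name and the exact tag strings are assumptions about the metadata, not the actual NCC schema):

```python
# License tags whose documents we might have to exclude (illustrative values)
RESTRICTED = {"CC BY-NC 2.0", "CC BY-SA 3.0"}

def drop_restricted(docs, license_key="license"):
    """Keep only documents whose license tag is not in the restricted set."""
    return [d for d in docs if d.get(license_key) not in RESTRICTED]

# toy example
docs = [
    {"text": "newspaper article", "license": "CC BY-NC 2.0"},
    {"text": "public document", "license": "NLOD 2.0"},
]
kept = drop_restricted(docs)
```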