jankounchained closed this pull request 5 months ago
Looks good, Jan - feel free to merge this in. Will you run it on the whole NCC as well?
Already did :)
If you mean the train split of the dataset on Hugging Face:
`dfm-data/pre-training/ncc/documents/ncc.jsonlgz`
Wonderful!
Added a minimally cleaned Norwegian Colossal Corpus.
The processed files are in `dfm-data/pre-training/ncc`.
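For reference, a shard in that folder can be streamed like this (a minimal sketch: the local file path and the `text` field name are assumptions, not the actual schema):

```python
import gzip
import json

def iter_docs(path):
    """Stream documents from a .jsonl.gz shard, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# hypothetical usage against a local copy of one shard:
# for doc in iter_docs("ncc.jsonl.gz"):
#     process(doc["text"])  # assumes a "text" field
```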
**duplicates**
The data already comes with paragraph-level deduplication, according to the paper. I'm still running our deduplication as a formality, so that the `bff_duplicate_paragraph_spans` folder isn't empty.

**language filtering**
A language tag from fastText exists in the metadata. I decided not to use it in the end, because the share of tokens in non-Nordic languages seems quite small to me. @KennethEnevoldsen do you agree?

- Non-Germanic languages make up about 1.3% of the corpus.
- Non-North-Germanic languages make up 8.6% of tokens.
- The difference is mostly English, at 6.7% of tokens. For comparison, Danish accounts for 13.7% of tokens.
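The shares above were computed along these lines (a hedged sketch: the `lang_fasttext` field name, the Nordic tag set, and whitespace tokenization are assumptions, not the exact script):

```python
from collections import Counter

NORDIC = {"no", "nn", "da", "sv", "is"}  # assumed fastText tags for North Germanic

def token_shares(docs, lang_key="lang_fasttext"):
    """Tally whitespace-token counts per language tag and return shares."""
    counts = Counter()
    for doc in docs:
        counts[doc.get(lang_key, "unk")] += len(doc["text"].split())
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# toy example with two documents
docs = [
    {"text": "dette er norsk tekst", "lang_fasttext": "no"},
    {"text": "this is english", "lang_fasttext": "en"},
]
shares = token_shares(docs)
non_nordic = sum(s for lang, s in shares.items() if lang not in NORDIC)
```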
**ocr**
A big part of the corpus comes from OCR, and the authors haven't done much about OCR errors. Quote: "We have not seen any indication that the OCR errors negatively impacted the performance." However, they also note that OCR quality is time-dependent: better or worse depending on a document's age.
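If OCR noise ever does become a problem, a crude character-level heuristic could flag the worst documents. This is purely illustrative and not part of this PR's pipeline; the punctuation whitelist is an arbitrary choice:

```python
def ocr_noise_score(text):
    """Crude proxy for OCR damage: share of characters that are neither
    letters, whitespace, nor common punctuation. Higher = noisier."""
    if not text:
        return 1.0
    ok = sum(ch.isalpha() or ch.isspace() or ch in ".,;:!?()-'\"" for ch in text)
    return 1 - ok / len(text)

clean = "Dette er en vanlig setning."
noisy = "D3tte 3r 3n v@nl1g s3tn1ng ### ~~~"
```

Documents above some score threshold could then be dropped or routed to heavier cleaning.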
**licensing**
Online newspapers (7.6% of tokens) have a non-commercial clause (CC BY-NC 2.0). The Wikipedia dump (2% of tokens) has a share-alike clause (CC BY-SA 3.0). Depending on where the project goes, these may need to be dropped? The rest of the data is released under very permissive licenses.
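If the NC/SA subsets do need to be dropped, filtering on a license tag would be straightforward (a sketch; the `license` field name and the exact tag strings are assumptions about the metadata, not the actual NCC schema):

```python
# License tags whose documents we might have to exclude (illustrative values)
RESTRICTED = {"CC BY-NC 2.0", "CC BY-SA 3.0"}

def drop_restricted(docs, license_key="license"):
    """Keep only documents whose license tag is not in the restricted set."""
    return [d for d in docs if d.get(license_key) not in RESTRICTED]

# toy example
docs = [
    {"text": "newspaper article", "license": "CC BY-NC 2.0"},
    {"text": "public document", "license": "NLOD 2.0"},
]
kept = drop_restricted(docs)
```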