centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Dataset Cleaning Process #208

Open KennethEnevoldsen opened 10 months ago

KennethEnevoldsen commented 10 months ago

Get an overview of filters:

Check that the filters are valid

Start applying the filters to the dataset

Decide on reasonable thresholds
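
A minimal sketch of what the last two steps could look like, assuming each filter is a scoring function with a minimum threshold. The filter names (`alpha_ratio`, `word_count`), the threshold values, and the `apply_filters` helper are all hypothetical and are not taken from this repository or from Dolma. The per-filter rejection counts are meant to help judge whether a threshold is too strict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Filter:
    """A hypothetical quality filter: keep a document when score >= min_score."""
    name: str
    score: Callable[[str], float]
    min_score: float


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def word_count(text: str) -> float:
    """Number of whitespace-separated tokens."""
    return float(len(text.split()))


# Illustrative thresholds only; the real values are what this issue is meant to decide.
DEFAULT_FILTERS = [
    Filter("alpha_ratio", alpha_ratio, min_score=0.6),
    Filter("word_count", word_count, min_score=3),
]


def apply_filters(
    docs: List[str], filters: List[Filter]
) -> Tuple[List[str], Dict[str, int]]:
    """Return the documents that pass every filter and a rejection count per filter."""
    kept: List[str] = []
    rejected = {f.name: 0 for f in filters}
    for doc in docs:
        failed = [f.name for f in filters if f.score(doc) < f.min_score]
        for name in failed:
            rejected[name] += 1
        if not failed:
            kept.append(doc)
    return kept, rejected
```

Running `apply_filters(docs, DEFAULT_FILTERS)` on a sample of the dataset and inspecting the rejection counts is one way to compare candidate thresholds before applying them to the full dataset.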

@peterbjorgensen does this seem like a reasonable approach to you as well?

peterbjorgensen commented 10 months ago

One of the taggers in Dolma uses Microsoft Presidio for PII detection: https://microsoft.github.io/presidio/. It is unclear to me, even after reading their documentation, whether it works for Danish. I think it boils down to whether a PII classifier exists for Danish. The framework can use spaCy, Stanza, and transformers models.

KennethEnevoldsen commented 10 months ago

^ We can definitely use a spaCy pipeline (the large embeddings model); I believe everything else is too slow. But there is no PII classifier for Danish (only NER).
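
Since only NER is available, one option is to treat certain entity labels as a rough proxy for PII. The sketch below assumes entity spans arrive as `(start_char, end_char, label)` tuples, for example built from `ent.start_char`, `ent.end_char`, and `ent.label_` on a spaCy `doc.ents`. The choice of `PER` and `LOC` as PII-relevant labels and the `mask_spans` helper are assumptions for illustration, not something decided in this thread.

```python
from typing import Iterable, List, Set, Tuple

# (start_char, end_char, label), e.g. taken from spaCy entity spans
Span = Tuple[int, int, str]

# Assumption: which NER labels count as PII is a project decision, not a given.
PII_LABELS: Set[str] = {"PER", "LOC"}


def mask_spans(
    text: str, spans: Iterable[Span], pii_labels: Set[str] = PII_LABELS
) -> str:
    """Replace every span whose label is in pii_labels with a <LABEL> placeholder."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Skip non-PII labels and spans that overlap one already masked.
        if label not in pii_labels or start < cursor:
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

This only catches PII that surfaces as a named entity, so things like email addresses, phone numbers, or CPR numbers would still need separate rules.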