Open KennethEnevoldsen opened 10 months ago
One of the taggers in Dolma uses Microsoft Presidio for PII detection: https://microsoft.github.io/presidio/ It is a bit unclear to me (even after reading their documentation) whether it works for Danish. I think it boils down to whether a PII classifier exists for Danish. The framework can use spaCy, stanza and transformers models.
^ We can definitely use a spaCy pipeline (the large embeddings model); I believe everything else is too slow. But there is no PII classifier for Danish (only NER).
Get an overview of filters:
Check that the filters are valid
[ ] Apply all filters to DAGW (the whole dataset) and see if any of the filters are problematic (none should filter out an extreme amount of data, as we consider the dataset fairly clean)
Start applying filters to the dataset
Decide on a reasonable threshold
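The check in the steps above could be sketched roughly like this: compute, per filter, the fraction of documents it would remove, and flag any filter above a chosen threshold. This is only a minimal sketch and not the Dolma tagger API. The names `removal_rates` and `flag_problematic`, the two toy filters, the sample documents and the `max_rate` value are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A filter returns True when the document should be removed.
Filter = Callable[[str], bool]


def removal_rates(docs: List[str], filters: Dict[str, Filter]) -> Dict[str, float]:
    """Fraction of documents each filter would remove when applied on its own."""
    if not docs:
        return {name: 0.0 for name in filters}
    return {
        name: sum(1 for doc in docs if should_remove(doc)) / len(docs)
        for name, should_remove in filters.items()
    }


def flag_problematic(rates: Dict[str, float], max_rate: float) -> List[str]:
    """Names of filters that remove more than max_rate of the documents."""
    return [name for name, rate in rates.items() if rate > max_rate]


# Hypothetical toy filters, stand-ins for the real taggers.
filters: Dict[str, Filter] = {
    "too_short": lambda doc: len(doc.split()) < 3,
    "mostly_digits": lambda doc: sum(c.isdigit() for c in doc) > 0.5 * max(len(doc), 1),
}

# Hypothetical sample, standing in for DAGW.
docs = [
    "Dette er en ren dansk sætning.",
    "ok",
    "1234567890",
    "Endnu en helt almindelig tekst her.",
]

rates = removal_rates(docs, filters)
problematic = flag_problematic(rates, max_rate=0.3)  # threshold is an assumption
print(rates, problematic)
```

Since we consider the dataset fairly clean, any filter that ends up in `problematic` is a candidate for manual inspection before a final threshold is decided.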
@peterbjorgensen does this seem like a reasonable approach to you as well?