centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License
68 stars 4 forks

Url tagger #241

Closed peterbjorgensen closed 2 months ago

peterbjorgensen commented 4 months ago

This adds a URL tagger based on the categories in the UT1 banlists.
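As a rough illustration of the idea, and not the code in this PR, a category-based URL tagger could look something like the sketch below. It assumes the banlists have already been loaded into a mapping from category name to a list of domains; the names `build_domain_index` and `tag_url` and the toy domains are hypothetical.

```python
from urllib.parse import urlsplit


def build_domain_index(category_domains):
    """Invert {category: [domains]} into {domain: {categories}} for fast lookup."""
    index = {}
    for category, domains in category_domains.items():
        for domain in domains:
            index.setdefault(domain.strip().lower(), set()).add(category)
    return index


def tag_url(url, index):
    """Return the sorted categories that list the URL's host or any parent domain."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    tags = set()
    # Check the full host first, then each parent domain (www.a.b -> a.b -> b).
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        tags |= index.get(candidate, set())
    return sorted(tags)


# Toy banlists for illustration only (hypothetical data, not real UT1 entries).
banlists = {
    "gambling": ["casino.example"],
    "adult": ["adult.example", "casino.example"],
}
index = build_domain_index(banlists)
print(tag_url("https://www.casino.example/play", index))  # ['adult', 'gambling']
```

Matching on parent domains means a subdomain inherits the categories of its listed domain. A URL whose host appears in no list gets an empty tag list, so a downstream filter can decide per category what to drop.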

rlrs commented 4 months ago

Merging this after the robots.txt banlist is added.

peterbjorgensen commented 2 months ago

We will apply these filters to future datasets, but they were not used for 2024v1.