centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Dolma url count #251

Closed. peterbjorgensen closed this 5 months ago

peterbjorgensen commented 6 months ago

Add a simple (single-process) script and a more complex (multiprocessing) script to count the domains in a dataset.
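For readers who want a feel for the two approaches, here is a minimal sketch. It is not the code from this PR. It assumes Dolma-style JSONL input where each record keeps its source URL under `metadata.url`; that field name, and the helper names `domain_of`, `count_domains`, and `count_domains_parallel`, are hypothetical and chosen for illustration only.

```python
import json
from collections import Counter
from multiprocessing import Pool
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased hostname of a URL, or None if it has none."""
    try:
        return urlparse(url).hostname
    except ValueError:
        # urlparse raises ValueError on some malformed inputs (e.g. bad IPv6).
        return None


def count_domains(lines):
    """Single-process variant: count domains over an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: the URL lives under metadata.url in each record.
        url = (record.get("metadata") or {}).get("url")
        domain = domain_of(url) if url else None
        if domain:
            counts[domain] += 1
    return counts


def count_domains_parallel(lines, workers=2, chunk_size=1000):
    """Multiprocessing variant: count each chunk in a worker, then merge."""
    lines = list(lines)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_domains, chunks)
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total
```

The multiprocessing variant only reuses the single-process function on chunks and sums the per-chunk counters, so both should give the same totals. Call `count_domains_parallel` from under an `if __name__ == "__main__":` guard so it also works with the `spawn` start method.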

peterbjorgensen commented 6 months ago

The domain list is on UCloud at dfm-data/domain-list

KennethEnevoldsen commented 5 months ago

@peterbjorgensen should we keep the domain lists on GitHub (so users can add their domains to the list or remove them from it)?