centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License
68 stars, 4 forks

CulturaX #260

Closed: KennethEnevoldsen closed this issue 5 months ago

KennethEnevoldsen commented 5 months ago

Have we considered: https://huggingface.co/datasets/uonlp/CulturaX

It seems to be mostly Common Crawl content, and I could imagine that colossal C4 is probably a better solution.

@peterbjorgensen I am assuming we have. If that is the case, we can simply close this issue (it is here just so that there is a record of it in the future).

KennethEnevoldsen commented 5 months ago

@peterbjorgensen I am assuming we don't want to include this, so I will close the issue.