centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Add data from Cellar #298

Open kris927b opened 4 weeks ago

kris927b commented 4 weeks ago

Cellar is a repository of European Union publications, managed by the Publications Office of the European Union.

What knowledge does Cellar contain?

- EU legal knowledge
- Information on EU policy
- Research & educational knowledge
- Organizational view of the EU
- Historical knowledge for the EU
- Public procurement documents (soon)
- Documents from other knowledge domains

Note: The EURLex data is contained in Cellar, so it should either be filtered out of Cellar or removed as a separate dataset.

saattrupdan commented 4 weeks ago

Regarding the EURLex overlap: This will probably be handled automatically during deduplication anyway, I suppose?

kris927b commented 4 weeks ago

Yeah, with deduplication this should not be a problem. I guess the only reason to remove them beforehand would be to minimise preprocessing time?
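
For illustration, a pre-filter of the kind discussed above could look roughly like the sketch below. It is a minimal, exact-match example and not the project's actual deduplication pipeline: the function names `fingerprint` and `drop_overlap` and the sample strings are hypothetical, and it assumes that overlapping documents are identical up to whitespace and casing. Near-duplicates would still need to be caught by the regular deduplication step.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hash of the whitespace-normalised, lower-cased text."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_overlap(cellar_docs, eurlex_docs):
    """Keep only the Cellar documents whose fingerprint is not found in EURLex.

    Hypothetical helper: both arguments are plain lists of document strings.
    """
    seen = {fingerprint(doc) for doc in eurlex_docs}
    return [doc for doc in cellar_docs if fingerprint(doc) not in seen]


# Made-up sample data, only to show the behaviour.
eurlex = ["Regulation (EU) 2016/679  of the European Parliament"]
cellar = [
    "regulation (EU) 2016/679 of the European Parliament",
    "Annual report of the Publications Office",
]

remaining = drop_overlap(cellar, eurlex)
print(remaining)
```

Hashing each document once keeps this step linear in the number of documents, which is where any saving in preprocessing time would come from compared with running full deduplication over the combined data.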