centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License
68 stars 4 forks

Modify the script for dagw: add a fixed source:dagw, rename the origi… #268

Closed TTTTao725 closed 1 month ago

TTTTao725 commented 2 months ago

…nal source: sub-source, move the sub-source to metadata, add domain filtering, add cleaning function.

Still working on the data sheets.
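For readers following along, the record-level changes listed above could look roughly like the sketch below. This is not the script from the PR: the field names (`id`, `text`, `domain`, `metadata`), the helper names `clean_text` and `convert_record`, and the whitespace-only cleaning rule are all assumptions made for illustration.

```python
from typing import Any, Optional


def clean_text(text: str) -> str:
    """Hypothetical cleaning step: collapse whitespace runs and trim the ends."""
    return " ".join(text.split())


def convert_record(
    record: dict[str, Any], allowed_domains: set[str]
) -> Optional[dict[str, Any]]:
    """Convert one raw record to the new layout, or return None if filtered out.

    Assumed input fields: "id", "text", "source", "domain", optional "metadata".
    """
    # Domain filtering: drop records whose domain is not explicitly allowed.
    if record.get("domain") not in allowed_domains:
        return None

    # Copy the metadata so the input record is not mutated.
    metadata = dict(record.get("metadata", {}))
    # The original source becomes the sub-source and moves into the metadata.
    metadata["sub-source"] = record["source"]

    return {
        "id": record["id"],
        "text": clean_text(record["text"]),
        "source": "dagw",  # fixed top-level source for every record
        "metadata": metadata,
    }
```

With this shape, a caller can filter a whole collection by keeping only the records for which `convert_record` does not return `None`.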

TTTTao725 commented 2 months ago

add the script for converting: https://huggingface.co/datasets/alexandrainst/nordjylland-news-summarization

TTTTao725 commented 2 months ago

languages: da

I'm not sure what else we have :(

peterbjorgensen commented 2 months ago

Datasheet looks very good. @peterbjorgensen let me know your thoughts on this

I notice that Hugging Face uses --- to separate the YAML from the Markdown. I don't know if this is a hard requirement for the Hugging Face Hub and libraries to work correctly?

peterbjorgensen commented 2 months ago

So, with respect to this decision (https://github.com/centre-for-humanities-computing/danish-foundation-models/issues/266#issuecomment-2082092835), we need to update the dataset conversion script so that each sub-source is split into its own dataset, each with a separate dataset card.
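A minimal sketch of the splitting step, assuming each converted record carries its sub-source under `metadata["sub-source"]`. The function name `split_by_sub_source` is hypothetical and is not taken from the repository.

```python
from collections import defaultdict
from typing import Any


def split_by_sub_source(
    records: list[dict[str, Any]],
) -> dict[str, list[dict[str, Any]]]:
    """Group records by sub-source so each group can become its own dataset."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        # Assumed location of the sub-source after the conversion step.
        groups[record["metadata"]["sub-source"]].append(record)
    # Return a plain dict so a missing key raises KeyError instead of
    # silently creating an empty group.
    return dict(groups)
```

Each key of the returned mapping would then correspond to one dataset and one dataset card.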

KennethEnevoldsen commented 2 months ago

I notice that Hugging Face uses --- to separate the YAML from the Markdown. I don't know if this is a hard requirement for the Hugging Face Hub and libraries to work correctly?

I believe it is
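To make the layout under discussion concrete, here is a small sketch that writes a card with a `---`-delimited YAML block and splits it back apart. It uses only the standard library and flat `key: value` pairs. The helper names `build_card` and `split_front_matter` are hypothetical, and the sketch is not how the Hugging Face libraries parse cards.

```python
def build_card(metadata: dict[str, str], body: str) -> str:
    """Render flat metadata as a YAML block between '---' lines, then the body."""
    yaml_lines = [f"{key}: {value}" for key, value in metadata.items()]
    return "\n".join(["---", *yaml_lines, "---", "", body])


def split_front_matter(card: str) -> tuple[str, str]:
    """Return (yaml_text, markdown_body); yaml_text is empty if no block is found."""
    lines = card.split("\n")
    # The block only counts if the very first line is the '---' delimiter.
    if not lines or lines[0].strip() != "---":
        return "", card
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_text = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return yaml_text, body
    # An opening delimiter without a closing one is treated as plain Markdown.
    return "", card


card = build_card({"languages": "da"}, "# dagw\n\nDataset description goes here.")
yaml_text, body = split_front_matter(card)
```

Without the delimiters, a parser of this kind has no way to tell where the metadata ends and the Markdown begins, which is why the separator matters.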

TTTTao725 commented 2 months ago

Thanks guys, I'll fix it this week :)

TTTTao725 commented 1 month ago

Hi guys, what do you think of the Markdown files I generated automatically? 🤩

KennethEnevoldsen commented 1 month ago

@TTTTao725 let us get this merged in as well and then create a separate PR where we address some of the issues raised by the dataset validator #269.

TTTTao725 commented 1 month ago

No problem!