Closed TTTTao725 closed 1 month ago
add the script for converting: https://huggingface.co/datasets/alexandrainst/nordjylland-news-summarization
languages: da
I'm not sure what else we have :(
Datasheet looks very good. @peterbjorgensen let me know your thoughts on this
I notice that huggingface uses `---` to separate the YAML from the Markdown. I don't know if this is a hard requirement for the huggingface hub and libraries to work correctly?
With respect to this decision (https://github.com/centre-for-humanities-computing/danish-foundation-models/issues/266#issuecomment-2082092835), we need to update the dataset conversion script so that each sub-source is split into its own dataset, each with a separate dataset card.
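For reference, a minimal sketch of what splitting by sub-source with one card per dataset could look like. The record layout (`metadata["sub_source"]`), the helper names, and the card contents are all assumptions for illustration, not the actual conversion script in this PR.

```python
from collections import defaultdict


def split_by_sub_source(records):
    """Group records by their (hypothetical) metadata['sub_source'] value."""
    groups = defaultdict(list)
    for record in records:
        groups[record["metadata"]["sub_source"]].append(record)
    return dict(groups)


def render_card(name, num_docs, language="da"):
    """Render a minimal card: YAML block between '---' lines, then Markdown."""
    return (
        "---\n"
        f"language:\n- {language}\n"
        "---\n"
        f"# {name}\n\n"
        f"Number of documents: {num_docs}\n"
    )


# Made-up example records; real records would come from the source dataset.
records = [
    {"text": "a", "metadata": {"sub_source": "news"}},
    {"text": "b", "metadata": {"sub_source": "forum"}},
    {"text": "c", "metadata": {"sub_source": "news"}},
]

datasets = split_by_sub_source(records)
cards = {name: render_card(name, len(docs)) for name, docs in datasets.items()}
```

Each key in `datasets` would then be written out as its own dataset, with the matching entry in `cards` saved next to it as the dataset card.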
> I notice that huggingface uses `---` to separate the YAML from the Markdown. I don't know if this is a hard requirement for the huggingface hub and libraries to work correctly?
I believe it is
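To make the `---` convention concrete, here is a small sketch that splits a card into its YAML block and Markdown body using only the standard library. The function name and the sample card are made up for illustration; this is not how the huggingface libraries parse cards internally, just a way to check that a generated file has the expected shape.

```python
def split_front_matter(card_text):
    """Split a dataset card into (yaml_block, markdown_body).

    Returns an empty yaml_block when the text does not start with a '---'
    line or when the closing '---' line is missing.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_block = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return yaml_block, body
    # Opening delimiter without a closing one: treat as no front matter.
    return "", card_text


# Illustrative card only; the real cards in this PR may carry other fields.
card = """---
language:
- da
---
# Nordjylland News

Summarization dataset.
"""

yaml_block, body = split_front_matter(card)
no_yaml, untouched = split_front_matter("# Just Markdown\n")
```

A card without the leading `---` line comes back with an empty YAML block, which is an easy condition to flag in a validation step.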
Thanks guys, I'll fix it this week :)
Hi guys, what do you think of the Markdown files I generated automatically? 🤩
@TTTTao725 let us get this merged in as well and then create a separate PR where we address some of the issues raised by the dataset validator #269.
No problem!
…nal source: sub-source, move the sub-source to metadata, add domain filtering, add cleaning function.
Still working on the data sheets.