[BENCHMARK DATASET REQUEST] Schibsted Summaries

ScandEval / ScandEval

Evaluation of language models on mono- or multilingual tasks.

https://scandeval.com

MIT License


[BENCHMARK DATASET REQUEST] Schibsted Summaries #526

Closed (larsbun closed this 4 days ago)

larsbun commented 1 month ago

Dataset name

Schibsted Summaries

Dataset link

https://huggingface.co/datasets/Schibsted/schibsted-article-summaries

Dataset languages

[ ] Danish
[X] Swedish
[X] Norwegian (Bokmål or Nynorsk)
[ ] Icelandic
[ ] Faroese
[ ] German
[ ] Dutch
[ ] English

Describe the dataset

This is a JSON-formatted dataset of articles and (as far as I can tell) human-written summaries, in Norwegian and Swedish, from the Schibsted corporation.

$ grep -c "\"summary\"" *
README.md:0
summary-data-test.jsonl:517
summary-data-train.jsonl:2000
summary-data-validation.jsonl:491
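For anyone who prefers to check the counts without `grep`, here is a minimal Python sketch of the same idea. It assumes each `.jsonl` file holds one JSON object per line and counts the records that carry a `summary` key; the helper name `count_summaries` is made up for illustration.

```python
import json
from pathlib import Path


def count_summaries(path: Path) -> int:
    """Count JSONL records in `path` that contain a "summary" key."""
    count = 0
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "summary" in record:
                count += 1
    return count


if __name__ == "__main__":
    # Assumes the split files sit in the current directory.
    for path in sorted(Path(".").glob("summary-data-*.jsonl")):
        print(f"{path.name}: {count_summaries(path)}")
```

Unlike `grep -c`, which counts lines matching the string anywhere, this counts parsed records with the key, so a stray "summary" inside an article text would not be counted twice.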

saattrupdan commented 1 month ago

@larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?

larsbun commented 1 month ago

> @larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?

I didn't know about the dataset until someone told me at dinner on Tuesday (i.e., I haven't worked on it), but a crude regexp search for the fields gave me this:

article_id newsroom article_title article_text_all summary num_bulletpoints num_words_per_bulletpoint num_words_per_bulletpoint_bucket num_bulletpoints_bucket

Seemingly, there is no language ID there. But I guess a language ID model should separate them with close to 100% accuracy, since the alphabets differ (Swedish uses ä and ö where Norwegian uses æ and ø). I just looked through the file and saw some Swedish text.
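To illustrate the alphabet argument, here is a crude sketch that guesses the language from the characteristic letters alone. It is not a real language identification model, and the function name `guess_language` and the labels `sv`/`no` (ISO 639-1 codes) are my own choices. It also assumes the data only contains Swedish and Norwegian, since Danish uses æ and ø as well.

```python
def guess_language(text: str) -> str:
    """Return 'sv', 'no', or 'unknown' based on characteristic letters.

    Heuristic only: Swedish writes ä/ö where Norwegian writes æ/ø.
    """
    lowered = text.lower()
    swedish = sum(lowered.count(ch) for ch in "äö")
    norwegian = sum(lowered.count(ch) for ch in "æø")
    if swedish > norwegian:
        return "sv"
    if norwegian > swedish:
        return "no"
    return "unknown"  # no evidence either way, or a tie
```

Full articles should contain plenty of these letters, but short texts or texts quoting names from the other language could fool it, so the result would need spot-checking.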

simeneide commented 2 weeks ago

Hey, I was only notified of this now. I didn't include any language identifier, but it's pretty easy to derive one from the newsroom.

oliverkinch commented 2 weeks ago

> Hey, I was only notified of this now. I didn't include any language identifier, but it's pretty easy to derive one from the newsroom.

Do you know which newsrooms correspond to which languages? I can then make a PR where we use this hardcoded mapping instead of depending on a language classifier.

Here are the 13 newsroom abbreviations:

{'sno-commercial', 'vektklubb', 'e24', 'e24partnerstudio', 'dinepenger', 'vgpartnerstudio', 'bt', 'tekno', 'vg', 'ap', 'randaberg24', 'ab', 'sa'}

simeneide commented 2 weeks ago

I should be, as I work there :D

{
    'sno-commercial': 'no',
    'vektklubb': 'no',
    'e24': 'no',
    'e24partnerstudio': 'no',
    'dinepenger': 'no',
    'vgpartnerstudio': 'no',
    'bt': 'no',
    'tekno': 'no',
    'vg': 'no',
    'ap': 'no',
    'randaberg24': 'no',
    'ab': 'se',
    'sa': 'no',
}
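A rough sketch of how this mapping could be used to split the samples, assuming each record is a dict with a `newsroom` field as in the field list earlier in the thread. This is not ScandEval's actual implementation; the names `NEWSROOM_TO_LANGUAGE` and `split_by_language` are hypothetical. The values are kept exactly as given above, so `se` here stands for Swedish (the ISO 639-1 code would be `sv`).

```python
from collections import defaultdict

# Mapping copied from the comment above; 'se' denotes Swedish here.
NEWSROOM_TO_LANGUAGE = {
    "sno-commercial": "no",
    "vektklubb": "no",
    "e24": "no",
    "e24partnerstudio": "no",
    "dinepenger": "no",
    "vgpartnerstudio": "no",
    "bt": "no",
    "tekno": "no",
    "vg": "no",
    "ap": "no",
    "randaberg24": "no",
    "ab": "se",
    "sa": "no",
}


def split_by_language(records):
    """Group records by language using each record's "newsroom" field."""
    groups = defaultdict(list)
    for record in records:
        newsroom = record["newsroom"]
        language = NEWSROOM_TO_LANGUAGE.get(newsroom)
        if language is None:
            # Fail loudly so a new newsroom is not silently dropped.
            raise ValueError(f"Unknown newsroom: {newsroom!r}")
        groups[language].append(record)
    return dict(groups)
```

Calling `split_by_language` on the loaded records returns a dict with one list per language code. Raising on an unknown newsroom is a deliberate choice, so that a future dataset update with a new newsroom gets noticed instead of quietly shrinking a split.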