Closed: larsbun closed this 4 days ago
@larsbun Is it possible to separate the Norwegian/Swedish samples in the dataset, or would we have to use a language identification model?
I didn't know about the dataset until someone told me about it at dinner on Tuesday (i.e., I haven't worked with it), but a crude regexp search for the fields gave me this:
article_id newsroom article_title article_text_all summary num_bulletpoints num_words_per_bulletpoint num_words_per_bulletpoint_bucket num_bulletpoints_bucket
Seemingly, there is no language ID field there. But I guess a language ID model should separate them with 100% accuracy, since the alphabets are different. I just looked through the file and saw some Swedish text.
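To illustrate the point about the alphabets: Norwegian text typically uses æ and ø where Swedish uses ä and ö, so even a crude character count can separate many samples. The following is only a minimal sketch of that idea, not a real language identification model; the function name `guess_language` is hypothetical, and the labels `'no'` and `'se'` simply follow the ones used in this thread. Texts containing none of these letters, or quoting names from the other language, would need a proper classifier.

```python
from typing import Optional

# Letters that distinguish written Norwegian from written Swedish.
# (å is shared by both alphabets, so it carries no signal here.)
NORWEGIAN_ONLY = set("æøÆØ")
SWEDISH_ONLY = set("äöÄÖ")


def guess_language(text: str) -> Optional[str]:
    """Crude heuristic: compare counts of language-specific letters.

    Returns 'no', 'se', or None when the counts are tied (including
    the case where neither kind of letter occurs).
    """
    no_hits = sum(ch in NORWEGIAN_ONLY for ch in text)
    se_hits = sum(ch in SWEDISH_ONLY for ch in text)
    if no_hits == se_hits:
        return None
    return "no" if no_hits > se_hits else "se"
```

For example, `guess_language("Været blir kjølig i helgen")` returns `'no'`, while a text without any of these letters returns `None`.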
Hey, was just notified of this. Didn't include any language identifier, but it's pretty easy to do that based on newsroom.
Are you familiar with which newsrooms correspond to which languages? I can then make a PR where we use this hardcoded mapping instead of depending on a language classifier.
Here are the 13 newsroom abbreviations:
{'sno-commercial', 'vektklubb', 'e24', 'e24partnerstudio', 'dinepenger', 'vgpartnerstudio', 'bt', 'tekno', 'vg', 'ap', 'randaberg24', 'ab', 'sa'}
I should be, as I work there :D
{
'sno-commercial' : 'no',
'vektklubb' : 'no',
'e24' : 'no',
'e24partnerstudio' : 'no',
'dinepenger' : 'no',
'vgpartnerstudio' : 'no',
'bt' : 'no',
'tekno' : 'no',
'vg' : 'no',
'ap' : 'no',
'randaberg24' : 'no',
'ab' : 'se',
'sa' : 'no'}
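The mapping above could be applied roughly as follows. This is a minimal sketch, assuming each sample is a dict with a `newsroom` key (as the field list earlier in the thread suggests); the helper name `split_by_language` is hypothetical and not part of any existing codebase. The labels are kept exactly as given above; note that `'se'` is the country code for Sweden, whereas the ISO 639-1 language code for Swedish is `sv`, so a PR may want to normalise that.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Newsroom-to-language mapping, copied verbatim from the comment above.
NEWSROOM_TO_LANGUAGE = {
    "sno-commercial": "no",
    "vektklubb": "no",
    "e24": "no",
    "e24partnerstudio": "no",
    "dinepenger": "no",
    "vgpartnerstudio": "no",
    "bt": "no",
    "tekno": "no",
    "vg": "no",
    "ap": "no",
    "randaberg24": "no",
    "ab": "se",
    "sa": "no",
}


def split_by_language(samples: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group samples by language using the hardcoded newsroom mapping."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        newsroom = sample["newsroom"]
        if newsroom not in NEWSROOM_TO_LANGUAGE:
            # Fail loudly rather than silently guessing a language.
            raise KeyError(f"Unknown newsroom: {newsroom!r}")
        groups[NEWSROOM_TO_LANGUAGE[newsroom]].append(sample)
    return dict(groups)
```

Raising on an unknown newsroom is a deliberate choice in this sketch: if the dataset later gains a newsroom that is missing from the mapping, the split fails visibly instead of dropping or mislabelling samples.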
Dataset name
Schibsted Summaries
Dataset link
https://huggingface.co/datasets/Schibsted/schibsted-article-summaries
Dataset languages
Describe the dataset
This is a JSON-formatted dataset of articles and human-written (as far as I can tell) summaries from the Schibsted corporation, in Norwegian and Swedish.