
[BENCHMARK DATASET REQUEST] NorBench #435

Mikeriess opened this issue 1 month ago

Mikeriess commented 1 month ago

Dataset name

NorBench

Dataset link

https://github.com/ltgoslo/norbench

Dataset languages

Norwegian (Bokmål and Nynorsk)

Describe the dataset

This is a suggestion to add some of the datasets from NorBench to the Norwegian NLG evaluation: https://github.com/ltgoslo/norbench

saattrupdan commented 1 month ago

The following datasets from NorBench are already in ScandEval:

  1. NorQuAD
  2. NorNE (both Bokmål and Nynorsk)
  3. NoReC

NoCoLA could be a replacement for ScaLA-nb, but tests would need to be conducted first.

I included PoS tagging in the past, but since almost all medium-to-good models achieved nearly identical, near-perfect performance, I decided to remove it again.

Coreference resolution could be cool to include (though the NARC dataset doesn't seem finished yet), and potentially dependency parsing. I included dependency parsing in the past, and the main issue was that I couldn't find a canonical way to fit models on it without resorting to complicated graph-based approaches, as in spaCy. Also, evaluating decoder models on dependency parsing might be tricky.

Several of the text classification tasks could also be interesting. It would just require an analysis of which datasets to include, to avoid too many redundant ones.

Mikeriess commented 1 month ago

Thanks for the fast reply, Dan.

OK - so the following classification datasets seem relevant to include:

Linguistic acceptability and coreference:

Do you need some help with this?

saattrupdan commented 1 month ago

NoReC = Norwegian Review Corpus, and that one is already in ScandEval. I know there's a fine-grained version, but at least one of the versions is already in there 🙂

NorDial and Talk of Norway might make sense to include, but, like NoCoLA, that would depend on tests. If all models basically score perfectly on a dataset, for instance, then it makes less sense to include it.

But what we can definitely do straight away is to add NorDial, Talk of Norway and NoCoLA as unofficial datasets (this simply means setting `unofficial=True` in the dataset configuration). This means that they won't automatically be included when running `scandeval --language no`, but you can still benchmark models on them using, e.g., `scandeval --dataset nordial`. This will allow us to perform the evaluation tests.
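For reference, here's a minimal sketch of what such an unofficial configuration entry could look like. Apart from the `unofficial` flag itself, the field names and values below are illustrative assumptions, not ScandEval's actual configuration API:

```python
# Illustrative sketch only: apart from `unofficial`, the field names and
# values here are assumptions, not ScandEval's actual API.
from dataclasses import dataclass, field

@dataclass
class DatasetConfig:
    name: str                      # CLI identifier, e.g. `scandeval --dataset nordial`
    huggingface_id: str            # where the dataset is hosted (assumed field)
    task: str                      # e.g. "text-classification" (assumed field)
    languages: list[str] = field(default_factory=list)
    unofficial: bool = False       # True: excluded from `scandeval --language no`

# NorDial as an unofficial dataset: benchmarkable on demand, but not part of
# the default Norwegian suite.
NORDIAL_CONFIG = DatasetConfig(
    name="nordial",
    huggingface_id="ltg/nordial",  # placeholder ID, an assumption
    task="text-classification",
    languages=["nb", "nn"],
    unofficial=True,
)
```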

Mikeriess commented 1 month ago

Alright - I can do the initial tests on our end. Any suggestions for how we should go about this practically? Pick 5-10 of the most popular open models and run the evaluation on them?

saattrupdan commented 1 month ago

Great! And at least 10 models, yeah - ideally a bit more, covering "good", "medium" and "bad" models 🙂
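As a concrete starting point, here's a rough sketch of such a test run using ScandEval's Python API. The `Benchmarker` class is part of ScandEval, but the exact call signature, the model list and the dataset identifiers below are assumptions for illustration; check the documentation for the real interface:

```python
# Rough sketch, not a verified script: the call signature and the
# model/dataset identifiers below are assumptions for illustration.
from scandeval import Benchmarker

# A spread of "good", "medium" and "bad" open models (hypothetical list);
# extend to 10+ models of varying quality.
MODELS = [
    "NbAiLab/nb-bert-base",
    "ltg/norbert3-base",
    "xlm-roberta-base",
]

# The candidate unofficial datasets (assumed identifiers).
DATASETS = ["nordial", "talk-of-norway", "nocola"]

benchmarker = Benchmarker()
for model_id in MODELS:
    for dataset in DATASETS:
        # Benchmark one model on one dataset (assumed keyword arguments).
        benchmarker(model=model_id, dataset=dataset)
```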