Mikeriess opened 1 month ago
The following datasets from NorBench are already in ScandEval:
NoCoLA could serve as a replacement for ScaLA-nb, but tests would need to be conducted first.
I included PoS tagging in the past, but since almost all medium-to-good models achieved near-identical, near-perfect performance, I decided to remove it again.
Coreference resolution could be cool to include (the NARC dataset doesn't seem finished yet, however), and potentially dependency parsing. I included dependency parsing in the past, and the main issue was that I didn't find a canonical way to fit models on it without resorting to complicated graph-based approaches, as in spaCy. Also, evaluating decoder models on dependency parsing might be tricky.
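For context on what evaluating dependency parsing involves: parsers are conventionally scored with unlabeled and labeled attachment scores (UAS/LAS). A minimal sketch (not ScandEval code, just the standard metric definition):

```python
def uas_las(gold, pred):
    """Compute (UAS, LAS) for one sentence.

    gold/pred: lists of (head_index, relation_label) per token.
    UAS counts tokens with the correct head; LAS additionally
    requires the correct relation label.
    """
    assert len(gold) == len(pred)
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    label_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / n, label_hits / n

# Toy example: the parser gets the third token's head wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(uas_las(gold, pred))
```

The tricky part for decoder models is producing valid `(head, label)` structures in the first place, not the scoring itself.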
Several of the text classification tasks could also be interesting. It would just need an analysis in terms of which datasets to include, to avoid including too many redundant ones.
Thanks for the fast reply, Dan.
OK - so the following classification datasets seem relevant to include:
Linguistic acceptability and co-reference:
Do you need some help with this?
NoReC = Norwegian Review Corpus, and that one is already in ScandEval. I know there's a fine-grained version, but at least one of the versions is already in there 🙂
NorDial and Talk of Norway might make sense, but as with NoCoLA, it would depend on the tests. If all models score essentially perfectly on a dataset, for instance, then it makes less sense to include it.
But what we can definitely do straight away is add NorDial, Talk of Norway and NoCoLA as unofficial datasets (this is simply setting `unofficial=True` in the dataset configuration). This means that they won't automatically be included when running `scandeval --language no`, but you can still benchmark models on them using, e.g., `scandeval --dataset nordial`. This will allow us to perform the evaluation tests.
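As a rough illustration of the idea (hypothetical names only — this is not ScandEval's actual configuration code): an unofficial dataset is skipped by language-wide runs but still selectable by name.

```python
from dataclasses import dataclass

# Hypothetical sketch of the official/unofficial distinction described
# above. The real ScandEval configuration API may look quite different.

@dataclass
class DatasetConfig:
    name: str
    language: str
    unofficial: bool = False

DATASETS = [
    DatasetConfig("norec", "no"),
    DatasetConfig("nordial", "no", unofficial=True),
]

def for_language(lang):
    # `scandeval --language no` style selection: official datasets only.
    return [d.name for d in DATASETS if d.language == lang and not d.unofficial]

def by_name(name):
    # `scandeval --dataset nordial` style selection: ignores the flag.
    return next(d for d in DATASETS if d.name == name)

print(for_language("no"))       # the unofficial dataset is excluded
print(by_name("nordial").name)  # but can still be picked explicitly
```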
Alright - I can do the initial tests on our end. Any suggestions for how we should go about this practically? Pick 5-10 of the most popular open models and run the evaluation on them?
Great! And at least 10 models, yeah - ideally a bit more, spanning "good", "medium" and "bad" models 🙂
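The evaluation test discussed above can be sketched as follows (helper name and thresholds are hypothetical, and the scores are illustrative numbers, not real benchmark results): run 10+ models spanning quality tiers and check whether scores actually spread out, or whether everything is near-perfect, in which case the dataset adds little, as with PoS tagging earlier.

```python
def discriminates(scores, ceiling=0.95, min_spread=0.05):
    """Decide whether a candidate dataset separates models.

    scores: mapping of model name -> benchmark score in [0, 1].
    A dataset is uninformative if every model is near the ceiling,
    or if the best and worst models are barely apart.
    """
    vals = list(scores.values())
    near_perfect = all(v >= ceiling for v in vals)
    spread = max(vals) - min(vals)
    return not near_perfect and spread >= min_spread

# Illustrative numbers only -- not real evaluation results.
saturated = {"model_a": 0.98, "model_b": 0.97, "model_c": 0.99}
useful = {"model_a": 0.92, "model_b": 0.74, "model_c": 0.55}

print(discriminates(saturated))  # False: everyone is near-perfect
print(discriminates(useful))     # True: scores spread across tiers
```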
Dataset name
NorBench
Dataset link
https://github.com/ltgoslo/norbench
Dataset languages
Norwegian
Describe the dataset
This suggestion is about adding some of the datasets from NorBench to the Norwegian NLG evaluation: https://github.com/ltgoslo/norbench