First version for Klokan+Umimeto

hynky1999 commented 5 months ago

Purpose

Adds two new tasks: Umimeto and KlokanQA.
Both tasks are handled as question answering with predefined answers.
The n_fewshots values are updated so that all samples fit into the 1024-token context length, measured with the gpt2 tokenizer.

Why handle KlokanQA as multi-choice with all possible answers shown

Examples:

Tři klokani váží dohromady 97 kg. Každý z nich má jinou hmotnost, kterou lze vyjádřit přirozeným číslem. Určete největší možnou hmotnost nejlehčího klokana. 1 kg | 30 kg | 31 kg | 32 kg | 33 kg

As can be seen, without the answer options shown in the prompt, there are multiple correct solutions.
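
A minimal sketch of how such a prompt could be rendered, assuming the options are simply appended to the question and joined with ` | ` as in the example above. The function name `format_klokan_prompt` is hypothetical and is not taken from the task code in this PR.

```python
from typing import List


def format_klokan_prompt(question: str, options: List[str]) -> str:
    """Render the question followed by every candidate answer on one line.

    Hypothetical helper: the separator " | " mirrors the example in this PR,
    but the real task template may differ.
    """
    return f"{question} {' | '.join(options)}"


question = (
    "Tři klokani váží dohromady 97 kg. Každý z nich má jinou hmotnost, "
    "kterou lze vyjádřit přirozeným číslem. "
    "Určete největší možnou hmotnost nejlehčího klokana."
)
options = ["1 kg", "30 kg", "31 kg", "32 kg", "33 kg"]

prompt = format_klokan_prompt(question, options)
print(prompt)
```

Each option can then be scored as a possible continuation of this prompt, so the model always sees the full answer set.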

Why handle UmimetoQA as multi-choice with all possible answers shown

Examples:

math;Jednotky hmotnosti: ze života;5;40 g;rohlík;varná konvice
biology;Paprskoploutvé ryby;9;Hmyzem a drobnými bezobratlými živočichy se živí:;pstruh obecný;sumec velký
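
For readers unfamiliar with the raw format, here is a rough sketch of how the two rows above could be split into fields. The column names in `FIELDS` are my own guesses for illustration (the meaning of the third column is not stated here), and `parse_rows` is a hypothetical helper, not part of the task code.

```python
import csv
import io

RAW = (
    "math;Jednotky hmotnosti: ze života;5;40 g;rohlík;varná konvice\n"
    "biology;Paprskoploutvé ryby;9;Hmyzem a drobnými bezobratlými "
    "živočichy se živí:;pstruh obecný;sumec velký\n"
)

# Assumed column names; only the order is taken from the examples above.
FIELDS = ["subject", "topic", "unknown", "question", "option_1", "option_2"]


def parse_rows(text):
    """Split semicolon-separated rows into dicts keyed by FIELDS."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [dict(zip(FIELDS, row)) for row in reader]


rows = parse_rows(RAW)
for row in rows:
    print(row["subject"], "|", row["question"], "|", row["option_1"], "|", row["option_2"])
```

Note how short the question field can be ("40 g"): without the two options next to it, the item carries almost no context.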

For Umimeto, a non-A/B MMLU-style format would also work, but I think this version works better: the task has really poorly worded assignments, and showing the possible answers renders the context of the question better, in my opinion.

Why we use logprobs instead of exact_match

I ran several tests on Mixtral, and for some questions it does not follow the expected format, which makes answer extraction infeasible. This is an even bigger problem when the LLMs are not instruction/RLHF-tuned and work only in completion mode. I had the same experience on my czeval benchmark with weak 7B models.
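
To illustrate the difference, a small sketch under stated assumptions: the log-probabilities below are made-up numbers for illustration only (not measured results), and both helper functions are hypothetical rather than the harness API.

```python
from typing import Dict, List, Optional


def pick_by_logprob(option_logprobs: Dict[str, float]) -> str:
    """Return the option whose continuation got the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def extract_exact(generation: str, options: List[str]) -> Optional[str]:
    """Exact-match style extraction: works only if the output is exactly an option."""
    text = generation.strip()
    return text if text in options else None


options = ["1 kg", "30 kg", "31 kg", "32 kg", "33 kg"]

# Illustrative values only; a real run would obtain these from the model.
option_logprobs = {"1 kg": -4.1, "30 kg": -2.3, "31 kg": -0.9, "32 kg": -1.7, "33 kg": -2.8}

# A free-form generation that ignores the expected answer format.
generation = "Nejlehčí klokan může vážit nejvýše 31 kilogramů, protože ..."

print(pick_by_logprob(option_logprobs))    # always yields one of the options
print(extract_exact(generation, options))  # None: nothing to extract
```

Scoring by log-probability always produces a prediction from the fixed answer set, whereas extraction from generated text depends on the model following the format.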

Misc

The Umimeto dataset is currently unreachable. The reason is simple: it lives in my personal repository on HF in private mode. Since I don't have write permissions to the CZLC group, I can't create a repository there.

hynky1999 commented 4 months ago

To be safe, I ended up balancing Klokan (randomly permuted the answers and updated the correct one accordingly).
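
Roughly, the balancing step looks like the sketch below. It assumes each sample stores its options as a list plus the index of the correct answer; `shuffle_options` and the fixed seed are illustrative choices, not the exact script used.

```python
import random


def shuffle_options(options, correct_index, rng):
    """Randomly permute the options and return the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[new_pos] == old_pos
    shuffled = [options[old_pos] for old_pos in order]
    new_correct = order.index(correct_index)
    return shuffled, new_correct


rng = random.Random(0)  # fixed seed so the permutation is reproducible
options = ["1 kg", "30 kg", "31 kg", "32 kg", "33 kg"]
correct_index = 2  # "31 kg"

shuffled, new_correct = shuffle_options(options, correct_index, rng)
print(shuffled, new_correct)
```

After the permutation the correct answer is no longer tied to its original position, so a model cannot profit from a positional bias in the source data.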

Maximum prompt lengths (without the description), using the `gpt2` tokenizer

Umimeto-qa:

[('biology', 126),
 ('chemistry', 118),
 ('czech', 139),
 ('history', 135),
 ('informatics', 147),
 ('math', 114),
 ('physics', 125)]

Klokan-qa:

[(0, 243), (1, 248), (2, 340), (3, 264), (4, 330), (5, 273)]
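
As a back-of-the-envelope check against the 1024-token limit, here is a worst-case sketch. It assumes the figures above are lengths of a single rendered sample (it is not stated whether few-shot examples are already included) and that every shot is as long as the longest prompt. `max_fewshot` and its `reserved` parameter (tokens set aside for the description and the continuation) are hypothetical.

```python
CONTEXT_LEN = 1024

umimeto_max = {
    "biology": 126, "chemistry": 118, "czech": 139, "history": 135,
    "informatics": 147, "math": 114, "physics": 125,
}
klokan_max = {0: 243, 1: 248, 2: 340, 3: 264, 4: 330, 5: 273}


def max_fewshot(max_prompt_len, context_len=CONTEXT_LEN, reserved=0):
    """Worst case: the query and every shot are as long as the longest prompt."""
    total_samples = (context_len - reserved) // max_prompt_len
    return max(total_samples - 1, 0)  # one slot is taken by the query itself


umimeto_shots = max_fewshot(max(umimeto_max.values()))
klokan_shots = max_fewshot(max(klokan_max.values()))
print("Umimeto-qa:", umimeto_shots)
print("Klokan-qa:", klokan_shots)
```

This bound is conservative, since most samples are shorter than the maximum; setting `reserved` to the description length tightens it further.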

Distribution of the individual classes

Umimeto-qa:

Klokan-qa:

MFajcik commented 4 months ago

Thanks!!!

DCGM / lm-evaluation-harness