DCGM / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

benchmark typos #7

Closed: olisicky closed this issue 5 days ago

olisicky commented 1 week ago

Hi,

I evaluated my model with your benchmark and noticed that some tasks contain typos or odd fragments in their descriptions, for example:

- propaganda_demonizace: `>\n` in the description
- propaganda_emoce: `/n./n` as the description
- csfever_nli: "následujúcích" in the description

I did not go through lm-eval thoroughly, but I can see these descriptions used in the results. I wonder whether they are intentional, e.g. to test the models, or not. I checked the datasets on HF, but there is no specific info about the prompts used here.

I'll be glad for clarification.

Thank you very much!

Ondřej

MFajcik commented 6 days ago

Hello @olisicky. In general, we followed this philosophy: the contents of the prompts are human-written and thus susceptible to errors. With 54 tasks, errors are inevitable. If an error does not arguably impact the meaning, we consider it part of human language. The prompts really depend on the creativity of the person writing them at the time, and they did not follow a standardized process (the only requirement was that a human should be able to understand the prompt).

I wrote or edited all the prompts mentioned in your post, so I think I can respond here. Regarding the specific prompts, I hope you find the answers below.

propaganda_demonizace is designed this way on purpose, to test whether models are susceptible to a code-like writing style when the prompt contains `>\n`.

propaganda_emoce: I don't exactly remember what I was aiming at here. I know I wanted to create some prompts suited to autoregressive language models (so not much of a description, just text + continuation). The description here seems to carry no more meaning than an empty description, and in retrospect an empty description might have been more suitable. However, I don't see why the current form should hurt or improve language model behavior, so it can stay as it is (and we can thus avoid reevaluating all models).
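To illustrate what the description amounts to (a simplified, hypothetical sketch; not this repo's actual prompt-assembly code), it is just a string prepended to the rest of the prompt, so `/n./n` versus an empty string only changes that prefix:

```python
# Hypothetical sketch of how a task "description" typically enters the prompt
# in lm-evaluation-harness-style code; not this repository's implementation.
def build_prompt(description: str, fewshot_examples: list[str], query: str) -> str:
    # The description is only a prefix; "/n./n" vs. "" changes nothing else
    # about the few-shot examples or the query itself.
    return description + "\n\n".join(fewshot_examples + [query])
```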

csfever_nli: This is an error caused by my Slovak/Czech code-mixing. I tried to be careful, but you found a case where I was not successful. Given the lexical similarity between the Czech and Slovak versions of this word, I wouldn't change the prompt. Also, CSfever contains 7 prompts (taken from the previous work of Stefanik et al. and edited), which is more than the 5 we required.

Also, technically, fixing this would require reevaluating csfever across all models, with a very low likelihood of the scores changing (two different characters in one word, in one prompt, given that we take the maximum over 7 prompts).
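For intuition (a toy illustration with invented numbers, not actual scores): with a max-over-prompts aggregation, a tiny edit to one prompt can only move the reported score if that prompt is, or becomes, the maximum:

```python
# Toy illustration of the max-over-prompts argument; the numbers are invented
# and do not correspond to any real evaluation.
scores_per_prompt = [0.61, 0.58, 0.64, 0.60, 0.59, 0.62, 0.57]  # 7 prompt variants
reported = max(scores_per_prompt)

# Fixing two characters in one prompt would perturb that prompt's score only
# slightly, so the maximum (and hence the reported score) is very unlikely to change.
scores_per_prompt[1] += 0.001  # hypothetical, negligible shift from the fix
assert max(scores_per_prompt) == reported
```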

Thanks for the heads-up. Please let me know if you find anything else that looks off.

Also, be aware that we pre-released v0.3 today with a fixed subjectivity task.

olisicky commented 6 days ago

Hi @MFajcik. Thanks for the quick response. I was just curious whether you were aiming at some specific errors, which would be interesting. I understand that some mistakes are inevitable. I can confirm that the changes are minimal for most of the propaganda benchmarks, at least for my model.

I think it is a good decision not to re-evaluate all the models; this should be just a detail.

I have another question for you (sorry for bothering you with everything :D). I pulled the repo after your last commits 4 days ago, and now I have a problem that e.g. benczechmark_csfever_nli does not compute mcauroc, so I cannot continue with the evaluations. I noticed that you also recomputed the leaderboard. Did you recompute all the benchmarks or just some of them? I guess it will be a problem on my side, but it would be nice to confirm that you did not run into similar problems.
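Roughly, what I'm running looks like this (a simplified sketch with a placeholder model name, assuming the upstream harness's Python API; this fork's exact entry points and arguments may differ):

```python
# Simplified sketch, not my exact command; the model name is a placeholder and
# this fork's entry points / metric names may differ from upstream.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name",  # placeholder
    tasks=["benczechmark_csfever_nli"],
)
# After the recent commits, the mcauroc entry is missing from this dict for me.
print(results["results"]["benczechmark_csfever_nli"])
```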

Thank you!

Ondřej

MFajcik commented 6 days ago

Hello @olisicky, as this is unrelated to this issue, maybe we can take it offline (or please open a new issue :)). You can write to me at martin.fajcik@vut.cz so we can exchange contacts. I ran csfever just now with a freshly cloned repo and it went fine (I can share the run command by email).