Holmes-Benchmark / holmes-evaluation


Missing labels for tasks #9

Open gab709 opened 2 weeks ago

gab709 commented 2 weeks ago

Hello, I'm interested in evaluating custom models on your benchmark. It would be helpful to have a file that maps each task to its corresponding phenomenon. I've tried to extract this information from the leaderboard file, but not all datasets appear in that file. Specifically, after evaluating on Flash Holmes, I looked up each task in the leaderboard file to find the related phenomenon. Even so, 25 tasks have no match.

List of missing tasks
xpos missing
zorro-irregular-verb missing
gum-rst-cut-edu-relation-group missing
ewok-social-relations missing
gum-rst-cut-edu-relation missing
ewok-agent-properties missing
ewok-quantitative-properties missing
gum-rst-cut-edu-type missing
ewok-material-dynamics missing
ewok-material-properties missing
upos missing
gum-rst-cut-edu-depth missing
ewok-physical-dynamics missing
fuse-negation missing
ewok-physical-interactions missing
zorro-agreement_determiner_noun-across_1_adjective missing
ewok-social-properties missing
bioscope-negation missing
gum-rst-cut-edu-distance missing
gum-rst-cut-edu-successively missing
speculation missing
ewok-social-interactions missing
ewok-physical-relations missing
ewok-spatial-relations missing
gum-rst-cut-edu-count missing

It would be helpful to be able to replicate the analysis you perform in the explorer (https://holmes-explorer.streamlit.app/) with a custom result file. Thank you in advance! 😊

holmesbenchmark commented 2 weeks ago

Hi @gab709, thanks for your question. You can map tasks using the Holmes leaderboard file. In that file, the column probing dataset corresponds to the specific task and the column linguistic phenomena gives the mapping, for example xpos -> part-of-speech. It is true that the benchmark contains more tasks than the leaderboard file. We have already extended the benchmark with some additional tasks, but do not yet have sufficient results to update the explorer.
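For anyone who wants to script that lookup, here is a minimal sketch, not the official tooling. It assumes the leaderboard file is a CSV with one row per result and the two columns named above (probing dataset, linguistic phenomena). The function names, the model column, and the toy rows are hypothetical and only there for illustration; only the xpos -> part-of-speech pair comes from this thread.

```python
import csv
import io


def build_task_to_phenomenon(rows,
                             task_col="probing dataset",
                             phenomenon_col="linguistic phenomena"):
    """Collect a task -> phenomenon mapping from leaderboard rows.

    The column names are taken from the comment above; whether the real
    file uses exactly these headers is an assumption.
    """
    mapping = {}
    for row in rows:
        task = (row.get(task_col) or "").strip()
        phenomenon = (row.get(phenomenon_col) or "").strip()
        if task and phenomenon:
            # A task usually appears once per model; keep the first value seen.
            mapping.setdefault(task, phenomenon)
    return mapping


def split_mapped_and_missing(tasks, mapping):
    """Return (mapped tasks with their phenomenon, sorted list of unmapped tasks)."""
    mapped = {t: mapping[t] for t in tasks if t in mapping}
    missing = sorted(t for t in tasks if t not in mapping)
    return mapped, missing


# Toy stand-in for the leaderboard file (hypothetical content and layout).
SAMPLE_CSV = """model,probing dataset,linguistic phenomena
model-a,xpos,part-of-speech
model-b,xpos,part-of-speech
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
mapping = build_task_to_phenomenon(rows)

# Tasks from a custom result file; anything absent from the mapping is reported.
mapped, missing = split_mapped_and_missing(["xpos", "ewok-social-relations"], mapping)
print(mapped)
print(missing)
```

To run this on a real file, replace the in-memory sample with `csv.DictReader(open(path, newline=""))`, where `path` points to your local copy of the leaderboard file. Tasks that come back in `missing` are the ones the leaderboard file does not cover yet.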

We are also working on releasing the source code for the explorer and leaderboard. We'll keep you updated on that! :)