gab709 opened this issue 2 weeks ago
Hi @gab709
Thanks for your question.
You can map all tasks using the Holmes leaderboard file. In it, the column "probing dataset" corresponds to the specific task, and the column "linguistic phenomena" gives the corresponding phenomenon, for example xpos -> part-of-speech.
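As an illustration of the lookup described above, here is a minimal sketch that builds a task-to-phenomenon dictionary from the leaderboard file. It assumes the file is a CSV with a header row containing the columns "probing dataset" and "linguistic phenomena"; the function name `load_task_mapping` is hypothetical and not part of the Holmes code base.

```python
import csv
import io
from typing import Dict, TextIO

# Column names as described in the reply; the overall CSV layout of the
# leaderboard file is an assumption.
TASK_COLUMN = "probing dataset"
PHENOMENON_COLUMN = "linguistic phenomena"


def load_task_mapping(handle: TextIO) -> Dict[str, str]:
    """Map each probing dataset (task) to its linguistic phenomenon."""
    mapping: Dict[str, str] = {}
    for row in csv.DictReader(handle):
        task = row[TASK_COLUMN].strip()
        # A task may occur in several rows; keep the first phenomenon seen.
        mapping.setdefault(task, row[PHENOMENON_COLUMN].strip())
    return mapping


if __name__ == "__main__":
    # Hypothetical two-column sample mirroring the example from the reply.
    sample = io.StringIO(
        "probing dataset,linguistic phenomena\n"
        "xpos,part-of-speech\n"
    )
    print(load_task_mapping(sample))  # {'xpos': 'part-of-speech'}
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_task_mapping`; the path and encoding are assumptions.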
It is true that there are more tasks than those listed in the leaderboard files. We have already extended the benchmark with some additional tasks, but we do not yet have sufficient results to update the explorer at the moment.
Furthermore, we are also working on releasing the source code for the explorer and the leaderboard. We'll keep you updated on that! :)
Hello, I'm interested in evaluating custom models on your benchmark, and it would be helpful to have a file that maps each task to its corresponding phenomenon. I've tried to extract this information from the leaderboard file, but it appears that not all datasets are contained in it. Specifically, after evaluating Flash Holmes, I searched the leaderboard file for each task to find the related phenomenon, and 25 tasks are still missing.
List of missing tasks:
xpos missing
zorro-irregular-verb missing
gum-rst-cut-edu-relation-group missing
ewok-social-relations missing
gum-rst-cut-edu-relation missing
ewok-agent-properties missing
ewok-quantitative-properties missing
gum-rst-cut-edu-type missing
ewok-material-dynamics missing
ewok-material-properties missing
upos missing
gum-rst-cut-edu-depth missing
ewok-physical-dynamics missing
fuse-negation missing
ewok-physical-interactions missing
zorro-agreement_determiner_noun-across_1_adjective missing
ewok-social-properties missing
bioscope-negation missing
gum-rst-cut-edu-distance missing
gum-rst-cut-edu-successively missing
speculation missing
ewok-social-interactions missing
ewok-physical-relations missing
ewok-spatial-relations missing
gum-rst-cut-edu-count missing
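The check described in the question amounts to collecting every evaluated task name that has no entry in a task-to-phenomenon mapping. A minimal sketch, assuming the mapping is already available as a dictionary and the evaluated task names as a list; the function name `find_unmapped_tasks` is hypothetical, and how the task names are read from the Flash Holmes results is not covered here.

```python
from typing import Dict, Iterable, List


def find_unmapped_tasks(evaluated_tasks: Iterable[str],
                        task_to_phenomenon: Dict[str, str]) -> List[str]:
    """Return evaluated tasks without a phenomenon entry, in input order."""
    seen = set()
    missing: List[str] = []
    for task in evaluated_tasks:
        # Report each unmapped task only once, even if it is listed twice.
        if task not in task_to_phenomenon and task not in seen:
            missing.append(task)
            seen.add(task)
    return missing


if __name__ == "__main__":
    # Hypothetical inputs: the mapping entry follows the example in the
    # reply, the task names are taken from the list above.
    mapping = {"xpos": "part-of-speech"}
    tasks = ["xpos", "upos", "fuse-negation", "upos"]
    for task in find_unmapped_tasks(tasks, mapping):
        print(f"{task} missing")
```

Preserving input order keeps the report aligned with the order in which tasks were evaluated, which a plain set difference would not guarantee.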
It would be helpful to be able to replicate the analysis you perform in the explorer (https://holmes-explorer.streamlit.app/) with a custom result file. Thank you in advance! 😊