Latent Feature-based Data Splits

This project goes beyond the random train-test split by developing a more challenging data-splitting process that better evaluates generalisation performance. We derive the split from a model's internal representations: we cluster the representations and assign each cluster in its entirety to either the train set or the test set. Hate speech detection serves as the testing ground for developing the splitting method.
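
To make the idea concrete, here is a minimal sketch of a cluster-based split. It is an illustration under stated assumptions, not the code submitted in this PR: the function names `kmeans` and `latent_split`, the choice of k-means, the number of clusters, and the rule for picking test clusters are all hypothetical. The representations are toy 2-D vectors standing in for a model's hidden states, and the k-means is written in plain Python only to keep the example dependency-free.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def latent_split(representations, k=3, test_clusters=(0,), seed=0):
    """Cluster the representations and send whole clusters to train or test.

    Assumption: the clusters listed in `test_clusters` form the test set and
    all remaining clusters form the train set.
    """
    labels = kmeans(representations, k, seed=seed)
    train_idx = [i for i, lab in enumerate(labels) if lab not in test_clusters]
    test_idx = [i for i, lab in enumerate(labels) if lab in test_clusters]
    return train_idx, test_idx, labels


# Toy stand-in for hidden representations: three blobs of ten 2-D points.
rng = random.Random(1)
reps = [
    [rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)]
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(10)
]
train_idx, test_idx, labels = latent_split(reps, k=3, test_clusters=(0,))
```

Because whole clusters are assigned to one side, no cluster contributes examples to both the train and the test set, which is what makes the split harder than a random one.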

Authors

Maike Züfle m.s.zufle@sms.ed.ac.uk
Verna Dankers v.dankers@sms.ed.ac.uk
Ivan Titov ititov@inf.ed.ac.uk

Checklist:

[x] I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
[x] Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
[x] I have read the description of what should be in the doc.md of my task, and have added the required arguments.
[x] I have submitted or will submit an accompanying paper to the GenBench workshop.

[Task Submission] Hate Speech Detection (`latent_feature_splits`) #37
