Open ir2718 opened 4 months ago
We are surely interested in coordinating efforts!
From what I know, there is a Serbian semantic textual similarity dataset https://reldi.spur.uzh.ch/blog/serbian-semantic-textual-similarity-news-corpus/, so that could be a starting point.
Overall, I wonder whether we need more complex benchmarks nowadays, but ones where the response is still simple for humans (yes/no). We are currently holding the DIALECT-COPA shared task: https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa.
The decision is binary, but the task is super-hard even for the latest and greatest LLMs (on the Cerkno dialect, even the overall best-performing GPT-4 is just above the random baseline).
I'm aware of the existence of a Serbian STS dataset, although the non-English STS benchmark datasets I've found almost exclusively apply MT to STSB, with or without human correction (e.g. Korean, Turkish, Romanian, Swedish). Besides, I've chosen Croatian as it's my native language and I can do the corrections myself. It's a bit trickier for other Balkan languages, as it would require native speakers of those languages as well, and I'm not aware of any enthusiasts willing to work on this.
Anyway, I'll open a PR once I'm finished with the corrections and benchmarks, so if you think it's a good addition, feel free to merge it.
Hello!
I've noticed the lack of resources for STS in Balkan languages, so I would like to contribute. I'm planning on using machine translation to translate STSB to Croatian, and then manually going over all examples to correct possible mistakes, since I'm a native speaker.
Would you be interested in adding the final dataset to the benchmark?
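To make the plan concrete, here is a rough sketch of the translate-then-correct workflow. It is only an illustration: `translate` is a stand-in for whichever MT system ends up being used, the class and field names are hypothetical, and it assumes the original STSB similarity scores are carried over unchanged to the translated pairs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class TranslatedPair:
    """One STS example with its MT output and optional manual fixes (hypothetical layout)."""
    source1: str
    source2: str
    score: float                  # gold score copied from the source dataset (assumption)
    mt1: str
    mt2: str
    fixed1: Optional[str] = None  # None means the MT output was accepted as-is
    fixed2: Optional[str] = None

    def final(self) -> Tuple[str, str]:
        """Return the sentences to release: the manual fix if present, else the MT output."""
        first = self.fixed1 if self.fixed1 is not None else self.mt1
        second = self.fixed2 if self.fixed2 is not None else self.mt2
        return first, second


def translate_pairs(
    rows: Iterable[Tuple[str, str, float]],
    translate: Callable[[str], str],
) -> List[TranslatedPair]:
    """Run the (placeholder) MT function over both sentences of every pair."""
    return [
        TranslatedPair(s1, s2, score, translate(s1), translate(s2))
        for s1, s2, score in rows
    ]


def correction_rate(pairs: List[TranslatedPair]) -> float:
    """Fraction of pairs in which at least one sentence was manually corrected."""
    if not pairs:
        return 0.0
    changed = sum(1 for p in pairs if p.fixed1 is not None or p.fixed2 is not None)
    return changed / len(pairs)
```

Keeping the raw MT output next to the corrected text would also make it possible to report how many examples needed manual fixes.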