Interest in a benchmark for STS

clarinsi / benchich

BENCHić - the benchmark for Bosnian, Croatian, Montenegrin, Serbian (and friends)


Interest in a benchmark for STS #2

Open ir2718 opened 4 months ago

ir2718 commented 4 months ago

Hello!

I've noticed the lack of resources for STS in Balkan languages, so I would like to contribute. I'm planning on using machine translation to translate STSB to Croatian, and then manually go over all examples to correct possible mistakes since I'm a native speaker.
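A minimal sketch of that translate-then-correct workflow, assuming STSB rows are available as (sentence1, sentence2, score) tuples. The `translate` callable is a hypothetical stand-in for whatever MT system ends up being used, and the column names are illustrative only. It just prepares a sheet for manual review:

```python
import csv
import io
from typing import Callable, Iterable, Tuple

Row = Tuple[str, str, float]  # (sentence1, sentence2, gold similarity score)

def build_review_sheet(rows: Iterable[Row], translate: Callable[[str], str], out) -> int:
    """Write a TSV with the English source, raw MT output, and empty columns for manual fixes."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["en_1", "en_2", "hr_mt_1", "hr_mt_2", "hr_fixed_1", "hr_fixed_2", "score"])
    count = 0
    for s1, s2, score in rows:
        # The gold score is carried over unchanged; only the text is translated.
        writer.writerow([s1, s2, translate(s1), translate(s2), "", "", score])
        count += 1
    return count

def fake_translate(text: str) -> str:
    # Hypothetical placeholder: a real MT call would go here.
    return "[hr] " + text

sample = [("A man is playing a guitar.", "A man plays the guitar.", 4.8)]
buffer = io.StringIO()
n_written = build_review_sheet(sample, fake_translate, buffer)
sheet_lines = buffer.getvalue().splitlines()
```

Keeping the raw MT output next to the corrected text makes it possible to report later how many examples needed manual changes.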

Would you be interested in adding the final dataset to the benchmark?

nljubesi commented 4 months ago

We are surely interested in coordinating efforts!

From what I know, there is a Serbian semantic textual similarity dataset (https://reldi.spur.uzh.ch/blog/serbian-semantic-textual-similarity-news-corpus/), so that could be a starting point.

Overall, I wonder whether we don't need more complex benchmarks nowadays, ones where the response is still simple for humans (yes/no). We are just now holding the DIALECT-COPA shared task: https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa.

The decision is binary, but the task is super-hard even for the latest and greatest LLMs (on the Cerkno dialect, even the overall best-performing GPT-4 scores just above the random baseline).

ir2718 commented 4 months ago

I'm aware of the Serbian STS dataset, but almost all of the non-English STS benchmark datasets I've found use MT on STSB, with or without human correction (e.g. Korean, Turkish, Romanian, Swedish). Besides, I've chosen Croatian because it's my native language and I can do the corrections myself. Other Balkan languages are trickier, since they would require native speakers of those languages as well, and I'm not aware of any enthusiasts willing to work on this.

Anyways, I'll open a PR once I'm finished with corrections and benchmarks, so if you think it's a good addition feel free to merge it.
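For the benchmark numbers, STS systems are commonly scored by the correlation between predicted and gold similarity scores. Below is a rough pure-Python sketch of Spearman rank correlation, assuming that is the metric reported. The function names are illustrative and not taken from any existing evaluation script:

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Spearman correlation is the Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(gold), average_ranks(pred))
```

Because only the ranking matters, a model whose scores are on a different scale than the gold labels (for example cosine similarity versus 0 to 5) can still be compared directly.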