Lacking several datasets from ScandEval

KennethEnevoldsen / scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings

https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/

MIT License


Lacking several datasets from ScandEval #173

Closed tollefj closed 7 months ago

tollefj commented 7 months ago

It's been a while since I last ran experiments with seb (I much prefer this interface to ScandEval's own, since it gives more control over model configurations). Now, however, some datasets seem to be missing from ScandEval, such as ScandEval/norquad-mini and scala-da. Perhaps they were removed for licensing reasons; I don't know.

To avoid future problems with datasets, perhaps it would be an idea to create them from the originals instead of hosting subsets?
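As a minimal sketch of what "creating them from the originals" could look like: a subset is derived from the full source data with a fixed seed, so anyone can regenerate the same subset instead of depending on a hosted copy. The function name `make_subset`, the seed value, and the toy `source` records are all hypothetical and are not part of SEB or ScandEval.

```python
import random


def make_subset(records, n, seed=4242):
    """Deterministically pick n records from the full source data.

    Hypothetical helper: the name, signature, and default seed are
    assumptions for illustration, not an existing SEB or ScandEval API.
    """
    if n > len(records):
        raise ValueError("subset size exceeds the number of source records")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    # Sample indices without replacement, then sort to keep source order.
    indices = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in indices]


# Toy stand-in for an original dataset (assumption: a list of dicts).
source = [{"id": i, "text": f"example {i}"} for i in range(1000)]
subset = make_subset(source, n=100)
```

Because the seed and the selection rule live in code, the subset can be rebuilt from the original data if a hosted copy disappears, provided the source data and its ordering are unchanged.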

KennethEnevoldsen commented 7 months ago

@tollefj thanks for raising this concern, and thanks for the compliment! I will look into fixing this issue on Monday.

tollefj commented 7 months ago

I would gladly help out if there's an agreement on how to handle the data :) I envision something along the lines of how ScandEval did it. I just believe it should be clearer how to go from the source data to the subsets (if subsets are even desired?).

I find the scandeval implementation's abstraction level to be a bit too high, having to track down the evals through what feels like hundreds of files.

KennethEnevoldsen commented 7 months ago

That would be great @tollefj. I have already re-uploaded the dataset for MTEB, so it should simply be a matter of replacing the links. You are more than welcome to open a PR for it.

Re. the complexity of ScandEval: ScandEval makes a different trade-off than SEB (focusing especially on robustness). For that, it pays a cost in complexity and in how fast the benchmark is to run.

KennethEnevoldsen commented 7 months ago

@tollefj I added the fixes in #174. Assuming everything passes, they will be merged automatically.