embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Suggestion for clustering dataset (legislative texts) #744

Closed rbroc closed 6 months ago

rbroc commented 6 months ago

Just stumbled upon this dataset: https://huggingface.co/datasets/dreamproit/bill_labels_us, which has lots of US Congress bills labeled by policy area. I probably won't have the time to add this, but thought it could be a suggestion if folks are looking for inspiration (feel free to close if not relevant).

Not a new language, but looking at the existing clustering datasets, it seems like this would be quite a new domain.

It could also be a classification task, but clustering seems more interesting (and there is no natural train/dev/test split).

x-tabdeveloping commented 6 months ago

Unfortunately, the period for dataset submissions ended yesterday, so I'm closing this for now.