Just stumbled upon this dataset: https://huggingface.co/datasets/dreamproit/bill_labels_us, which has lots of US Congress bills labeled by policy area.
I probably won't have the time to add this, but thought it could be a suggestion if folks are looking for inspiration (feel free to close if not relevant).
Not a new language, but looking at the existing clustering datasets, it seems like it would be quite a new domain.
It could also be a classification task, but clustering seems more interesting (and there is no natural train/dev/test split).