Closed by isaac-chung 6 months ago
For multilingual datasets you could do something like:

```python
def dataset_transform(self):
    for lang in self.langs:  # it's self.hf_subsets now I think
        self.dataset[lang] = self.stratified_subsampling(
            self.dataset[lang],
            self.seed,
            self.metadata.eval_splits,
            label="labels",
            n_samples=2048,
        )
```
Do you find doing this in a loop redundant?
No strong preference. If you'd prefer handling all languages at once, you may need to add a check on which languages to subsample, since not every language will have more than 2048 samples.
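To make the caveat concrete, here is a minimal, self-contained sketch of a stratified subsampler that caps the sample count at whatever a language actually has, so languages with fewer than 2048 examples don't break. This is a standalone illustration, not the actual `stratified_subsampling` in mteb; the function name and signature are assumptions.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n_samples, seed=42):
    """Pick up to n_samples examples, keeping label proportions roughly intact.

    Hypothetical helper: caps n_samples at len(examples), so languages
    with fewer samples than the target are returned (nearly) whole.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    # Cap at the available size instead of erroring on small languages.
    n_samples = min(n_samples, len(examples))
    picked = []
    for lab, group in by_label.items():
        # Proportional share of the budget for this label, at least 1.
        k = max(1, round(n_samples * len(group) / len(examples)))
        rng.shuffle(group)
        picked.extend(group[:k])
    # Rounding can overshoot by a few; trim back to the budget.
    return picked[:n_samples]
```

With this cap in place, the per-language loop can pass the same `n_samples=2048` for every language and let the helper shrink it where needed.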
Sometimes the simplest solutions are right under our noses! I think this works better. It would be better if we separate out the languages in the subsampling and set a slightly lower sample count where needed.
Thanks!
While working on #742, the following error is shown when running:

```
mteb -t WikiClusteringFastP2P -m sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
```
Code:
Error:
Starting to notice that, to handle multilingual datasets, the method will need access to `self.is_multilingual` and `self.metadata_dict["eval_langs"]`. So part of me would like to convert this from a static method to a class method. Any thoughts on this?

cc @imenelydiaker @KennethEnevoldsen @x-tabdeveloping or anyone else who's interested.