UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.86k stars 2.44k forks

the multi dataset training is not working. TypeError: unhashable type: 'list' #2797

Open imrankh46 opened 2 months ago

imrankh46 commented 2 months ago

@tomaarsen hello, I am using the official Sentence Transformers example for multi-dataset training and it shows the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 2
      1 # start training, the model will be automatically saved to the hub and the output directory
----> 2 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2178, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2175     rng_to_sync = True
   2177 step = -1
-> 2178 for step, inputs in enumerate(epoch_iterator):
   2179     total_batched_samples += 1
   2181     if self.args.include_num_input_tokens_seen:

File /usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py:464, in DataLoaderShard.__iter__(self)
    462 if self.device is not None:
    463     current_batch = send_to_device(current_batch, self.device, non_blocking=self._non_blocking)
--> 464 next_batch = next(dataloader_iter)
    465 if batch_index >= self.skip_batches:
    466     yield current_batch

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
--> 674     index = self._next_index()  # may raise StopIteration
    675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
    620 def _next_index(self):
--> 621     return next(self._sampler_iter)

File /usr/local/lib/python3.10/dist-packages/sentence_transformers/sampler.py:211, in ProportionalBatchSampler.__iter__(self)
    209 for dataset_idx in dataset_idx_sampler:
    210     sample_offset = sample_offsets[dataset_idx]
--> 211     yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])]

File /usr/local/lib/python3.10/dist-packages/sentence_transformers/sampler.py:125, in NoDuplicatesBatchSampler.__iter__(self)
    123 batch_indices = []
    124 for index in remaining_indices:
--> 125     sample_values = set(self.dataset[index].values())
    126     if sample_values & batch_values:
    127         continue

TypeError: unhashable type: 'list'
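For context on the last frame: `NoDuplicatesBatchSampler` deduplicates samples by putting each row's column values into a `set`, and set members must be hashable. Python lists are not hashable, so any dataset column whose values are lists triggers exactly this error. A minimal sketch (the sample dicts are illustrative):

```python
# A row whose columns are all strings works with set()-based deduplication;
# a row with a list-valued column does not, because lists are unhashable.
sample_ok = {"anchor": "a photo of a cat", "positive": "a cat"}
sample_bad = {"anchor": "a photo of a cat", "positive": ["a cat", "a kitten"]}

set(sample_ok.values())  # fine: strings are hashable

try:
    set(sample_bad.values())  # the list value raises here
except TypeError as exc:
    print(exc)  # → unhashable type: 'list'
```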
imrankh46 commented 2 months ago

The main example is here:

https://sbert.net/docs/sentence_transformer/training_overview.html

Can you provide an example for the Sentence Transformers v3 version? Also, this is not working for classification, pair classification, and clustering.

tomaarsen commented 2 months ago

Hello!

Interesting. I'm OOO right now so I can't verify, but it seems like one of the training dataset columns has a list of data instead of a string. A column with a list of data is not supported.
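One way to work around this is to detect list-valued columns and flatten them into one row per list element before training. A plain-Python sketch, assuming in-memory rows; with a Hugging Face `datasets.Dataset` you would do the same flattening via `.map()` or before constructing the dataset. The helper names here are illustrative, not part of the library:

```python
def find_list_columns(rows):
    """Return the names of columns whose values are lists (unsupported by the sampler)."""
    cols = set()
    for row in rows:
        for name, value in row.items():
            if isinstance(value, list):
                cols.add(name)
    return cols

def explode_list_column(rows, column):
    """Turn one row with a list in `column` into one string row per list element."""
    out = []
    for row in rows:
        values = row[column] if isinstance(row[column], list) else [row[column]]
        for value in values:
            out.append({**row, column: value})
    return out

rows = [{"anchor": "a photo of a cat", "positive": ["a cat", "a kitten"]}]
print(find_list_columns(rows))            # → {'positive'}
print(explode_list_column(rows, "positive"))
# → [{'anchor': 'a photo of a cat', 'positive': 'a cat'},
#    {'anchor': 'a photo of a cat', 'positive': 'a kitten'}]
```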

imrankh46 commented 2 months ago

> Hello!
>
> Interesting. I'm OOO right now so I can't verify, but it seems like one of the training dataset columns has a list of data instead of a string. A column with a list of data is not supported.
>
>   • Tom Aarsen

Also, for the classification task it's showing the following error about the embedding and the length of the labels.

Thank you for the clarification.

tomaarsen commented 2 months ago

I don't think I'm following. Could you elaborate or add the full error? Or perhaps a reproducible example?

imrankh46 commented 2 months ago

@tomaarsen hi Tom. I am using multi-dataset training with the following losses:

# MultipleNegativesSymmetricRankingLoss is used for reranking
rerank_loss = MultipleNegativesSymmetricRankingLoss(model)
# MultipleNegativesSymmetricRankingLoss is used for retrieval [title, text]
retrieval_loss = MultipleNegativesSymmetricRankingLoss(model)
# CoSENTLoss is used for STS
sts_loss = CoSENTLoss(model)

But when I use a batch size of 64 it crashes. I am using 6 x H100 SXM GPUs. Any tips for training with a large batch size?
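For context on what "batch size 64" means on 6 GPUs: with data-parallel training each device computes its loss on its own mini-batch, and the global batch also scales with gradient accumulation. A quick sketch of how the standard `transformers` arguments `per_device_train_batch_size` and `gradient_accumulation_steps` compose (the accumulation value here is hypothetical):

```python
# With DDP, each GPU holds its own mini-batch, so the in-batch-negatives
# similarity matrix is per_device_batch_size x per_device_batch_size per device,
# while the effective (global) batch size multiplies across GPUs and
# accumulation steps.
per_device_batch_size = 64
num_gpus = 6
grad_accum_steps = 4  # hypothetical value

effective_batch_size = per_device_batch_size * num_gpus * grad_accum_steps
print(effective_batch_size)  # → 1536
```

If the crash is an out-of-memory error, lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size constant; the library also offers cached loss variants (e.g. `CachedMultipleNegativesRankingLoss`) designed for large in-batch-negative batches at reduced memory cost, though whether one fits this exact setup is worth verifying against the docs.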