UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.86k stars 2.44k forks

the multi dataset training is not working. TypeError: unhashable type: 'list' #2797

Open imrankh46 opened 2 months ago

imrankh46 commented 2 months ago

@tomaarsen hello, I am using the official Sentence Transformers example for multi-dataset training and it shows the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 2
      1 # start training, the model will be automatically saved to the hub and the output directory
----> 2 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2178, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2175     rng_to_sync = True
   2177 step = -1
-> 2178 for step, inputs in enumerate(epoch_iterator):
   2179     total_batched_samples += 1
   2181     if self.args.include_num_input_tokens_seen:

File /usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py:464, in DataLoaderShard.__iter__(self)
    462 if self.device is not None:
    463     current_batch = send_to_device(current_batch, self.device, non_blocking=self._non_blocking)
--> 464 next_batch = next(dataloader_iter)
    465 if batch_index >= self.skip_batches:
    466     yield current_batch

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
--> 674     index = self._next_index()  # may raise StopIteration
    675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
    620 def _next_index(self):
--> 621     return next(self._sampler_iter)

File /usr/local/lib/python3.10/dist-packages/sentence_transformers/sampler.py:211, in ProportionalBatchSampler.__iter__(self)
    209 for dataset_idx in dataset_idx_sampler:
    210     sample_offset = sample_offsets[dataset_idx]
--> 211     yield [idx + sample_offset for idx in next(batch_samplers[dataset_idx])]

File /usr/local/lib/python3.10/dist-packages/sentence_transformers/sampler.py:125, in NoDuplicatesBatchSampler.__iter__(self)
    123 batch_indices = []
    124 for index in remaining_indices:
--> 125     sample_values = set(self.dataset[index].values())
    126     if sample_values & batch_values:
    127         continue

TypeError: unhashable type: 'list'
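For context on the last frame: `NoDuplicatesBatchSampler` deduplicates samples by putting each row's column values into a `set`, and set members must be hashable. Python lists are not hashable, so any dataset column whose values are lists triggers exactly this error. A minimal sketch (the sample dicts are illustrative):

```python
# A row whose columns are all strings works with set()-based deduplication;
# a row with a list-valued column does not, because lists are unhashable.
sample_ok = {"anchor": "a photo of a cat", "positive": "a cat"}
sample_bad = {"anchor": "a photo of a cat", "positive": ["a cat", "a kitten"]}

set(sample_ok.values())  # fine: strings are hashable

try:
    set(sample_bad.values())  # the list value raises here
except TypeError as exc:
    print(exc)  # → unhashable type: 'list'
```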
imrankh46 commented 2 months ago

The main example is here:

https://sbert.net/docs/sentence_transformer/training_overview.html

Can you provide an example for the Sentence Transformers v3 version? Also, this is not working for classification, pair classification, and clustering.

tomaarsen commented 2 months ago

Hello!

Interesting. I'm OOO right now so I can't verify, but it seems like one of the training dataset columns has a list of data instead of a string. A column with a list of data is not supported.
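One way to work around this is to detect list-valued columns and flatten them into one row per list element before training. A plain-Python sketch, assuming in-memory rows; with a Hugging Face `datasets.Dataset` you would do the same flattening via `.map()` or before constructing the dataset. The helper names here are illustrative, not part of the library:

```python
def find_list_columns(rows):
    """Return the names of columns whose values are lists (unsupported by the sampler)."""
    cols = set()
    for row in rows:
        for name, value in row.items():
            if isinstance(value, list):
                cols.add(name)
    return cols

def explode_list_column(rows, column):
    """Turn one row with a list in `column` into one string row per list element."""
    out = []
    for row in rows:
        values = row[column] if isinstance(row[column], list) else [row[column]]
        for value in values:
            out.append({**row, column: value})
    return out

rows = [{"anchor": "a photo of a cat", "positive": ["a cat", "a kitten"]}]
print(find_list_columns(rows))            # → {'positive'}
print(explode_list_column(rows, "positive"))
# → [{'anchor': 'a photo of a cat', 'positive': 'a cat'},
#    {'anchor': 'a photo of a cat', 'positive': 'a kitten'}]
```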

imrankh46 commented 2 months ago

> Hello!
>
> Interesting. I'm OOO right now so I can't verify, but it seems like one of the training dataset columns has a list of data instead of a string. A column with a list of data is not supported.
>
>   • Tom Aarsen

Also, for the classification task it's showing the following error about the embedding and the length of the labels.

Thank you for the clarification.

tomaarsen commented 2 months ago

I don't think I'm following. Could you elaborate or add the full error? Or perhaps a reproducible example?

imrankh46 commented 2 months ago

@tomaarsen hi Tom. I am using multi-dataset training with the following losses:

# MultipleNegativesSymmetricRankingLoss is used for reranking
rerank_loss = MultipleNegativesSymmetricRankingLoss(model)
# MultipleNegativesSymmetricRankingLoss is used for retrieval [title, text]
retrieval_loss = MultipleNegativesSymmetricRankingLoss(model)
# CoSENTLoss is used for STS
sts_loss = CoSENTLoss(model)

But when I use a batch size of 64 it crashes. I am using 6 x H100 SXM GPUs. Any tips for training with a large batch size?
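For context on what "batch size 64" means on 6 GPUs: with data-parallel training each device computes its loss on its own mini-batch, and the global batch also scales with gradient accumulation. A quick sketch of how the standard `transformers` arguments `per_device_train_batch_size` and `gradient_accumulation_steps` compose (the accumulation value here is hypothetical):

```python
# With DDP, each GPU holds its own mini-batch, so the in-batch-negatives
# similarity matrix is per_device_batch_size x per_device_batch_size per device,
# while the effective (global) batch size multiplies across GPUs and
# accumulation steps.
per_device_batch_size = 64
num_gpus = 6
grad_accum_steps = 4  # hypothetical value

effective_batch_size = per_device_batch_size * num_gpus * grad_accum_steps
print(effective_batch_size)  # → 1536
```

If the crash is an out-of-memory error, lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size constant; the library also offers cached loss variants (e.g. `CachedMultipleNegativesRankingLoss`) designed for large in-batch-negative batches at reduced memory cost, though whether one fits this exact setup is worth verifying against the docs.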