beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Training a T5 model #51

Open pritamdeka opened 2 years ago

pritamdeka commented 2 years ago

Hi @NThakur20 I was wondering whether we can train a T5 model, since when I loaded a T5 model from HF I ran into an error.

thakur-nandan commented 2 years ago

Hi @pritamdeka,

Could you give a bit more context on the points below:

  1. Which model exactly did you use? A link to the model, if available.
  2. Which code did you use to train the model?
  3. What error did you encounter during training?
  4. If possible, a sample code snippet to reproduce your error.

Thank you!

Kind Regards, Nandan Thakur

pritamdeka commented 2 years ago

Hi @NThakur20, thanks for the reply.

I used the train_sbert_BM25_hardnegs.py file for training on the SciFact dataset. I made a few changes for the T5 model, such as changing the following line:

word_embedding_model = models.Transformer(model_name, max_seq_length=300)

to

word_embedding_model = models.T5.T5(model_name, max_seq_length=300)

Also, the model I used is castorini/monot5-base-msmarco.
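For reference, the script builds the model roughly like this (reconstructed from memory, so the exact lines may differ slightly); the T5 line above was my only change to this block:

from sentence_transformers import SentenceTransformer, models

model_name = "castorini/monot5-base-msmarco"

# Setup as in train_sbert_BM25_hardnegs.py (reconstructed; standard SBERT composition):
word_embedding_model = models.Transformer(model_name, max_seq_length=300)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# My change: replace the first line with the legacy T5 module
# word_embedding_model = models.T5.T5(model_name, max_seq_length=300)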

The error is something like this:

Epoch:   0% 0/1 [00:00<?, ?it/s]
Iteration:   0% 0/1121 [00:00<?, ?it/s]
Epoch:   0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/beir/examples/retrieval/training/train_sbert_BM25_hardnegs.py", line 132, in <module>
    use_amp=True)
  File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/train.py", line 148, in fit
    callback=callback, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 682, in fit
    data = next(data_iterator)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/NoDuplicatesDataLoader.py", line 41, in __iter__
    yield self.collate_fn(batch) if self.collate_fn is not None else batch
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 534, in smart_batching_collate
    tokenized = self.tokenize(texts[idx])
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 311, in tokenize
    return self._first_module().tokenize(texts)
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/T5.py", line 55, in tokenize
    return self.tokenizer.encode(self.task_identifier+text)
TypeError: can only concatenate str (not "list") to str
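If it helps, my reading of the traceback is that the newer sentence-transformers collate function passes a list of strings (one batch column) to tokenize(), while the legacy models.T5 module builds task_identifier + text for a single string, hence the TypeError. A tiny sketch of the failing operation (the prefix value is just a placeholder):

# Sketch of the failing concatenation from the traceback above.
# "query: " is a placeholder; the real prefix comes from the T5 module's task_identifier.
task_identifier = "query: "
texts = ["first input text", "second input text"]  # what tokenize() actually receives

try:
    task_identifier + texts  # str + list
except TypeError as err:
    print(err)  # can only concatenate str (not "list") to str

So it looks like the legacy T5 module predates the current batched tokenize() interface, which would explain the mismatch.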

rahmanidashti commented 1 year ago

Have you found a solution for this? Are there any T5-based models currently implemented in BEIR?
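For what it's worth, T5-based dense encoders released in sentence-transformers format (for example the GTR family) seem to load through BEIR's standard SentenceBERT wrapper. A rough, untested sketch (the model name and dot-product scoring are just my guesses for a reasonable setup, assuming a sentence-transformers version recent enough to load T5 encoders):

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load a BEIR dataset (SciFact here, to match the thread).
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# One example of a T5-based dense encoder on the HF Hub (assumption: loadable
# as a sentence-transformers model by the installed version).
model = DRES(models.SentenceBERT("sentence-transformers/gtr-t5-base"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)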