Closed: aRookieMan closed this issue 4 years ago
Hi @aRookieMan, for inference, the encode method already implements this.
For training, the changes should be easy.
Take this example script: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark.py
It uses the SentencesDataset:
train_dataset = SentencesDataset(sts_reader.get_examples('sts-train.csv'), model)
If you use the SentencesDataset (https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/SentencesDataset.py) in your training script, I recommend deriving a class from it.
In line 56, you have:
tokenized_texts = [model.tokenize(text) for text in example.texts]
Just change that line, for example to:
tokenized_texts = [text for text in example.texts]
Then you can pass pre-tokenized inputs to your derived dataset class:
# This step could be saved to disk
tokenized_examples = sts_reader.get_examples('sts-train.csv')
for example in tokenized_examples:
    # Replace each raw text with its tokenized form
    example.texts = [model.tokenize(text) for text in example.texts]
# Pass the tokenized examples to your new TokenizedSentencesDataset class
train_dataset = TokenizedSentencesDataset(tokenized_examples, model)
Best Nils Reimers
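The subclassing pattern described above can be sketched without the library. In this minimal, self-contained sketch, `Example`, `BaseSentencesDataset`, and `toy_tokenize` are hypothetical stand-ins, not the sentence-transformers classes; only the name `TokenizedSentencesDataset` comes from the reply above. The base class tokenizes in one overridable method, and the derived class passes pre-tokenized inputs through unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    # Hypothetical stand-in for an input example: raw strings or token-id lists.
    texts: list
    label: float


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical tokenizer: maps each whitespace-separated word to its length.
    return [len(word) for word in text.split()]


class BaseSentencesDataset:
    """Stand-in for a dataset that tokenizes every example on construction."""

    def __init__(self, examples: Sequence[Example], tokenize: Callable[[str], List[int]]):
        self.tokenize = tokenize
        self.tokens = [self.convert(example) for example in examples]
        self.labels = [example.label for example in examples]

    def convert(self, example: Example) -> List[List[int]]:
        # The expensive step: runs the tokenizer on every text.
        return [self.tokenize(text) for text in example.texts]

    def __len__(self) -> int:
        return len(self.tokens)

    def __getitem__(self, index: int):
        return self.tokens[index], self.labels[index]


class TokenizedSentencesDataset(BaseSentencesDataset):
    """Derived class that expects texts that are already lists of token ids."""

    def convert(self, example: Example) -> List[List[int]]:
        # Inputs are pre-tokenized, so pass them through unchanged.
        return [list(ids) for ids in example.texts]


raw = [Example(texts=["a quick test", "another one"], label=0.8)]
# Tokenize once, up front (this result could be stored and reused).
pre_tokenized = [Example([toy_tokenize(t) for t in ex.texts], ex.label) for ex in raw]

online = BaseSentencesDataset(raw, toy_tokenize)
offline = TokenizedSentencesDataset(pre_tokenized, toy_tokenize)
```

Both datasets yield the same items, but only `online` calls the tokenizer during construction. In the real library the overridden line would be the one quoted from SentencesDataset above.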
Thanks!!! I have never seen such a rapid reply!
Hi @aRookieMan, I am happy to help :)
I just added a parameter to the SentencesDataset locally that allows passing pre-tokenized texts. It is not yet pushed to git, but it will be part of the next release (0.3.3).
Best, Nils
@nreimers You may want to watch out for https://github.com/UKPLab/sentence-transformers/blob/067ca0e2fa8f14765520c83014b11ab45054c3c2/sentence_transformers/readers/InputExample.py#L22 if the inputs are BERT ids :)
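If the linked line strips whitespace from each text via `text.strip()` (an assumption about that commit, not confirmed in this thread), then token-id lists would raise an `AttributeError` there. A minimal sketch of a guard follows; `clean_texts` is a hypothetical helper, not part of the library.

```python
from typing import List, Union

Text = Union[str, List[int]]


def clean_texts(texts: List[Text]) -> List[Text]:
    # Hypothetical helper: strip only real strings, leave token-id lists untouched.
    return [text.strip() if isinstance(text, str) else text for text in texts]


# A list of ints has no .strip(), so an unguarded call fails.
try:
    [text.strip() for text in [[101, 7592, 102]]]
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True

cleaned = clean_texts(["  hello  ", [101, 7592, 102]])
```

The `isinstance` check keeps the behavior for plain strings and simply skips the stripping step for pre-tokenized inputs.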
The repo and the documentation are very useful! After our comparison, we decided to use it in our project. One problem remains: we have a large amount of text, and your code tokenizes the texts on every run, which takes about 2 hours. Can we first convert the texts into BERT token ids and then train without this online tokenization step? I tried to modify the code, but too much of it needs to change. Do you have plans to provide a solution for this situation?
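One way to avoid repeating a long tokenization pass is to tokenize once and cache the token ids on disk. The sketch below is a minimal illustration under stated assumptions: `load_or_tokenize` and `counting_tokenize` are hypothetical names, the tokenizer is a toy stand-in for `model.tokenize`, and the cache is a plain pickle file.

```python
import os
import pickle
import tempfile
from typing import Callable, List


def load_or_tokenize(
    texts: List[str],
    tokenize: Callable[[str], List[int]],
    cache_path: str,
) -> List[List[int]]:
    # Hypothetical helper: reuse cached token ids if the cache file exists.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    # Otherwise pay the tokenization cost once and store the result.
    token_ids = [tokenize(text) for text in texts]
    with open(cache_path, "wb") as handle:
        pickle.dump(token_ids, handle)
    return token_ids


calls = []


def counting_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in for model.tokenize that records each call.
    calls.append(text)
    return [ord(ch) for ch in text]


cache_file = os.path.join(tempfile.mkdtemp(), "token_ids.pkl")
first = load_or_tokenize(["ab", "c"], counting_tokenize, cache_file)
second = load_or_tokenize(["ab", "c"], counting_tokenize, cache_file)
```

The second call reads the cache and never invokes the tokenizer. Note that this sketch does not detect changes to the input texts or the tokenizer, so the cache file has to be deleted manually when either changes.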