UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can we preprocess tokenization to speed up training? #334

Closed · aRookieMan closed this 4 years ago

aRookieMan commented 4 years ago

The repo and the documentation are very useful! After a comparison, we decided to use it in our project. The only issue is that we have a large amount of text, and your code tokenizes the texts every time, which takes about 2 hours. Can we first convert the texts to BERT token ids and then train without this on-the-fly tokenization? I tried to modify the code, but too many places need to change. Do you have plans to support this situation?

nreimers commented 4 years ago

Hi @aRookieMan, for inference (i.e., the encode method) this is already implemented.
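
For context, a minimal inference sketch (the model name and sentences are placeholders; batch_size and show_progress_bar are existing encode parameters):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['This is an example sentence.', 'Each sentence is tokenized and batched internally.']

# encode handles tokenization and batching, so no manual pre-tokenization is needed
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)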

For training, the changes should be easy.

Take, for example, this training script: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark.py

We use the SentencesDataset:

train_dataset = SentencesDataset(sts_reader.get_examples('sts-train.csv'), model)
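
For context, in the linked script that dataset then feeds a DataLoader and a loss; a condensed sketch (reader and model setup omitted):

from torch.utils.data import DataLoader
from sentence_transformers import losses

# Wrap the dataset in a DataLoader and train with a cosine-similarity objective
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4)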

If you use the SentencesDataset in your training script (https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/SentencesDataset.py), I recommend deriving a class from it.

In line 56, you have:

tokenized_texts = [model.tokenize(text) for text in example.texts]

Just change that line, for example to:

tokenized_texts = example.texts  # texts already contain token ids, so skip tokenization
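
A minimal sketch of what such a derived class could look like; the body here is illustrative (the real SentencesDataset also handles labels and sequence lengths), and it assumes each InputExample's texts already hold token ids:

from torch.utils.data import Dataset

class TokenizedSentencesDataset(Dataset):
    # Illustrative sketch, not the library's actual class
    def __init__(self, examples, model):
        self.model = model
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        # No model.tokenize() call here: the texts are pre-tokenized
        return example.texts, example.label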

Then you can pass the pre-tokenized inputs to your dataset class:

# This step could be done once and the result saved to disk
examples = sts_reader.get_examples('sts-train.csv')
for example in examples:
    example.texts = [model.tokenize(text) for text in example.texts]

# Pass the pre-tokenized examples to your new TokenizedSentencesDataset class
train_dataset = TokenizedSentencesDataset(examples, model)
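
Since the comment above suggests saving this step to disk, a sketch using pickle (the file name is a placeholder):

import pickle

# Save the pre-tokenized examples once ...
with open('sts-train-tokenized.pkl', 'wb') as f_out:
    pickle.dump(examples, f_out)

# ... and load them in later runs, skipping tokenization entirely
with open('sts-train-tokenized.pkl', 'rb') as f_in:
    examples = pickle.load(f_in)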

Best,
Nils Reimers

aRookieMan commented 4 years ago

Thanks!!! I have never seen such a rapid reply!

nreimers commented 4 years ago

Hi @aRookieMan I am happy to help :)

I just added a parameter locally to the SentencesDataset that allows passing pre-tokenized texts. It is not yet pushed to git, but it will be part of the next release (0.3.3).
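
Usage might then look like the following; the parameter name is a guess for illustration, so check the 0.3.3 release for the actual signature:

# Hypothetical parameter name; the released API may differ
train_dataset = SentencesDataset(examples, model, is_pretokenized=True)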

Best,
Nils

aRookieMan commented 4 years ago

@nreimers You may want to watch out for https://github.com/UKPLab/sentence-transformers/blob/067ca0e2fa8f14765520c83014b11ab45054c3c2/sentence_transformers/readers/InputExample.py#L22 if the input is BERT ids :)
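
Presumably the linked line strips whitespace from each text, which would fail for lists of token ids (an assumption based on the line reference); a defensive variant could be:

# Assumed guard: only call .strip() on actual strings, so lists of
# token ids pass through untouched (illustrative, not the repo's code)
texts = [t.strip() if isinstance(t, str) else t for t in texts]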