UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Running SBERT model through ONNX for variable length strings #1362

Open · amandaortega opened 2 years ago

amandaortega commented 2 years ago

Hi.

I am trying to run my SBERT model through ONNX to speed up inference. I have successfully converted the model to the .onnx format. For a single string, or for a list of strings of the same length, the sentence embeddings generated by ONNX exactly match the embeddings generated by my original SBERT model. However, for a list of variable-length strings, not all positions of the ONNX embeddings match the ones generated by the original model.
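For reference, a minimal export sketch along these lines (not necessarily the exact command used here; the model name, input names, and opset version are placeholders) declares dynamic batch and sequence axes so that the exported graph can accept padded batches of any shape:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torchscript=True makes the model return plain tuples, which traces cleanly
hf_model = AutoModel.from_pretrained(model_name, torchscript=True)
hf_model.eval()

dummy = tokenizer(["a dummy sentence"], return_tensors="pt")

torch.onnx.export(
    hf_model,
    (dummy["input_ids"], dummy["attention_mask"]),   # positional inputs
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["token_embeddings"],
    # Without dynamic axes, the graph is traced with the dummy input's
    # fixed batch size and sequence length.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_embeddings": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)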

I have run both the original and the ONNX models in batches. To run the ONNX model on strings of different lengths, I used the padding option of the tokenizer. After the ONNX model returns its output, I calculate the sentence embeddings by mean pooling the token embeddings, taking the attention mask returned by the tokenizer into account, just as recommended at https://www.sbert.net/examples/applications/computing-embeddings/README.html.

import numpy as np
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains the token embeddings
    token_embeddings = torch.from_numpy(model_output[0])
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return (sum_embeddings / sum_mask).numpy()

def run_onnx(sentences, batch_size):
    # `tokenizer` and `model` (an ONNX Runtime InferenceSession) are assumed
    # to be loaded elsewhere.
    iterator = range(0, len(sentences), batch_size)
    results = []
    for start_index in iterator:
        sentences_batch = [str(item) for item in sentences[start_index:start_index + batch_size]]

        tokens = tokenizer(sentences_batch, padding=True, truncation=True, return_tensors='pt')
        tokens_np = {name: np.atleast_2d(value) for name, value in tokens.items()}
        out = model.run(None, tokens_np)
        # The sentence embedding is the mean over the token axis, masked by the attention mask.
        sentence_embeddings = mean_pooling(out, tokens["attention_mask"])
        results.extend(sentence_embeddings)
    return np.array(results)

As I said, when I run this code with only one string or with a list of strings of the same length, the embeddings match. However, when I run it with a list of variable-length strings, they don't match, which makes me think the problem is with the padding strategy.
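One comparison that isolates padding as the variable (a sketch reusing run_onnx above; the SentenceTransformer model name is a placeholder) is to run the same sentences with batch_size=1, where nothing gets padded, and compare both results against the original model:

import numpy as np
from sentence_transformers import SentenceTransformer

sentences = ["a short one", "a somewhat longer sentence that forces padding", "mid length"]

onnx_batched = run_onnx(sentences, batch_size=32)  # padded batch
onnx_single = run_onnx(sentences, batch_size=1)    # no padding within a batch

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder
reference = st_model.encode(sentences, convert_to_numpy=True)

# If the batch_size=1 results match the reference but the batched ones do not,
# the discrepancy comes from how padded positions are handled, not from the weights.
print(np.abs(onnx_batched - onnx_single).max())
print(np.abs(onnx_single - reference).max())

If the batched results only drift in the last decimal places, that may just be ordinary floating-point noise from computing over a padded batch rather than a masking bug.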

Does anyone have any suggestions on what could cause the problem?

Thanks a lot in advance!

ArEnSc commented 2 years ago

@amandaortega I found that tokens = tokenizer(sentences_batch, padding=True, truncation=True, return_tensors='pt') doesn't pad at all. I am also experiencing that the hidden-layer outputs of the ONNX model and the sentence-transformers model do not match for the same sentence. Did you experience the same thing?
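A minimal way to check whether padding=True is actually applied is to inspect the tokenizer output directly; it pads to the longest sequence in the batch, so the shapes and attention mask make it visible (sketch; the tokenizer name is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")  # placeholder

tokens = tokenizer(["short", "a much longer sentence with more tokens"],
                   padding=True, truncation=True, return_tensors="pt")

# Both rows share one sequence length; the shorter sentence's mask ends in zeros.
print(tokens["input_ids"].shape)
print(tokens["attention_mask"])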