While testing performance against Python, I found a potential bottleneck when evaluating the model to get the token embeddings.
Julia: Around 800 ms
```julia
using Transformers
using Transformers.TextEncoders
using Transformers.HuggingFace
using BenchmarkTools

textenc, model = hgf"sentence-transformers/all-MiniLM-L6-v2"

sentences = ["This is an example sentence", "Each sentence is converted"]
sentences_encoded = encode(textenc, sentences)  # tokenize/pad outside the benchmark

@benchmark model(sentences_encoded)
```
```
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  763.074 ms … 847.289 ms  ┊ GC (min … max): 1.60% … 3.46%
 Time  (median):     798.575 ms               ┊ GC (median):    3.50%
 Time  (mean ± σ):   807.619 ms ±  34.539 ms  ┊ GC (mean ± σ):  3.19% ± 0.70%

  █ █        ██            █            ██
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁██ ▁
  763 ms           Histogram: frequency by time          847 ms <

 Memory estimate: 195.70 MiB, allocs estimate: 4719116.
```
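One small BenchmarkTools detail, per its manual's general recommendation (and unlikely to account for a 40x gap on its own): globals should be interpolated with `$` so that the cost of accessing the untyped global variable is excluded from the measurement:

```julia
# interpolate the global so only the forward pass itself is timed
@benchmark model($sentences_encoded)
```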
Python: Around 20 ms
```python
from transformers import AutoTokenizer, AutoModel
import torch
from timeit import timeit

sentences = ['This is an example sentence', 'Each sentence is converted']

textenc = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

encoded_input = textenc(sentences, padding=True, truncation=True, return_tensors='pt')

def compute():
    # python is using a pointer but ...
    model(**encoded_input)

n = 100
result = timeit("compute()", setup='from __main__ import compute', number=n)
print("Total time : %.1f ms" % (1000 * (result / n)))
```

```
Total time : 21.5 ms
```
I would expect roughly the same time, or maybe Julia being slightly faster, but it is almost 40x slower. Am I doing something wrong? Is there an explanation? Has anyone run into this before? I would appreciate any help, thank you!
The benchmark code looks correct. My initial guess is that we don't fully utilize multithreading in our implementation; I would need to investigate further.
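As a first sanity check (my suggestion, not something verified for this model), it may be worth confirming how many threads Julia and its BLAS backend are actually using, since PyTorch defaults to multithreaded intra-op parallelism across the machine's cores:

```julia
using LinearAlgebra

# Julia-level threads (set via `julia --threads=auto` or JULIA_NUM_THREADS)
println("Julia threads: ", Threads.nthreads())

# Threads used by the BLAS backend doing the matrix multiplications
println("BLAS threads:  ", BLAS.get_num_threads())

# Possible tweak to try: let BLAS use every hardware thread
BLAS.set_num_threads(Sys.CPU_THREADS)
```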