chengchingwen / Transformers.jl

Julia Implementation of Transformer models

Performance issue #165

Open · AbrJA opened this issue 10 months ago

AbrJA commented 10 months ago

While testing performance against Python, I found a potential bottleneck when evaluating the model to get the token embeddings.

Julia: Around 800 ms

using Transformers
using Transformers.TextEncoders
using Transformers.HuggingFace

textenc, model = hgf"sentence-transformers/all-MiniLM-L6-v2"
sentences = ["This is an example sentence", "Each sentence is converted"]
sentences_encoded = encode(textenc, sentences)  # tokenize and batch the inputs

using BenchmarkTools

@benchmark model(sentences_encoded)

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  763.074 ms … 847.289 ms  ┊ GC (min … max): 1.60% … 3.46%
 Time  (median):     798.575 ms               ┊ GC (median):    3.50%
 Time  (mean ± σ):   807.619 ms ±  34.539 ms  ┊ GC (mean ± σ):  3.19% ± 0.70%

  █    █                  ██                        █        ██  
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁██ ▁
  763 ms           Histogram: frequency by time          847 ms <

 Memory estimate: 195.70 MiB, allocs estimate: 4719116.
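
Side note: interpolating the globals with $ is the usual BenchmarkTools precaution against overhead from untyped globals. With calls this long the effect should be negligible, but it is cheap to rule out:

# Interpolate globals so BenchmarkTools measures only the forward pass itself.
@benchmark $model($sentences_encoded)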

Python: Around 20 ms

from transformers import AutoTokenizer, AutoModel
import torch

sentences = ['This is an example sentence', 'Each sentence is converted']

textenc = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

encoded_input = textenc(sentences, padding=True, truncation=True, return_tensors='pt')

from timeit import timeit

def compute():
    #python is using a pointer but ...
    model(**encoded_input)

n = 100
result = timeit("compute()", setup='from __main__ import compute', number=n)

print("Total time : %.1f ms" % (1000 * (result/n)))

Total time : 21.5 ms

I would expect roughly the same time, or maybe Julia being faster, but it's almost 40x slower. Am I doing something wrong? Is there an explanation for this? Has anyone run into this before?

I would appreciate any help, thank you!

chengchingwen commented 10 months ago

The benchmark code looks correct. My initial guess is that we don't fully utilize multithreading in our implementation; I'll need to investigate further.
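
One quick thing to check on the Julia side is the thread configuration: PyTorch typically uses all physical cores for intra-op parallelism by default, while Julia itself starts with a single thread unless launched with --threads (though OpenBLAS may still use several threads for the matrix multiplications). A minimal diagnostic sketch:

using LinearAlgebra

# Threads Julia was started with (controlled by the --threads flag,
# e.g. julia --threads=auto).
println("Julia threads: ", Threads.nthreads())

# Threads used by the BLAS backend for the dense matrix multiplications.
println("BLAS threads:  ", BLAS.get_num_threads())

# If BLAS is running single-threaded, matching the physical core count may help:
# BLAS.set_num_threads(Sys.CPU_THREADS)

This alone is unlikely to close a 40x gap, but it removes one obvious variable from the comparison.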