XLM model architectures are supported by Hugging Face's ONNX export (see https://huggingface.co/docs/transformers/serialization). Proof of concept:
```
python -m transformers.onnx --model="microsoft/Multilingual-MiniLM-L12-H384" onnx/
```

```python
from transformers import AutoTokenizer
from onnxruntime import InferenceSession

tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
session = InferenceSession("onnx/model.onnx")

# ONNX Runtime expects NumPy arrays as input
inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
print(outputs)
```
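Note that `last_hidden_state` is still a token-level output; to get a sentence embedding it would typically be mean-pooled over the attention mask. A minimal sketch (not part of the original PoC) that reuses `inputs` and `outputs` from the snippet above:

```python
# Sketch: mean-pool token embeddings into a sentence embedding, ignoring padding.
last_hidden_state = outputs[0]              # shape: (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None]  # shape: (batch, seq_len, 1)
sentence_embedding = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)
print(sentence_embedding.shape)             # e.g. (1, 384) for MiniLM-L12-H384
```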
Align this effort with https://github.com/Pandora-Intelligence/fast-sentence-transformers/issues/5.
Note that model quantization may degrade model performance, because the embeddings are used for downstream tasks.
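For reference, a dynamic-quantization step on the exported model could look roughly like the sketch below (the file paths are assumptions; the API is onnxruntime's built-in quantization utility):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Sketch: quantize the exported model's weights to int8 for a smaller model and
# faster CPU inference; embedding quality may drop, as noted above.
quantize_dynamic(
    model_input="onnx/model.onnx",         # exported model from the command above
    model_output="onnx/model-quant.onnx",  # assumed output path
    weight_type=QuantType.QInt8,
)
```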
We probably also need this feature: https://github.com/Pandora-Intelligence/fast-sentence-transformers/issues/7
Creating embeddings takes roughly 50% of the inference time.
`allennlp/modules/token_embedders/pretrained_transformer_embedder.py`
holds the logic for creating these embeddings. Make sure we can call it in a faster way.
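One possible direction (a sketch only; the wrapper name and integration point are hypothetical, and the input/output names assume the standard `transformers` ONNX export used above) is to replace the PyTorch forward call inside the embedder with an ONNX Runtime session:

```python
from onnxruntime import InferenceSession


class OnnxTransformerEncoder:
    """Hypothetical drop-in for the transformer forward call used by
    PretrainedTransformerEmbedder, backed by ONNX Runtime."""

    def __init__(self, onnx_path: str):
        self.session = InferenceSession(onnx_path)

    def __call__(self, input_ids, attention_mask, token_type_ids=None):
        # Convert torch tensors to NumPy, run the ONNX graph,
        # and return the token-level last_hidden_state.
        feed = {
            "input_ids": input_ids.cpu().numpy(),
            "attention_mask": attention_mask.cpu().numpy(),
        }
        if token_type_ids is not None:
            feed["token_type_ids"] = token_type_ids.cpu().numpy()
        (last_hidden_state,) = self.session.run(["last_hidden_state"], feed)
        return last_hidden_state
```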