XLM model architectures are supported by Hugging Face's ONNX export (see https://huggingface.co/docs/transformers/serialization). Proof of concept:
```
python -m transformers.onnx --model="microsoft/Multilingual-MiniLM-L12-H384" onnx/
```

```python
from transformers import AutoTokenizer
from onnxruntime import InferenceSession

tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
session = InferenceSession("onnx/model.onnx")

# ONNX Runtime expects NumPy arrays as input
inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
print(outputs)
```
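Note that `last_hidden_state` is still a token-level output; to get a sentence embedding it would typically be mean-pooled over the attention mask. A minimal sketch (not part of the original PoC) that reuses `inputs` and `outputs` from the snippet above:

```python
# Sketch: mean-pool token embeddings into a sentence embedding, ignoring padding.
last_hidden_state = outputs[0]              # shape: (batch, seq_len, hidden)
mask = inputs["attention_mask"][..., None]  # shape: (batch, seq_len, 1)
sentence_embedding = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)
print(sentence_embedding.shape)             # e.g. (1, 384) for MiniLM-L12-H384
```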
Align this effort with https://github.com/Pandora-Intelligence/fast-sentence-transformers/issues/5.
Note that model quantization may degrade model performance, because the embeddings are used for downstream tasks.
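For reference, a dynamic-quantization step on the exported model could look roughly like the sketch below (the file paths are assumptions; the API is onnxruntime's built-in quantization utility):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Sketch: quantize the exported model's weights to int8 for a smaller model and
# faster CPU inference; embedding quality may drop, as noted above.
quantize_dynamic(
    model_input="onnx/model.onnx",         # exported model from the command above
    model_output="onnx/model-quant.onnx",  # assumed output path
    weight_type=QuantType.QInt8,
)
```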
We probably also need this feature: https://github.com/Pandora-Intelligence/fast-sentence-transformers/issues/7
Creating embeddings takes roughly 50% of the inference time.
`allennlp/modules/token_embedders/pretrained_transformer_embedder.py`
holds the logic for creating these embeddings. Make sure we can call it in a faster way.
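One possible direction (a sketch only; the wrapper name and integration point are hypothetical, and the input/output names assume the standard `transformers` ONNX export used above) is to replace the PyTorch forward call inside the embedder with an ONNX Runtime session:

```python
from onnxruntime import InferenceSession


class OnnxTransformerEncoder:
    """Hypothetical drop-in for the transformer forward call used by
    PretrainedTransformerEmbedder, backed by ONNX Runtime."""

    def __init__(self, onnx_path: str):
        self.session = InferenceSession(onnx_path)

    def __call__(self, input_ids, attention_mask, token_type_ids=None):
        # Convert torch tensors to NumPy, run the ONNX graph,
        # and return the token-level last_hidden_state.
        feed = {
            "input_ids": input_ids.cpu().numpy(),
            "attention_mask": attention_mask.cpu().numpy(),
        }
        if token_type_ids is not None:
            feed["token_type_ids"] = token_type_ids.cpu().numpy()
        (last_hidden_state,) = self.session.run(["last_hidden_state"], feed)
        return last_hidden_state
```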