jiqing-feng opened this pull request 4 days ago
Hello!
Thanks for this PR - it's quite extensive already. I ran some tests with IPEX on WSL (as I'm on Windows currently) and translated the performance gains relative to fp32 to my normal benchmark graph here:
In essence, the performance gain is seemingly not very substantial, staying behind ONNX and OpenVINO. I'm curious whether this roughly matches your expectations. As it stands right now, IPEX doesn't seem like an improvement over ONNX/OpenVINO, and it might result in the backends becoming more complex without any notable gain. I only tested CPU - according to https://huggingface.co/docs/optimum/en/intel/ipex/inference, that's all that's supported via Optimum right now.
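For context, the comparison above is between the existing `onnx` and `openvino` backends and the `ipex` backend this PR adds. A minimal sketch of how the backends are selected (the model name is just an illustrative choice, not the one used in the benchmark):

```python
# Illustrative only: loading the same model with the backends being compared.
# "onnx" and "openvino" already exist in Sentence Transformers; "ipex" is added by this PR.
from sentence_transformers import SentenceTransformer

onnx_model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx", device="cpu")
openvino_model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino", device="cpu")
ipex_model = SentenceTransformer("all-MiniLM-L6-v2", backend="ipex", device="cpu")
```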
I also had some issues with running the `IPEXModel` initially because the model itself was loaded on CPU whereas the `dummy_inputs` were moved to `self._device` in `optimum-intel` - which was automatically set to `cuda`, as my machine has a CUDA-enabled GPU and my `torch` was installed with CUDA support. I have a feeling I can't fix that in Sentence Transformers via some parameters.
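A possible stopgap, assuming the mismatch really comes from automatic CUDA detection in `optimum-intel` (this is not part of the PR, just a sketch of a workaround):

```python
# Sketch of a workaround: hide the GPU before torch / optimum-intel are imported,
# so the automatically detected device resolves to CPU instead of cuda.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before the first `import torch`

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens", device="cpu", backend="ipex")
```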
Another bottleneck that I'm personally noticing is that installing all `optimum` backends into one virtual environment becomes quite constraining. For example, my CI won't test the latest `transformers` because `optimum-intel[ipex]` doesn't support it yet. So I may have to separate my tests per backend and create CI runners for each backend. That way I can still test the latest `transformers`, for example.
cc @echarlaix @IlyasMoutawwakil as I believe you've both briefly worked on IPEX in Optimum Intel.
Thanks for the detailed feedback @tomaarsen! Yes, that makes sense. @jiqing-feng, I think we can instead add new integrations to `optimum-intel` directly. cc @IlyasMoutawwakil
Hi @tomaarsen, thanks for your benchmarking! There are 2 main issues in your comment: performance and the test constraints, so I suppose you will consider merging this PR after the 2 issues are solved? (I am fixing the device issue, which can be easily fixed in optimum-intel.)
Hi @echarlaix, sentence-transformers is also in our ipex scope; we aim to upstream ipex support in sentence-transformers. As you know, optimum-intel ipex is under a big refactoring. I found that `IPEXModel` plus this PR is enough for sentence-transformers, so we don't plan to integrate the sentence-transformers models specifically in optimum-intel. What we need is to fix the `transformers` version compatibility and the performance. Please let me know your concerns. Thanks!
Hi @tomaarsen , can you share your benchmark script? Thanks!
Hi @tomaarsen. I used `evaluation_inference_speed.py` for benchmarking and made some small changes:

Command: `python evaluation_inference_speed.py`
```python
import sys
import time

import torch
from datasets import load_dataset

from sentence_transformers import SentenceTransformer
from optimum.intel.utils.modeling_utils import bind_cores_for_best_perf

bind_cores_for_best_perf()

model_name = sys.argv[1] if len(sys.argv) > 1 else "bert-base-nli-mean-tokens"

# Load a sentence transformer model
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "cpu"}
model = SentenceTransformer(model_name, model_kwargs=model_kwargs, device="cpu", backend="ipex")

max_sentences = 100_000
all_nli_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")
sentences = list(set(all_nli_dataset["anchor"]))[:max_sentences]

print("Model Name:", model_name)
print("Number of sentences:", len(sentences))

for i in range(3):
    print("Run", i)
    start_time = time.time()
    emb = model.encode(sentences, batch_size=32)
    end_time = time.time()
    diff_time = end_time - start_time
    print(f"Done after {diff_time:.2f} seconds")
    print(f"Speed: {len(sentences) / diff_time:.2f} sentences / second")
    print("=====")
```
The results show a speed ratio of ipex / torch = 1.6. Data collected on an Intel 4th Gen Xeon.
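For reference, a minimal sketch of how such a ratio can be derived, assuming the baseline is the same script with the default `torch` backend at fp32 (model name, corpus, and batch size below are illustrative, not the actual benchmark settings):

```python
# Sketch: measure sentences/second for the default torch backend and for ipex,
# then report the ratio between the two throughputs.
import time

import torch
from sentence_transformers import SentenceTransformer


def throughput(backend: str, dtype, sentences: list[str]) -> float:
    model = SentenceTransformer(
        "bert-base-nli-mean-tokens",
        device="cpu",
        backend=backend,
        model_kwargs={"torch_dtype": dtype},
    )
    start = time.time()
    model.encode(sentences, batch_size=32)
    return len(sentences) / (time.time() - start)


sentences = ["An example sentence."] * 10_000  # placeholder corpus
torch_sps = throughput("torch", torch.float32, sentences)
ipex_sps = throughput("ipex", torch.bfloat16, sentences)
print(f"Speed ratio ipex / torch: {ipex_sps / torch_sps:.2f}")
```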
We will fix the device and `transformers` version issues ASAP. Before that, please help verify the performance. I suppose HF has access to Intel 4th Gen Xeon; do you mind validating on the PVC node (the PVC's CPU is a 4th Gen Xeon)?
The script that I used is quite messy, i.e. something like this:
I've run this for a lot of different backend types, 4 different models, and 3 datasets. See more details in https://sbert.net/docs/sentence_transformer/usage/efficiency.html#benchmarks. I'm using an i7-17300K CPU for the CPU tests, i.e. consumer-grade hardware.
I'm running your script now, with `bind_cores_for_best_perf` (I didn't use that one previously). I see it also requires `pip install py-libnuma`.
It seems that my hardware does not support the instructions required for `torch.bfloat16`:

```
AssertionError: BF16 weight prepack needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq, but the desired instruction sets are not available. Please set dtype to torch.float or set weights_prepack to False.
```

or `torch.float16`:

```
AssertionError: FP16 weight prepack needs the cpu support avx_ne_convert or avx512_core_fp16, but the desired instruction sets are not available. Please set dtype to torch.float or set weights_prepack to False.
```
Only with `torch.float` does it work correctly - and there it gives a small performance improvement of around 3%. I was also under the impression that this was running in float16 due to some of the warnings I saw, but I suspect that it actually ran in fp32. I didn't realise that recent hardware was required to get the performance gain, but it makes sense given that only recent hardware can run BF16.
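A rough guard against this could be to pick the dtype based on what the CPU reports. This is only a sketch: `torch.backends.cpu.get_cpu_capability()` (available in recent PyTorch) is a coarse proxy and does not check the exact ISA flags the assertions above mention.

```python
# Sketch: fall back to fp32 when the CPU does not report AVX512 support.
# This only approximates the avx512bw/avx512vl/avx512dq/avx_ne_convert checks
# that IPEX performs for BF16/FP16 weight prepacking.
import torch

capability = torch.backends.cpu.get_cpu_capability()  # e.g. "AVX2", "AVX512"
torch_dtype = torch.bfloat16 if capability == "AVX512" else torch.float32

model_kwargs = {"torch_dtype": torch_dtype, "device_map": "cpu"}
```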
I'll try and get access to an Intel 4th Gen Xeon CPU.
And yes, if it's possible to get bf16 performance preservation (e.g. 99.9%+) with ~1.6x speedup, then I'll definitely consider merging this. If we can make that work, then I'll try and fix the tests issue that I mentioned. Some questions:
For your question:
> Hi @echarlaix sentence-transformers is also in our ipex scope, we aim to upstream ipex in sentence-transformers. As you know optimum-intel ipex is under big refactoring,
Yes, and it would make sense to wait for the refactoring in https://github.com/huggingface/optimum-intel/pull/1009 before doing a benchmark, @jiqing-feng.
This PR enables the ipex backend. Script: