-
Here is my understanding of the current state of things and what I think we should do to make our lower-bit kernels more performant at both small and large batch sizes. I'm making this an RFC …
-
### Checklist
- [X] I've checked that there is no other issue about this feature request.
- [X] This issue contains only one feature request.
- [X] The title of this issue accurately describes the fe…
-
Since ba01ad37, LoRAs loaded in 8-bit alongside the Q8_0 GGUF generate at poor quality. Loading the LoRA in 16-bit appears to fix this issue, but there are subtle differences in the generations from rounding…
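The rounding effect described above can be reproduced in isolation with a toy Q8_0-style round trip. This is only a sketch of per-block absmax int8 quantization; the actual llama.cpp kernel layout differs in detail:

```python
import numpy as np

def q8_0_roundtrip(w, block=32):
    """Quantize to Q8_0-style blocks (per-block fp scale + int8) and back.
    Illustrative sketch only, not the real llama.cpp implementation."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
err = np.abs(q8_0_roundtrip(w) - w).max()
print(f"max round-trip error: {err:.5f}")  # small but nonzero
```

The error is bounded by half the per-block scale, which is why 16-bit LoRA weights sidestep it but still differ subtly from 8-bit runs.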
-
While testing the --load-in-low-bit feature with the vLLM for CPU example, I noticed that the model is not optimized based on this option.
I found that it needs to pass the load_in_low_bit ar…
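A minimal sketch of the fix being described: the CLI flag has to be forwarded into model loading rather than silently dropped. Argument and parameter names here are hypothetical and may not match the actual example script:

```python
import argparse

def build_parser():
    # Hypothetical sketch: the real script's argument names may differ.
    p = argparse.ArgumentParser()
    p.add_argument("--load-in-low-bit", dest="load_in_low_bit",
                   default=None, help="e.g. sym_int4, fp8")
    return p

args = build_parser().parse_args(["--load-in-low-bit", "sym_int4"])
# The fix: forward the parsed value into the model-loading kwargs
# instead of ignoring it after parsing.
model_kwargs = {"load_in_low_bit": args.load_in_low_bit}
print(model_kwargs)
```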
-
Every time I run the test, it loads the original model and converts it to lower bit.
If we load a 34B model on 4 ARC cards, it takes a long time to convert the model and also needs a huge number o…
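The convert-once, reuse-later workflow being asked for can be sketched generically: quantize on the first run, then load the cached low-bit artifact on later runs. All names, the 4-bit scheme, and the cache layout below are illustrative assumptions, not the framework's actual mechanism:

```python
import numpy as np
from pathlib import Path

def load_converted(weight_path, cache_dir="converted_cache"):
    """Sketch: quantize fp32 weights to 4-bit symmetric absmax once,
    cache the result, and reuse the cache on subsequent runs."""
    cache = Path(cache_dir) / (Path(weight_path).stem + ".npz")
    if cache.exists():
        d = np.load(cache)            # fast path: skip reconversion
        return d["q"], d["scale"]
    w = np.load(weight_path)          # slow path: original fp32 weights
    scale = max(np.abs(w).max() / 7.0, 1e-8)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    cache.parent.mkdir(parents=True, exist_ok=True)
    np.savez(cache, q=q, scale=scale)
    return q, scale
```

With something like this, only the first launch pays the conversion cost; every later launch reads the much smaller int8 artifact.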
-
I tried many UNet settings, such as dev-fp16 with automatic Diffusion in Low Bits, dev-fp16 with fp8e4m3fn in Low Bits, and dev-fp8_e4m3fn with automatic Diffusion in Low Bits, but for every single UNet settin…
-
Dear IPEX Team,
I was wondering if there is a way to save a model that has been optimised and quantised in its new state, for future loading, for HF/PyTorch models.
I noticed there was a method i…
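Independent of whatever framework-specific save method the snippet above is referring to, the quantised state can always be persisted generically: serialize the int8 tensors plus their scales yourself and restore them on load. This is a framework-agnostic sketch, not the IPEX API:

```python
import numpy as np

def save_quantized(state, path):
    """state: {name: (int8 ndarray, float scale)} -> one .npz on disk."""
    flat = {}
    for name, (q, s) in state.items():
        flat[f"{name}.q"] = q
        flat[f"{name}.scale"] = np.float32(s)
    np.savez(path, **flat)

def load_quantized(path):
    """Inverse of save_quantized: rebuild the {name: (q, scale)} dict."""
    d = np.load(path)
    names = {k[:-2] for k in d.files if k.endswith(".q")}
    return {n: (d[f"{n}.q"], float(d[f"{n}.scale"])) for n in names}
```

Dequantizing at load time (`q * scale`) then reproduces the optimised weights without redoing the quantisation pass.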
-
```
What steps will reproduce the problem?
1.just running a benchmark
2.
3.
What is the expected output? What do you see instead?
Computed 2274.76 PMKs/s total.
#1: 'CUDA-Device #1 'GeForce 8400 GS''…