-
Hello,
I have tried running the Llama2 model with 3-bit and 4-bit quantization. Is there a way to apply and run an INT8-quantized Llama2 model on AMD?
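For context, what I have in mind is something like the following sketch, assuming a ROCm build of PyTorch and a bitsandbytes version with ROCm support (the model ID is illustrative):

```python
# Hedged sketch: load Llama2 with 8-bit weight quantization via
# transformers + bitsandbytes. Assumes a ROCm-enabled PyTorch build and a
# bitsandbytes release that supports ROCm; the model ID is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```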
Regards,
Ashima
-
Hi,
I trained a Keras model to extract gray-level segmentation maps.
I converted the model to TFLite and quantized it.
The quantized model produces similar results on CPU and DSP hardware, if th…
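For reference, a minimal sketch of full-integer post-training quantization in TFLite (a toy model and random data stand in for my real network and calibration set):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the trained segmentation model.
keras_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

def representative_dataset():
    # A few hundred typical inputs; random data here only for the sketch.
    for _ in range(100):
        yield [np.random.rand(1, 64, 64, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 kernels so the same model can run on an integer-only DSP.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
```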
-
### Describe the issue
1. Tried running https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/examples/cpu/inference/python/llm/run.py to generate the q_config_summary file
2. Then…
-
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("echarlaix/stable-diffusion-v1-5-inc-int8-dynamic").to("cpu")
# for reducing memory consumption get a…
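# Hypothetical continuation (not in the original snippet): run a short
# text-to-image generation on CPU; prompt and step count are illustrative.
prompt = "a photo of an astronaut riding a horse"
image = pipe(prompt, num_inference_steps=20).images[0]
image.save("astronaut.png")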
-
### 🐛 Describe the bug
Hello,
I'm using the QuantTrainModule to train a MobileNetV2 model (using the MobileNetV2 class in this repo), and the quantized checkpoints have 32-bit floating-point weigh…
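One way to show what I'm seeing is to dump the dtypes stored in the saved checkpoint; a minimal sketch (the checkpoint path is a placeholder):

```python
# Inspect the dtypes of the tensors stored in a checkpoint.
# "checkpoint.pth" is a placeholder path.
import torch

state_dict = torch.load("checkpoint.pth", map_location="cpu")
# Some trainers nest the weights under a "state_dict" key.
state_dict = state_dict.get("state_dict", state_dict)

for name, tensor in state_dict.items():
    if torch.is_tensor(tensor):
        print(f"{name}: {tensor.dtype} {tuple(tensor.shape)}")
```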
-
### What happened?
Hello,
I've been experimenting with some Olive passes on a custom model containing a transformer and some extra layers. Using the passes seems to slow down both the throughput and …
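For what it's worth, this is roughly how I'd measure the slowdown; a hedged sketch assuming onnxruntime as the backend (model paths, input name, and input shape are placeholders):

```python
# Hypothetical timing harness comparing latency before/after the Olive passes.
# Model paths, the input name, and the input shape are placeholders.
import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(model_path, feed, runs=50):
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    for _ in range(5):  # warm-up
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000

feed = {"input_ids": np.random.randint(0, 1000, (1, 128), dtype=np.int64)}
print("baseline :", mean_latency_ms("model.onnx", feed), "ms")
print("optimized:", mean_latency_ms("model_olive.onnx", feed), "ms")
```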
-
Error log:
```
Generating train split: 3457 examples [00:00, 14292.20 examples/s]
Map (num_proc=32): 0%| | 0/3457 [00:00
```
-
Hi,
Huge fan of your work. I was wondering: in your code, are you using 4-bit or 8-bit quantization for the LoRA?
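For anyone comparing, the two setups usually differ only in the quantization config passed when loading the base model; an illustrative sketch with transformers + bitsandbytes + peft (the LoRA hyperparameters are made up):

```python
# Illustrative only: common 8-bit vs. 4-bit (QLoRA-style) load configs.
# The LoRA hyperparameters below are made up for the example.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

int8_config = BitsAndBytesConfig(load_in_8bit=True)
int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
)

# The LoRA adapters themselves stay in floating point either way.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
```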
-
Hi maintainers @yanboliang @Chillee ,
I saw that Int8 weight-only quantization is enabled for Mixtral 8x7B, and the next step should be supporting int4 and int4-gptq.
May I know the timeline of enabli…
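For context, int8 weight-only quantization keeps activations in floating point and stores only the weights as int8 with per-channel scales; a minimal sketch of the idea in plain PyTorch (not the Mixtral implementation):

```python
# Minimal sketch of symmetric per-output-channel int8 weight-only
# quantization for a linear layer (illustrative, not tuned for speed).
import torch

def quantize_weight_int8(w: torch.Tensor):
    # w: (out_features, in_features); one scale per output channel.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x, q, scale, bias=None):
    # Dequantize on the fly; activations stay in floating point.
    return torch.nn.functional.linear(x, q.to(x.dtype) * scale, bias)

w = torch.randn(256, 512)
q, s = quantize_weight_int8(w)
x = torch.randn(4, 512)
print("max abs error:", (int8_linear(x, q, s) - x @ w.t()).abs().max().item())
```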
-
Is it due to mel.n_len = 3000 being the max for a single inference? If you feed some of the longer samples that whisper.cpp uses, I presume it's the mel.n_len = 3000 cap, as I know they are much longer.
``…