-
I have a Gemma 2 9B model that I quantized with AWQ-4bit; the model size is 5.9 GB. I set kv_cache_free_gpu_mem_fraction to 0.01 and run Triton on one A100, but Triton takes 10748 MiB of GPU memory. I expe…
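For context, a rough back-of-envelope of where that memory could be going, assuming an A100-80GB and that `kv_cache_free_gpu_mem_fraction` is applied to the GPU memory left free after the engine is loaded (the documented TensorRT-LLM behaviour); any number not quoted above is an assumption:

```py
# Back-of-envelope estimate, not a measurement.
observed_gib = 10748 / 1024     # ~10.5 GiB reported for the Triton process
weights_gib = 5.9               # AWQ-4bit Gemma 2 9B engine, from above
gpu_total_gib = 80.0            # assumption: A100-80GB
kv_fraction = 0.01              # kv_cache_free_gpu_mem_fraction

kv_pool_gib = kv_fraction * (gpu_total_gib - weights_gib)   # ~0.74 GiB
other_gib = observed_gib - weights_gib - kv_pool_gib        # ~3.9 GiB
print(f"KV-cache pool ~{kv_pool_gib:.2f} GiB, "
      f"~{other_gib:.1f} GiB left for CUDA context, activations and runtime buffers")
```

So even with a tiny KV-cache fraction, the process footprint is weights plus the KV pool plus a few GiB of CUDA context and engine activation/workspace buffers, not just the 5.9 GB of weights.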
-
```py
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transform…
```
-
WARNING: LoadImageBatch.IS_CHANGED() got an unexpected keyword argument 'node_id'
D:\ComfyUI_windows\ComfyUI\models\clip\siglip-so400m-patch14-384
D:\ComfyUI_windows\ComfyUI\models\LLM\Meta-Llama-3.1-…
-
An error occurs when loading models in a for loop, as shown below.
What could be the problem?
```py
for peft_model_id in peft_model_ids:
    print(peft_model_id)
    model, tokenizer =…
```
-
Did you run experiments with 4-bit weight quantization? And/or did you try 4-bit activation quantization? If so, I'd be curious about the results; if not, why not?
-
Hi,
Thanks for releasing Grok! Is there any chance we could load the model in 4-bit given how large it is? Do you know if bitsandbytes support is planned (cc @timdettmers)?
Thanks!
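For reference, a minimal sketch of what 4-bit loading via bitsandbytes looks like in transformers, assuming a transformers-compatible checkpoint is available (the repo id below is only a placeholder):

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "xai-org/grok-1"  # placeholder; assumes a transformers-compatible checkpoint

# Standard bitsandbytes 4-bit setup: NF4 weights with bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # shard / offload across available devices
)
```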
-
@danielhanchen Hi Daniel, thanks for your work!
I'm having an error just like in issue #275, but this time while trying to save a tuned version of unsloth/gemma-2-9b-it-bnb-4bit.
>> model.save_p…
-
Hi all,
I hit the following exception when trying to run the Gradio example with: `python -m mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-72B-Instruct-4bit`
```
.../mlx_vlm/chat_ui.py", line …
```
-
When I load the model as follows, it throws the error: `Cannot merge LORA layers when the model is loaded in 8-bit mode`
How can I load the model in 4-bit for inference?
```py
model_path = 'decapoda-resea…
```
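As a minimal sketch of one workaround: load the base model in 4-bit with bitsandbytes and keep the LoRA adapter attached instead of merging it (merging into quantized weights is what triggers the error above); the paths below are placeholders:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_path = "path/to/base-model"       # placeholder
lora_adapter_path = "path/to/lora-adapter"   # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Attach the adapter and skip merge_and_unload(): merging into quantized
# weights is unsupported, but it is not required for inference.
model = PeftModel.from_pretrained(base, lora_adapter_path)
model.eval()
```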
-
# Problem statement
LLM workloads oriented toward best latency are memory-bound: inference speed is limited by access to the model weights through DDR memory. That's why the major optimization technique is weights compres…
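As a hedged illustration of that claim, a roofline-style estimate for batch-1 decoding, where every generated token has to stream roughly all weights from memory (the bandwidth and model size below are assumptions, not measurements):

```py
# Memory-bound decoding: time_per_token ~ weight_bytes / memory_bandwidth
def tokens_per_second(n_params: float, bytes_per_weight: float, bandwidth_gb_s: float) -> float:
    weight_gb = n_params * bytes_per_weight / 1e9
    return bandwidth_gb_s / weight_gb

# Assumed example: 7B parameters, ~100 GB/s of effective DDR bandwidth
print(tokens_per_second(7e9, 2.0, 100))   # fp16 weights  -> ~7 tok/s
print(tokens_per_second(7e9, 0.5, 100))   # 4-bit weights -> ~29 tok/s
```

Compressing weights to 4 bits cuts the bytes moved per generated token by roughly 4x, which is where most of the latency win comes from.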