PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[sparsetral and Qwen2idae]: support for mixture of LoRA #330

Open sorasoras opened 7 months ago

sorasoras commented 7 months ago

The model to consider.

https://huggingface.co/serpdotai/sparsetral-16x7B-v2-SPIN_iter1 https://huggingface.co/LoneStriker/sparsetral-16x7B-v2-8.0bpw-h8-exl2/tree/main

https://huggingface.co/hywu/Qwen2idae-16x14B-v1.0

The closest model Aphrodite already supports.

Mixtral-style MoE, but not quite the same.

What's your difficulty of supporting the model you want?

https://arxiv.org/abs/2401.02731

This is a model with 16 LoRA adapters that act as experts.
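For context, the paper's recipe keeps the dense weights frozen and bolts a router plus several low-rank (LoRA-style) adapter experts onto each block, activating only the top-k per token. A minimal PyTorch sketch of that pattern follows; the names and sizes (MoLoRALinear, num_experts=16, top_k=4, rank) are illustrative assumptions, not Sparsetral's actual code.

# Hedged sketch of the "mixture of LoRA experts" idea from the paper above,
# not Sparsetral's implementation: a frozen dense linear plus a router and
# num_experts low-rank adapters, with only the top_k experts applied per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 16, top_k: int = 4, rank: int = 16):
        super().__init__()
        self.base = base                      # frozen pretrained weight
        self.top_k = top_k
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.lora_a = nn.ModuleList(
            [nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)])
        self.lora_b = nn.ModuleList(
            [nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        gates = F.softmax(self.router(x), dim=-1)               # (..., num_experts)
        weights, idx = torch.topk(gates, self.top_k, dim=-1)    # pick top_k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for slot in range(self.top_k):
            for e, (a, b) in enumerate(zip(self.lora_a, self.lora_b)):
                mask = (idx[..., slot] == e).unsqueeze(-1)      # tokens routed to expert e
                if mask.any():
                    out = out + mask * weights[..., slot:slot + 1] * b(a(x))
        return out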

python -m aphrodite.endpoints.openai.api_server --model /mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/ -tp 2 --api-keys sk-example --trust-remote-code
You are using a model of type sparsetral to instantiate a model of type mistral. This is not supported for all configurations of models and can yield errors.
2024-03-16 16:21:04,398 INFO worker.py:1724 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 673, in from_engine_args
    placement_group = initialize_cluster(parallel_config,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/ray_tools.py", line 111, in initialize_cluster
    raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.
sora@DESKTOP-CJNM4D3:~/aphrodite-engine/examples$ python -m aphrodite.endpoints.openai.api_server --model /mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/ -tp 2 --api-keys sk-example --trust-remote-code
You are using a model of type sparsetral to instantiate a model of type mistral. This is not supported for all configurations of models and can yield errors.
2024-03-16 16:42:50,177 INFO worker.py:1724 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 673, in from_engine_args
    placement_group = initialize_cluster(parallel_config,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/ray_tools.py", line 111, in initialize_cluster
    raise ValueError(
ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.

I tried to run it, but it does not seem to work.

AlpinDale commented 7 months ago

I will take a closer look, but FYI, exl2 quants do not work with multi-gpu setups. It's the only quant with that limitation.

sorasoras commented 7 months ago

I will take a closer look, but FYI, exl2 quants do not work with multi-gpu setups. It's the only quant with that limitation.

It's a single-P40 setup inside WSL2, so I don't know why it raises a ValueError like that.

AlpinDale commented 7 months ago

That would be the -tp 2 in your command. Please see here for a full list of the commands and what they do.

sorasoras commented 7 months ago

That would be the -tp 2 in your command. Please see here for a full list of the commands and what they do.

python -m aphrodite.endpoints.openai.api_server --model /mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/ -tp 1 --api-keys sk-example --trust-remote-code --dtype float32 --kv-cache-dtype fp8_e5m2
You are using a model of type sparsetral to instantiate a model of type mistral. This is not supported for all configurations of models and can yield errors.
INFO:     CUDA_HOME is not found in the environment. Using /usr/local/cuda as CUDA_HOME.
INFO:     Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may cause slight accuracy drop. Currently we only support fp8 without scaling factors and make e5m2 as a default format.
INFO:     Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO:     Model = '/mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/'
INFO:     DataType = torch.float32
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = fp8_e5m2
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 115, in __init__
    self._init_workers()
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 157, in _init_workers
    self._run_workers("load_model")
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/task_handler/worker.py", line 112, in load_model
    self.model_runner.load_model()
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 121, in load_model
    self.model = get_model(self.model_config, self.device_config,
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 47, in get_model
    model_class = _get_model_architecture(model_config)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 39, in _get_model_architecture
    raise ValueError(
ValueError: Model architectures ['modeling_sparsetral.MistralForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'OLMoForCausalLM', 'OPTForCausalLM', 'PhiForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM']

I think I got it running, but 'modeling_sparsetral.MistralForCausalLM' is not supported for now.

AlpinDale commented 7 months ago

You can probably remove the modeling_sparsetral part from the model's config.json, it may work, but it'll skip all the MoE stuff. Same is happening with that exl2 quant I imagine, because exl2 doesn't support this arch.
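If anyone wants to try that edit, here is a minimal sketch of it (the path is illustrative, and the auto_map/model_type changes are guesses at what else may be needed, not confirmed steps):

# Hedged sketch: strip the custom "modeling_sparsetral." prefix from config.json
# so the checkpoint resolves to the stock MistralForCausalLM class. As noted
# above, this skips the MoE adapter weights entirely. The path is illustrative.
import json

cfg_path = "/mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["architectures"] = ["MistralForCausalLM"]  # was ["modeling_sparsetral.MistralForCausalLM"]
cfg.pop("auto_map", None)                      # guess: drop the remote-code hook if present
cfg["model_type"] = "mistral"                  # guess: silence the "sparsetral vs mistral" warning

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)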

sorasoras commented 7 months ago

You can probably remove the modeling_sparsetral part from the model's config.json, it may work, but it'll skip all the MoE stuff. Same is happening with that exl2 quant I imagine, because exl2 doesn't support this arch.

I have another question regarding Qwen1.5/Qwen1 in general. When I try to load a Qwen1.5 (or Qwen1) GGUF directly with python -m aphrodite.endpoints.openai.api_server --model sakura0.9_13B_Qwen1.5_Q5KS_1.2.gguf -tp 1 --api-keys sk-example I get

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 670, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/args_tools.py", line 318, in create_engine_configs
    model_config = ModelConfig(
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/common/config.py", line 116, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/config.py", line 86, in get_config
    return extract_gguf_config(model)
  File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/config.py", line 28, in extract_gguf_config
    raise RuntimeError(f"Unsupported architecture {architecture}")
RuntimeError: Unsupported architecture qwen2

I guess I need to convert it to PTH before using it?

AlpinDale commented 7 months ago

Works fine with the FP16 model. Can you link me to the gguf if it's public?

sorasoras commented 7 months ago

Works fine with the FP16 model. Can you link me to the gguf if it's public?

https://huggingface.co/shing3232/Sakura13B-LNovel-v0.9-qwen1.5-GGUF-IMX/blob/main/sakura0.9_13B_Qwen1.5_Q5KS_1.2.gguf

AlpinDale commented 7 months ago

Ah I see what the issue is.

We're using a custom GGUF model parser in aphrodite, so it means everything needs to be hand-written and implemented for every model arch. Llama, mistral, et al. fall under the llama category of models in llama.cpp, so their tensors and configs match with every llama model. Models like qwen2, command-r, etc, are supported by llama.cpp but use different names for tensors. To add support for these, we'd have to handle every model individually. I haven't gotten around to doing it yet, it'd need a fair bit of work. If you (or anyone else) would like to contribute for that, I'd start looking at these two places:

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/transformers_utils/tokenizer.py#L16-L65

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/modeling/hf_downloader.py#L208-L281
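To make the scope concrete, per-arch support is mostly a name translation layer. Below is a rough sketch for qwen2, assuming the usual GGUF tensor names (token_embd, blk.N.attn_q, ffn_gate, ...) and the HF-style names the qwen2 model class expects; the function and the exact mapping are illustrative, not aphrodite's API.

# Hedged sketch: translate qwen2 GGUF tensor names into HF-style parameter
# names. The GGUF naming scheme and the mapping below are assumptions for
# illustration only.
import re

GGUF_TO_HF_QWEN2 = {
    "token_embd":  "model.embed_tokens",
    "output_norm": "model.norm",
    "output":      "lm_head",
    "attn_norm":   "input_layernorm",
    "attn_q":      "self_attn.q_proj",
    "attn_k":      "self_attn.k_proj",
    "attn_v":      "self_attn.v_proj",
    "attn_output": "self_attn.o_proj",
    "ffn_norm":    "post_attention_layernorm",
    "ffn_gate":    "mlp.gate_proj",
    "ffn_up":      "mlp.up_proj",
    "ffn_down":    "mlp.down_proj",
}

def map_qwen2_tensor_name(gguf_name: str) -> str:
    """Translate e.g. 'blk.3.attn_q.bias' -> 'model.layers.3.self_attn.q_proj.bias'."""
    m = re.match(r"blk\.(\d+)\.(\w+)\.(weight|bias)", gguf_name)
    if m:
        layer, short, kind = m.groups()
        return f"model.layers.{layer}.{GGUF_TO_HF_QWEN2[short]}.{kind}"
    short, kind = gguf_name.rsplit(".", 1)
    return f"{GGUF_TO_HF_QWEN2[short]}.{kind}"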

sorasoras commented 7 months ago

Ah I see what the issue is.

We're using a custom GGUF model parser in aphrodite, so it means everything needs to be hand-written and implemented for every model arch. Llama, mistral, et al. fall under the llama category of models in llama.cpp, so their tensors and configs match with every llama model. Models like qwen2, command-r, etc, are supported by llama.cpp but use different names for tensors. To add support for these, we'd have to handle every model individually. I haven't gotten around to doing it yet, it'd need a fair bit of work. If you (or anyone else) would like to contribute for that, I'd start looking at these two places:

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/transformers_utils/tokenizer.py#L16-L65

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/modeling/hf_downloader.py#L208-L281

I can't offer much help with the coding, but I was thinking about https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py: that script converts HF to GGUF, so what if it could be done in reverse? Anyway, thanks for the hard work.

bash99 commented 6 months ago

Ah I see what the issue is.

We're using a custom GGUF model parser in aphrodite, so it means everything needs to be hand-written and implemented for every model arch. Llama, mistral, et al. fall under the llama category of models in llama.cpp, so their tensors and configs match with every llama model. Models like qwen2, command-r, etc, are supported by llama.cpp but use different names for tensors. To add support for these, we'd have to handle every model individually. I haven't gotten around to doing it yet, it'd need a fair bit of work. If you (or anyone else) would like to contribute for that, I'd start looking at these two places:

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/transformers_utils/tokenizer.py#L16-L65

https://github.com/PygmalionAI/aphrodite-engine/blob/ed225f59cb2935e8fa102a27a13ec1e24c6bcdfc/aphrodite/modeling/hf_downloader.py#L208-L281

The vllm-gptq fork supports Qwen 1.5 in GGUF format; is there anything we can take from them?

AlpinDale commented 6 months ago

@bash99 we have a PR up to fix this and support arbitrary GGUF models.

bash99 commented 6 months ago

@bash99 we have a PR up to fix this and support arbitrary GGUF models.

I'm trying to build the dev branch, but I get this error even though I use update-runtime.sh to ensure a fresh install:

  Call Stack (most recent call first):
    /tmp/pip-build-env-4e68xd8m/overlay/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
    CMakeLists.txt:67 (find_package)

  -- Found Torch: /tmp/pip-build-env-4e68xd8m/overlay/lib/python3.11/site-packages/torch/lib/libtorch.so
  -- CUDA supported arches: 6.1;7.0;7.5;8.0;8.6;8.9;9.0
  -- CUDA target arches: 70-real
  -- Punica target arches:
  CMake Warning at CMakeLists.txt:275 (message):
    Unable to create _punica_C target because none of the requested
    architectures (70-real) are supported, i.e. >= 8.0

  -- Hadamard target arches: 70-real
  -- Enabling C extension.
  -- Enabling moe extension.
  -- Configuring done (9.7s)
  -- Generating done (0.0s)
  -- Build files have been written to: /tmp/tmps7yh1wlm.build-temp
  [1/3] Building CXX object CMakeFiles/_moe_C.dir/kernels/moe/moe_ops.cpp.o
  [2/3] Building CUDA object CMakeFiles/_moe_C.dir/kernels/moe/softmax.cu.o
  [3/3] Linking CXX shared module /tmp/tmpzdwd33j5.build-lib/aphrodite/_moe_C.cpython-311-x86_64-linux-gnu.so
  ninja: error: unknown target '_punica_C'

sgsdxzy commented 6 months ago

@bash99 you can try export APHRODITE_INSTALL_PUNICA_KERNELS=0 before running update-runtime.sh. @AlpinDale we should probably auto-disable this flag for unsupported archs.

AlpinDale commented 6 months ago

We unfortunately had the install condition for the punica and hadamard kernels using a comparison sign facing the wrong way. Fixed in the latest commit to dev.

bash99 commented 6 months ago

We unfortunately had the install condition for the punica and hadamard kernels using a comparison sign facing the wrong way. Fixed in the latest commit to dev.

I've tried it, and it works, but I only got halfway there: the converted PTH model doesn't work with multiple GPUs, and my target model, Qwen1.5-72B, cannot fit on one GPU (V100 32G).

I used the procedure below. First, I tested with Qwen1.5-4B, after downloading the GGUF and Qwen1.5-4B-Chat with:

wget -c "https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GGUF/resolve/main/qwen1_5-4b-chat-q5_k_m.gguf?download=true" -O qwen1_5-4b-chat-q5_k_m.gguf
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen1.5-4B-Chat
python ../src/aphrodite-engine/examples/gguf_to_torch.py --input qwen1_5-4b-chat-q5_k_m.gguf --output qwen1_5-4b_gguf_ap_engine --config-path Qwen1.5-4B-Chat
cd Qwen1.5-4B-Chat
cp tokenizer_config.json tokenizer.json vocab.json ../qwen1_5-4b_gguf_ap_engine/
cd ..

Then I ran aphrodite-engine with it:

CUDA_VISIBLE_DEVICES=0 python -m aphrodite.endpoints.openai.api_server --disable-log-requests --served-model-name Qwen1.5-4B-Chat --trust-remote-code --kv-cache-dtype fp8 -q gguf --model ./qwen1_5-4b_gguf_ap_engine/ --port 7868 --max-model-len 8192 --dtype half

It works great; a test with an OpenAI API client is OK.
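For completeness, the client-side check is just a chat completion against the local endpoint, roughly like the sketch below (assuming the openai>=1.0 Python package; the port and served model name come from the command above, and the api_key value is a placeholder since that launch didn't pass --api-keys):

# Hedged sketch of a client-side smoke test against the local OpenAI-compatible
# endpoint started above. Port, model name, and key must match your server flags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7868/v1", api_key="sk-example")
resp = client.chat.completions.create(
    model="Qwen1.5-4B-Chat",  # value passed to --served-model-name above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)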

But when doing the same with Qwen1.5-72B, I need to run it on multiple GPUs, so I ran it with -tp 2 (my server has 4x V100 32G; the Qwen1.5-72B Q5_K_M GGUF is 48G):

CUDA_VISIBLE_DEVICES=2,3 python -m aphrodite.endpoints.openai.api_server --disable-log-requests --served-model-name Qwen1.5-72B-Chat --trust-remote-code --kv-cache-dtype fp8 -q gguf --model ./qwen1_5-72b_gguf_ap_engine/ --port 7865 -gmu 0.98 --max-model-len 4096 --dtype half -tp 2

I got this error

INFO:     Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
(RayWorkerAphrodite pid=303703) INFO:     Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerAphrodite pid=303703) INFO:     Using XFormers backend.
INFO:     Aphrodite is using nccl==2.19.3
(RayWorkerAphrodite pid=303703) INFO:     Aphrodite is using nccl==2.19.3
INFO:     NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
INFO:     reading GPU P2P access cache from /DaTa/.local/home/hai.li/.config/aphrodite/gpu_p2p_access_cache_for_2,3.json
(RayWorkerAphrodite pid=303703) INFO:     NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerAphrodite pid=303703) INFO:     reading GPU P2P access cache from /DaTa/.local/home/hai.li/.config/aphrodite/gpu_p2p_access_cache_for_2,3.json
Traceback (most recent call last):
...
  File "/DaTa/.local/home/hai.li/dl/src/aphrodite-engine/aphrodite/modeling/loader.py", line 104, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/DaTa/.local/home/hai.li/dl/src/aphrodite-engine/aphrodite/modeling/models/qwen2.py", line 417, in load_weights
    weight_loader(param, loaded_weight)
  File "/DaTa/.local/home/hai.li/dl/src/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 277, in weight_loader                     assert param_data.shape == loaded_weight.shape
AssertionError
(RayWorkerAphrodite pid=303703) ERROR:    Error executing method load_model. This might cause deadlock in distributed execution.
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I've tried the same on the Qwen-4B model with -tp 2 and got the same error:

  File "/DaTa/.local/home/hai.li/dl/src/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 277, in weight_loader                     assert param_data.shape == loaded_weight.shape
AssertionError

Anyway, thanks for your great work.

AlpinDale commented 6 months ago

@sgsdxzy seems to me like an issue with parallelizing the lm_head. Does your PR fix this?

sgsdxzy commented 6 months ago

No, it seems to be a bug somewhere else. In the meantime you can use GPTQ/AWQ/exl2 quants of the same model.

bash99 commented 6 months ago

No, it seems to be a bug somewhere else. In the meantime you can use GPTQ/AWQ/exl2 quants of the same model.

It seems to be some bug related to Qwen and multi-GPU.

huggingface-cli download LoneStriker/Qwen1.5-72B-Chat-5.0bpw-h6-exl2 --local-dir Qwen1.5-72B-Chat-5.0bpw-h6-exl2
CUDA_VISIBLE_DEVICES=2,3 python -m aphrodite.endpoints.openai.api_server --disable-log-requests --served-model-name Qwen1.5-72B-Chat --trust-remote-code --kv-cache-dtype fp8 -q exl2 --model Qwen1.5-72B-Chat-5.0bpw-h6-exl2 --port 7865 -gmu 0.98 --max-model-len 8192 --dtype half -tp 2

I got the same error:

WARNING:  Casting torch.bfloat16 to torch.float16.
WARNING:  exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
...
  File "/DaTa/.local/home/hai.li/dl/src/aphrodite-engine/aphrodite/modeling/layers/linear.py", line 277, in weight_loader
    assert param_data.shape == loaded_weight.shape
AssertionError

And the same error for Qwen-4B, which works without -tp 2.

sgsdxzy commented 6 months ago

I seem to have found what's wrong. Currently, quants without merged weights (exl2, GGUF) combined with models that have linear bias (Qwen) are broken for tp>1. We are working on a fix. GPTQ and AWQ should work; I personally use the AWQ 4-bit for Qwen-1.5-72B-Chat.
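A toy illustration of how that surfaces as the assert in linear.py (not aphrodite's actual loader, just the shape bookkeeping): with tp=2 each rank allocates only its shard of a column-parallel bias, while the unmerged GGUF/exl2 checkpoint hands the loader the full tensor.

# Hedged illustration only: the sizes are made up to show the shape mismatch.
import torch

out_features, tp_size = 4096, 2
param_data = torch.empty(out_features // tp_size)  # this rank's bias shard: shape [2048]
loaded_weight = torch.empty(out_features)          # full bias from the gguf/exl2 file: shape [4096]

print(param_data.shape, loaded_weight.shape)       # torch.Size([2048]) torch.Size([4096])
print(param_data.shape == loaded_weight.shape)     # False -> the assert in weight_loader trips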

bash99 commented 6 months ago

No, it seems to be a bug somewhere else. In the meantime you can use GPTQ/AWQ/exl2 quants of the same model.

I've tried the GPTQ model made by LoneStriker, Qwen1.5-72B-Chat-GPTQ (4bit-32g), which doesn't crash, but the output quality cannot match the Q5_K_M GGUF served by the llama.cpp server.

I'll stick with llama.cpp for a while, even though it lacks value-cache quantization, so I only get a 6144 context length out of 64G of VRAM, and it is slower than aphrodite-engine.

bash99 commented 6 months ago

I seem to have found what's wrong. Currently, quants without merged weights (exl2, GGUF) combined with models that have linear bias (Qwen) are broken for tp>1. We are working on a fix. GPTQ and AWQ should work; I personally use the AWQ 4-bit for Qwen-1.5-72B-Chat.

Being GPU-poor, I only have old V100s, which are not supported by AWQ. Thanks for your efforts.

sgsdxzy commented 6 months ago

Yeah, the Q5_K_M GGUF is 5.7 bpw, while the GPTQ 4-bit g32 is 4.625 bpw.
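Back-of-the-envelope check of those figures (assuming a 16-bit scale and a 4-bit zero point per group of 32 for GPTQ; the ~5.7 bpw for Q5_K_M is the commonly quoted average for llama.cpp's mixed 5/6-bit k-quant):

# Hedged arithmetic sketch, not a spec: 4-bit weights plus per-group metadata.
gptq_bits, group_size, scale_bits, zero_bits = 4, 32, 16, 4
gptq_bpw = gptq_bits + (scale_bits + zero_bits) / group_size
print(gptq_bpw)  # 4.625 bits per weight, matching the figure above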

bash99 commented 6 months ago

Yeah, the Q5_K_M GGUF is 5.7 bpw, while the GPTQ 4-bit g32 is 4.625 bpw.

But GPTQ 4-bit has a bigger impact on instruction following: I have a prompt that works fine on the 32B Q5_K_M, but with Qwen-72B GPTQ 4-bit g32 the AI starts speaking for me after a few turns of conversation.

bash99 commented 6 months ago

Yeah, the Q5_K_M GGUF is 5.7 bpw, while the GPTQ 4-bit g32 is 4.625 bpw.

But GPTQ 4-bit has a bigger impact on instruction following: I have a prompt that works fine on the 32B Q5_K_M, but with Qwen-72B GPTQ 4-bit g32 the AI starts speaking for me after a few turns of conversation.

Update: it seems to be some problem with LoneStriker's Qwen-72B GPTQ 4bit-g32 quant or with aphrodite-engine. I've repeated my test with Qwen/Qwen1.5-72B-Chat-GPTQ-Int4, and everything is OK.

sgsdxzy commented 6 months ago

GGUF/exl2 + Qwen + multi-GPU should be fixed in the latest dev; please test whether it works for you.

bash99 commented 5 months ago

GGUF/exl2 + Qwen + multi-GPU should be fixed in the latest dev; please test whether it works for you.

Yes, it works now.

Thank you for your great work.