ModelCloud / GPTQModel

An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization).
Apache License 2.0

[BUG] The loading of sharded checkpoints with BitBLAS is currently not supported. #252

Closed: ChenMnZ closed this issue 1 month ago

ChenMnZ commented 1 month ago

Hello, thanks for your great work making AutoGPTQ more usable.

I want to load a GPTQ model and run inference with the BitBLAS backend.

The corresponding model loading code is:

model = GPTQModel.from_quantized(args.model, device_map='auto', torch_dtype=torch.float16, quantize_config=quant_config, backend=get_backend('BITBLAS'))
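
For reference, a self-contained form of that call might look like the sketch below. The get_backend import path and the model path are assumptions (they may differ across GPTQModel versions), and the quantize config is assumed to be picked up from the saved model directory.

import torch
from gptqmodel import GPTQModel
from gptqmodel.utils import get_backend  # assumed import path; may differ by version

# Hypothetical path to an already-quantized GPTQ model (local dir or Hub id).
model_path = "path/to/gptq-4bit-model"

model = GPTQModel.from_quantized(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    backend=get_backend("BITBLAS"),  # select the BitBLAS inference kernel
)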

This works for small models (4-bit 7B or 2-bit 13B), but it fails for larger models such as 4-bit 13B, which are saved as sharded checkpoints. The raised error (shown in a screenshot) matches the issue title: the loading of sharded checkpoints with BitBLAS is currently not supported.
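
The difference between the working and failing cases is how the checkpoint is saved: larger models are split into several safetensors shards plus an index file, which is the layout the BitBLAS path cannot load. A quick way to check which case a model directory falls into (the directory path is hypothetical; the file names follow the Hugging Face convention):

import os

model_dir = "path/to/gptq-4bit-13b"  # hypothetical model directory

# Sharded checkpoints ship an index json that maps every tensor to the shard
# file (e.g. model-00001-of-00003.safetensors) that contains it; single-file
# checkpoints have just model.safetensors.
is_sharded = os.path.isfile(os.path.join(model_dir, "model.safetensors.index.json"))
print("sharded checkpoint:", is_sharded)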

Qubitium commented 1 month ago

@ChenMnZ We will check again if BITBLAS has recently added sharding support.

Qubitium commented 1 month ago

Fixed by https://github.com/ModelCloud/GPTQModel/pull/270

Qubitium commented 1 month ago

Need to re-open this. The fix PR was reverted because the unit test was incorrectly passing when it should have failed for sharding + BitBLAS.

LeiWang1999 commented 1 month ago

When initializing a BitBLAS quant linear, fixed values of N and K are required. With sharded checkpoints, however, only the shard's shape is dispatched to the runtime kernel. It might be beneficial to draw on vLLM, which makes initialization aware of shard information.
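
To illustrate the constraint (a hypothetical sketch, not GPTQModel's or BitBLAS's actual classes): the quant linear needs the full N and K when it is constructed, while each shard file only carries a slice of the weight, so the loader has to resolve the full shape before building the layer.

import torch

class BitBLASStyleQuantLinear(torch.nn.Module):
    # Hypothetical stand-in for a BitBLAS-backed quant linear layer.
    def __init__(self, in_features: int, out_features: int, bits: int = 4):
        super().__init__()
        # N (out_features) and K (in_features) must be fixed here so a
        # shape-specialized BitBLAS kernel can be chosen for this layer.
        self.in_features = in_features
        self.out_features = out_features
        self.bits = bits
        packed_k = in_features * bits // 32  # 4-bit weights packed into int32
        self.register_buffer(
            "qweight", torch.zeros(out_features, packed_k, dtype=torch.int32)
        )

# With a sharded checkpoint, each shard file only holds part of qweight, so the
# loader must recover the full (N, K) from the shard index before constructing
# the layer, similar to how vLLM makes its weight loading aware of shard
# information at initialization time.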

Qubitium commented 1 month ago

Fixed with PR #316