IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Apache License 2.0

added conversion script and example #1

Open robertgshaw2-neuralmagic opened 8 months ago

robertgshaw2-neuralmagic commented 8 months ago

Added a simple example that loads a GPTQ model from the HF hub and converts it into Marlin format.

@efrantar
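
For reference, the per-module conversion flow is roughly the sketch below. It assumes marlin.Layer(in_features, out_features, groupsize) and its pack(linear, scales) method (the pack call and scales transpose match the convert.py call visible in the traceback further down); is_gptq_linear, dequantize_to_linear, and replace_module are hypothetical helpers standing in for the real logic in conversion/convert.py.

import copy
import torch
import marlin

@torch.no_grad()
def convert_model(model):
    # Walk the module tree and replace every GPTQ quantized linear
    # with a marlin.Layer repacked from dequantized fp16 weights.
    for name, module in list(model.named_modules()):
        if not is_gptq_linear(module):           # hypothetical helper
            continue
        linear = dequantize_to_linear(module)    # hypothetical: fp16 nn.Linear
        new_module = marlin.Layer(
            linear.in_features, linear.out_features, groupsize=128
        )
        # GPTQ stores per-group scales; pack() takes them transposed,
        # matching the call in conversion/convert.py line 80 below.
        new_module.pack(linear, scales=copy.deepcopy(module.scales.data.t()))
        replace_module(model, name, new_module)  # hypothetical helper
    return model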

rosario-purple commented 8 months ago

@rib-2 Thanks for this! Unfortunately it doesn't work on my machine (8xA100), presumably because it's designed for only one GPU?

alyssavance@7e72bd4e-02:/scratch/brr$ python3 marlin/conversion/convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-chat" --do-generation
Loading gptq model...
generation_config.json: 100%|█████████████████████████████████████████████████████| 137/137 [00:00<00:00, 987kB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████| 727/727 [00:00<00:00, 7.70MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 41.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 64.4MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████| 411/411 [00:00<00:00, 4.56MB/s]
Validating compatibility...
Converting model...
--- Converting Module: model.layers.0.self_attn.k_proj
Traceback (most recent call last):
  File "/scratch/brr/marlin/conversion/convert.py", line 143, in <module>
    model = convert_model(model).to("cpu")
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/brr/marlin/conversion/convert.py", line 80, in convert_model
    new_module.pack(linear_module, scales=copy.deepcopy(module.scales.data.t()))
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/marlin/__init__.py", line 117, in pack
    w = torch.round(w / s).int()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:1!
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxyeacbfe'>
  _warnings.warn(warn_message, ResourceWarning)
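
For context: a cross-device error like this usually means the checkpoint was loaded with its layers sharded across all eight GPUs (e.g. via transformers' device_map="auto"), so the weights and scales handed to pack() live on different devices. A minimal sketch of pinning the load to a single device follows; whether convert.py actually loads through AutoModelForCausalLM and exposes device_map is an assumption, but the argument itself is standard transformers.

import torch
from transformers import AutoModelForCausalLM

# Sharded loading (device_map="auto") spreads layers across all visible
# GPUs, which is how tensors end up on cuda:7 and cuda:1 above. Pinning
# the whole model to one device avoids the mismatch during pack().
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map={"": 0},  # place everything on cuda:0
    torch_dtype=torch.float16,
)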
robertgshaw2-neuralmagic commented 8 months ago

@rosario-purple just set CUDA_VISIBLE_DEVICES=0; you don't need multiple GPUs for this.
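
E.g., rerunning the failing command from above pinned to a single GPU:

CUDA_VISIBLE_DEVICES=0 python3 marlin/conversion/convert.py \
    --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" \
    --save-path "./marlin-chat" \
    --do-generation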