robertgshaw2-neuralmagic opened 8 months ago
@rib-2 Thanks for this! Unfortunately, it doesn't work on my machine (8xA100), presumably because it's designed for only one GPU?
alyssavance@7e72bd4e-02:/scratch/brr$ python3 marlin/conversion/convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-chat" --do-generation
Loading gptq model...
generation_config.json: 100%|█████████████████████████████████████████████████████| 137/137 [00:00<00:00, 987kB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████| 727/727 [00:00<00:00, 7.70MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 41.1MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 64.4MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████| 411/411 [00:00<00:00, 4.56MB/s]
Validating compatibility...
Converting model...
--- Converting Module: model.layers.0.self_attn.k_proj
Traceback (most recent call last):
  File "/scratch/brr/marlin/conversion/convert.py", line 143, in <module>
    model = convert_model(model).to("cpu")
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/brr/marlin/conversion/convert.py", line 80, in convert_model
    new_module.pack(linear_module, scales=copy.deepcopy(module.scales.data.t()))
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/marlin/__init__.py", line 117, in pack
    w = torch.round(w / s).int()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:1!
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxyeacbfe'>
_warnings.warn(warn_message, ResourceWarning)
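The failing line is an elementwise op between a weight and a scale tensor that ended up on different GPUs. A minimal sketch of a defensive version, assuming PyTorch; safe_round_div is a hypothetical guard written for illustration, not the upstream marlin code:

```python
import torch

# marlin's pack() computes torch.round(w / s), which raises
# "Expected all tensors to be on the same device" if the GPTQ checkpoint
# was sharded across GPUs. Aligning devices before the op avoids that.
def safe_round_div(w: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    if w.device != s.device:
        w = w.to(s.device)  # move the weight onto the scale's device
    return torch.round(w / s).int()

# CPU-only illustration: 4.0 / 2.0 rounds to 2.
print(safe_round_div(torch.tensor([4.0]), torch.tensor([2.0])))
```

On a single visible GPU the two tensors are always co-located, which is why restricting the process to one device also makes the original code work.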
@rosario-purple just set CUDA_VISIBLE_DEVICES=0; you don't need multiple GPUs for this
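Concretely, the suggested workaround applied to the command from the report above would look something like this (a sketch; the script path and flags are taken from the original invocation):

```shell
# Expose only GPU 0 to the process so every tensor lands on one device.
CUDA_VISIBLE_DEVICES=0 python3 marlin/conversion/convert.py \
    --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" \
    --save-path "./marlin-chat" \
    --do-generation
```

Setting the variable inline scopes it to this one invocation; other processes on the machine still see all eight GPUs.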
Added a simple example of loading a GPTQ model from the HF hub into Marlin format.
@efrantar