HanGuo97 / flute

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
https://arxiv.org/abs/2407.10960
Apache License 2.0

Only CUDA devices are supported, but got: {device} ({device.type}) #2

Closed · LiMa-cas closed this 4 months ago

LiMa-cas commented 4 months ago

Hi, after installing with "pip install flute-kernel", I ran:

    CUDA_VISIBLE_DEVICES=0 python -m flute.integrations.base --pretrained_model_name_or_path /extra_data/llama/Meta-Llama-3-8B-Instruct --save_directory /extra_data/llama/Meta-Llama-3-8B-Instruct-Flute --num_bits 4 --group_size 128

It printed the warnings below and then stopped. Please tell me how to solve this. I am using an A6000 (48 GB).

    Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 16.71it/s]
    /extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/integrations/base.py:46: UserWarning: Quantization always happen on 1st GPU
      warnings.warn(f"Quantization always happen on 1st GPU")
    /extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/utils.py:51: UserWarning: Only CUDA devices are supported, but got: cpu (cpu)
      warnings.warn(f"Only CUDA devices are supported, but got: {device} ({device.type})")

HanGuo97 commented 4 months ago

Hi, thanks for trying it!

This is a benign warning; feel free to ignore it. (It was originally added for a different reason.) If it seems to stall, that is probably just because the quantization itself takes some time.

LiMa-cas commented 4 months ago

But the process quit...

HanGuo97 commented 4 months ago

Do you have any error message? Without more information, my a priori guess is that you hit a CPU OOM. (But you are only quantizing the 8B model, so that would be a bit odd.)

LiMa-cas commented 4 months ago

In that case, how can I put it on the GPU? I thought it was running on the GPU. No messages were printed; it just quit after the warnings.

HanGuo97 commented 4 months ago

Then maybe it actually finished successfully? (Try listing the directory you specified.)

Let me explain a bit of what is going on behind the scenes.

To prepare the model for FLUTE, we need to quantize it and apply some FLUTE-specific packing. To avoid GPU OOM, we keep the model on the CPU first and then, layer by layer, move each layer to the GPU and quantize it there. You could put the whole model on the GPU instead, but we made this choice so that it also works for 70B models.
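Roughly, the pattern looks like the following (a simplified sketch of the idea, not the actual FLUTE code; quantize_and_pack is a placeholder for the FLUTE-specific quantization and packing step):

    # Sketch of offloaded, layer-by-layer quantization (illustrative only).
    import torch

    def quantize_layer_by_layer(model, quantize_and_pack, device="cuda"):
        # The full model stays on the CPU; only one transformer block at a
        # time is resident on the GPU while it is being quantized.
        for layer in model.model.layers:
            layer.to(device)           # move one block to the GPU
            quantize_and_pack(layer)   # quantize + pack it in place
            layer.to("cpu")            # send it back to keep GPU memory low
            torch.cuda.empty_cache()
        return model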

LiMa-cas commented 4 months ago

Yes, it finished, but there is no tokenizer in the save directory, so I could not use the PPL script to get the perplexity result.

HanGuo97 commented 4 months ago

Ah yeah, good catch! We will fix that once we have the Learned Normal Float Quantization code pushed into the codebase.

In the meantime, a simple workaround is to pass --tokenizer /extra_data/llama/Meta-Llama-3-8B-Instruct.
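For example, the full command would look something like this (the same flags as before, plus the tokenizer path):

    CUDA_VISIBLE_DEVICES=0 python -m flute.integrations.base \
        --pretrained_model_name_or_path /extra_data/llama/Meta-Llama-3-8B-Instruct \
        --save_directory /extra_data/llama/Meta-Llama-3-8B-Instruct-Flute \
        --num_bits 4 \
        --group_size 128 \
        --tokenizer /extra_data/llama/Meta-Llama-3-8B-Instruct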

LiMa-cas commented 4 months ago

Thanks, but even if I copy tokenizer.json / tokenizer_config.json from Meta-Llama-3-8B-Instruct, there are still errors:

    Traceback (most recent call last):
      File "/extra_data/datasets/evaluate/ppl_eval.py", line 72, in <module>
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="cuda")  # float16
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
        return model_class.from_pretrained(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
        ) = cls._load_pretrained_model(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
        new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
        set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
      File "/extra_data/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
        raise ValueError(
    ValueError: Trying to set a tensor of shape torch.Size([1024, 14336]) in "weight" (which has shape torch.Size([4096, 14336])), this look incorrect.

HanGuo97 commented 4 months ago

Unfortunately, we don't support "loading" a quantized model in HF (mostly because we target inference on other platforms such as vLLM). That said, there is a simple workaround we used internally for prototyping: you can quantize the model using the Python API and use it in the same Python session. The tricky part is making sure the model loads on the GPU rather than the CPU by default.

@radi-cho should have the code snippet. I'm away from my laptop right now (it's midnight in my timezone), but I can send you the code in the morning.

LiMa-cas commented 4 months ago

So how can I get the PPL on WikiText-2? Thanks!

HanGuo97 commented 4 months ago

If you use the Python API directly, I believe it should work. For example,

    # Quantize in the same Python session instead of via the CLI. The imports
    # below are what this snippet assumes (prepare_model_flute is assumed to
    # live in flute.integrations.base); pretrained_model_name_or_path,
    # torch_dtype, num_bits, group_size, and fake are set elsewhere.
    from transformers import AutoModelForCausalLM, LlamaForCausalLM, Gemma2ForCausalLM
    from flute.integrations.base import prepare_model_flute

    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path,
        device_map="cpu",  # <-- replace this with cuda/auto
        torch_dtype=torch_dtype)

    if isinstance(model, (LlamaForCausalLM, Gemma2ForCausalLM)):
        prepare_model_flute(
            module=model.model.layers,  # quantize the transformer blocks in place
            num_bits=num_bits,          # e.g. 4
            group_size=group_size,      # e.g. 128
            fake=fake)                  # fake-quantization flag
    else:
        raise NotImplementedError
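After prepare_model_flute returns, the quantized model can be used right away in the same session. A quick sanity check could look like this (illustrative; it assumes the model was loaded with device_map="cuda" above and that a tokenizer has been loaded as well):

    # Illustrative follow-up, not from the original snippet: run a short
    # generation with the freshly quantized model in the same session.
    import torch

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))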

HanGuo97 commented 4 months ago

Closing the issue as I assume this is fixed. Feel free to reopen if you still need help!

radi-cho commented 4 months ago

@LiMa-cas For perplexity calculation, you can follow the Hugging Face perplexity example. It should work the same way for a quantized model as for a dense one.
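In short, that example boils down to a sliding-window negative log-likelihood computation over the WikiText-2 test set, roughly along these lines (a simplified sketch; it assumes model and tokenizer are already loaded in the current session, and the max_length/stride values are placeholders):

    # Sliding-window perplexity on WikiText-2, following the Hugging Face
    # guide (illustrative sketch; assumes `model` and `tokenizer` exist).
    import torch
    from datasets import load_dataset

    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

    max_length = 2048  # evaluation context window (placeholder)
    stride = 512       # how far the window advances each step
    seq_len = encodings.input_ids.size(1)

    nlls, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # only the new tokens are scored
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask tokens scored in earlier windows

        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss * trg_len)

        prev_end = end
        if end == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
    print(f"WikiText-2 perplexity: {ppl.item():.3f}")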