Hi, thanks for trying it!
This is a benign warning; feel free to ignore it. (It was originally added to warn about a different situation.) It probably only looks stalled because quantization takes some time.
but it quit processing...
Do you have any error messages? Without knowing anything else, my a priori guess is that you got a "CPU" OOM. (But you are just quantizing the 8B model, so that's a bit odd.)
In this case, how could I put it on the GPU? I thought it was running on the GPU. No messages were printed; it just quit after the warnings.
Then maybe it actually finished successfully? (Try listing the directory you specified.)
Let me explain a bit what's going on behind the scenes.
In order to prepare the model for FLUTE, we need to quantize it and apply some FLUTE-specific packing. To avoid GPU OOM, we put the model on the CPU first. Then, layer by layer, we move each layer to the GPU and quantize it. You could put the whole model on the GPU, but we made this choice so that it also works well with the 70B model.
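Very roughly, the loop looks like the sketch below (a simplification, not FLUTE's actual code; quantize_and_pack_layer_ is a hypothetical stand-in for the real quantization + packing step):

import torch
from transformers import AutoModelForCausalLM

def quantize_and_pack_layer_(layer):
    # Hypothetical placeholder for FLUTE's per-layer quantization + packing (in place).
    pass

# Load on the CPU first so the full-precision weights never have to fit on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "/extra_data/llama/Meta-Llama-3-8B-Instruct",
    device_map="cpu",
    torch_dtype=torch.float16)

for layer in model.model.layers:
    layer.cuda()                      # move one decoder layer to the GPU
    quantize_and_pack_layer_(layer)   # quantize it there
    layer.cpu()                       # move it back so GPU memory stays bounded
    torch.cuda.empty_cache()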
Yes, it finished, but there is no tokenizer in it, so I could not use the ppl script to get the ppl result.
Ah yeah, good catch! We will fix that once we have the Learned Normal Float Quantization code pushed into the codebase.
In the meantime, a simple workaround is to pass --tokenizer /extra_data/llama/Meta-Llama-3-8B-Instruct.
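For example, if your ppl script loads its tokenizer through the Hugging Face API, pointing it at the original model directory should work (a minimal sketch, assuming transformers' AutoTokenizer):

from transformers import AutoTokenizer

# Load the tokenizer from the original model directory; the quantized
# save_directory does not contain tokenizer files yet.
tokenizer = AutoTokenizer.from_pretrained("/extra_data/llama/Meta-Llama-3-8B-Instruct")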
Thanks, but if I copy the tokenizer.json / tokenizer_config.json from Meta-Llama-3-8B-Instruct, there are still errors:
Traceback (most recent call last):
  File "/extra_data/datasets/evaluate/ppl_eval.py", line 72, in
Unfortunately, we don't support "loading" a quantized model in HF. (Mostly because we target inference on another platform like vLLM.) That being said, there is a simple workaround we used internally for prototyping: you can quantize the model using the Python API and use it in the same Python session. The tricky part is making sure the model loads on the GPU, not the CPU, by default.
@radi-cho should have the code snippet. I'm away from my laptop right now (it's midnight in my timezone), but I can send you the code in the morning.
So how can I get the ppl for wikitext2? Thanks!
If you use the Python API directly, I believe it should work. For example,
from transformers import AutoModelForCausalLM, LlamaForCausalLM, Gemma2ForCausalLM
from flute.integrations.base import prepare_model_flute

# pretrained_model_name_or_path, torch_dtype, num_bits, group_size, and fake are
# defined elsewhere (e.g., num_bits=4, group_size=128, fake=False).
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    device_map="cpu",  # <-- replace this with "cuda"/"auto" so the model loads on the GPU
    torch_dtype=torch_dtype)

if isinstance(model, (LlamaForCausalLM, Gemma2ForCausalLM)):
    # Quantizes the decoder layers in place; keep using `model` in the same session.
    prepare_model_flute(
        module=model.model.layers,
        num_bits=num_bits,
        group_size=group_size,
        fake=fake)
else:
    raise NotImplementedError
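From there, a standard wikitext-2 perplexity loop should work on the quantized model in the same session. A minimal sketch (assuming the Hugging Face datasets library and the tokenizer workaround above; not our official eval script):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/extra_data/llama/Meta-Llama-3-8B-Instruct")
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

model.eval()  # the quantized model from above, loaded with device_map="cuda"
seq_len = 2048
nlls = []
# Non-overlapping windows; the last partial window is dropped for simplicity.
for begin in range(0, encodings.input_ids.size(1) - seq_len, seq_len):
    input_ids = encodings.input_ids[:, begin:begin + seq_len].to(model.device)
    with torch.no_grad():
        # With labels=input_ids, HF models return the mean next-token NLL for the window.
        nlls.append(model(input_ids, labels=input_ids).loss.float())

ppl = torch.exp(torch.stack(nlls).mean())
print(f"wikitext-2 perplexity: {ppl.item():.3f}")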
Closing the issue as I assume this is fixed. Feel free to reopen if you still need help!
Hi, after I ran "pip install flute-kernel" and then "CUDA_VISIBLE_DEVICES=0 python -m flute.integrations.base --pretrained_model_name_or_path /extra_data/llama/Meta-Llama-3-8B-Instruct --save_directory /extra_data/llama/Meta-Llama-3-8B-Instruct-Flute --num_bits 4 --group_size 128",
I got the warnings below and the process stopped. Please tell me how to solve this. I am using an A6000-48G.
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 16.71it/s]
/extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/integrations/base.py:46: UserWarning: Quantization always happen on 1st GPU
  warnings.warn(f"Quantization always happen on 1st GPU")
/extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/utils.py:51: UserWarning: Only CUDA devices are supported, but got: cpu (cpu)
  warnings.warn(f"Only CUDA devices are supported, but got: {device} ({device.type})")