Vahe1994 / AQLM

Official Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression https://arxiv.org/abs/2405.14852
Apache License 2.0

Unsuccessful model quantization using the main.py method #129

Open Jun-Howie opened 3 weeks ago

Jun-Howie commented 3 weeks ago

Log

(aqlm) root@f9f90a551b02:~/xinglin-data/AQLM# bash train.sh
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.

============ Load model... ============
Loading checkpoint shards: 100%|██████████████████████████████| 17/17 [00:01<00:00, 11.35it/s]
Loading pretrained model ...
Model loaded successfully ...

============ Quantizing model... ============
Loading data ...
/root/xinglin-data/AQLM/src/datautils.py:219: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  data = torch.load(name)[:nsamples]
Loaded data from /root/xinglin-data/AQLM/train.pt; len(data)=1024 sequences

Starting AQ quantization ...
catching layer inputs from data
train.sh: line 23: 28722 Killed python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --val_size=0 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=32 --relative_mse_tolerance=0.01 --finetune_batch_size=32 --finetune_max_epochs=10 --finetune_early_stop=3 --finetune_keep_best --local_batch_size=1 --offload_activations --wandb --resume --save $SAVE_PATH

train.sh

export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=/root/xinglin-data/model/Qwen/Qwen2.5-32B-Instruct
export DATASET_PATH=/root/xinglin-data/AQLM/train.pt
export SAVE_PATH=/root/xinglin-data/Qwen2
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH \
  --nsamples=1024 \
  --val_size=0 \
  --num_codebooks=1 \
  --nbits_per_codebook=16 \
  --in_group_size=32 \
  --relative_mse_tolerance=0.01 \
  --finetune_batch_size=32 \
  --finetune_max_epochs=10 \
  --finetune_early_stop=3 \
  --finetune_keep_best \
  --local_batch_size=1 \
  --offload_activations \
  --wandb \
  --resume \
  --save $SAVE_PATH

ArtemBiliksin commented 3 weeks ago

Hello, @Jun-Howie!

Most likely you did not have enough RAM.

You are using nsamples=1024, the Qwen2.5-32B-Instruct model, and the --offload_activations flag. Using --offload_activations means that the inps tensor (of shape [1024, 4096, 5120]) and the outs tensor (of shape [1024, 4096, 5120]) will be stored in RAM. Here 1024 is the value of nsamples, 4096 is the default model_seqlen value, and 5120 is the hidden_size of the Qwen2.5-32B-Instruct model.

Let's calculate how much RAM you will need for inps and outs. In your case, the data type of the inps and outs tensors is bfloat16, i.e. 2 bytes per element. Hence, each tensor takes 1024 * 4096 * 5120 * 2 / 1024 / 1024 = 40960 MB.

In total we get 40960 + 40960 = 81920 MB. Since you use the --offload_activations flag, this memory is allocated in RAM.
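
For reference, here is a small sketch of that arithmetic (the helper activation_mem_mb is just illustrative; the shapes and dtype are the ones listed above):

# Back-of-the-envelope estimate of RAM used by the offloaded activations.
# Shapes/dtype taken from the discussion above (Qwen2.5-32B-Instruct, bfloat16).
def activation_mem_mb(nsamples, seqlen=4096, hidden=5120, bytes_per_el=2):
    # Size in MB of one [nsamples, seqlen, hidden] activation tensor.
    return nsamples * seqlen * hidden * bytes_per_el / 1024 / 1024

for n in (1024, 512):
    # inps and outs are the same size, hence the factor of 2.
    print(f"nsamples={n}: inps + outs ~ {2 * activation_mem_mb(n):.0f} MB")
# nsamples=1024: inps + outs ~ 81920 MB
# nsamples=512:  inps + outs ~ 40960 MB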

You can get around this problem by using a smaller nsamples (for example, nsamples=512, which halves the estimate above) or by spreading the run across multiple GPU devices without the --offload_activations flag.

Jun-Howie commented 3 weeks ago

Thank you! I used nsamples=512 and it works for me!

Jun-Howie commented 2 weeks ago

I have run into a new problem while quantizing Qwen2.5-7B on an NVIDIA 4090.

(aqlm) root@Harvey-4090:/home/harvey/PM/AQLM# bash train.sh
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.

============ Load model... ============
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.97it/s]
Loading pretrained model ...
Model loaded successfully ...

============ Quantizing model... ============
Loading data ...
/home/harvey/PM/AQLM/src/datautils.py:219: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  data = torch.load(name)[:nsamples]
Loaded data from /home/harvey/PM/data/4096.pth; len(data)=256 sequences

Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/harvey/PM/AQLM/main.py", line 918, in <module>
    main()
  File "/home/harvey/PM/AQLM/main.py", line 892, in main
    quantize_model(model, args)
  File "/home/harvey/PM/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/root/miniconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/harvey/PM/AQLM/main.py", line 168, in quantize_aq
    inps, forward_args = get_inps(model, data, args.model_seqlen, args.devices, args.offload_activations)
  File "/root/miniconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/harvey/PM/AQLM/main.py", line 105, in get_inps
    inps = [
  File "/home/harvey/PM/AQLM/main.py", line 106, in <listcomp>
    torch.zeros(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/harvey/PM/AQLM/wandb/offline-run-20241009_213235-pykhgwzr
wandb: Find logs at: wandb/offline-run-20241009_213235-pykhgwzr/logs
(aqlm) root@Harvey-4090:/home/harvey/PM/AQLM#

train.sh

export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=/home/harvey/PM/modelscope/hub/qwen/Qwen2.5-7B-Instruct
export DATASET_PATH=/home/harvey/PM/data/4096.pth
export SAVE_PATH=/home/harvey/PM/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH \
  --nsamples=256 \
  --val_size=32 \
  --num_codebooks=1 \
  --nbits_per_codebook=16 \
  --in_group_size=8 \
  --relative_mse_tolerance=0.01 \
  --finetune_batch_size=32 \
  --finetune_max_epochs=10 \
  --finetune_early_stop=3 \
  --finetune_keep_best \
  --local_batch_size=1 \
  --offload_activations \
  --wandb \
  --resume \
  --save $SAVE_PATH

ArtemBiliksin commented 2 weeks ago

Hi, @Jun-Howie !

This looks very suspicious. More information is needed to understand what the problem is. Can you send a screenshot of nvidia-smi or nvtop taken while you run bash train.sh again?

Jun-Howie commented 2 weeks ago

[screenshot attached]

The nvtop command reports "No GPU to monitor". I have a feeling that WSL2 is not recognizing the GPU correctly, although nvitop and nvidia-smi can read the GPU status correctly. I may need to fix my system environment configuration first. Thanks for your help!
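
For what it's worth, a quick way to check whether PyTorch itself sees the GPU under WSL2 is a short snippet like this (a minimal sketch, independent of AQLM):

import torch

# Minimal check of what PyTorch sees under WSL2 (independent of AQLM).
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info(0)  # free/total GPU memory, in bytes
    print(f"Free/total GPU memory: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")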

ArtemBiliksin commented 2 weeks ago

Hi, @Jun-Howie ! I see what the problem is.

You're using WSL. You may find these sources helpful:

Line 106 of main.py creates an inps tensor of shape [256, 4096, 3584] (256 is the nsamples=256 value you specified, 4096 is the model_seqlen value, and 3584 is the hidden_size of the Qwen2.5-7B model) with data type bfloat16 and pin_memory=True, because you used the --offload_activations flag. In RAM, this tensor takes 256 * 4096 * 3584 * 2 / 1024 / 1024 = 7168 MB. However, since pin_memory=True, the buffer is allocated as page-locked (pinned) host memory through the CUDA driver, and it is this pinned allocation that fails with the CUDA out-of-memory error.

You can run a simple test in the same environment, for example:

import torch

# Pinned (page-locked) allocation of the same shape/dtype as main.py line 106.
x = torch.zeros(256, 4096, 3584, dtype=torch.bfloat16, pin_memory=True)
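
If that allocation fails, a slightly extended check (a sketch, not code from the repo; the names shape, pageable, and pinned are just illustrative) can help confirm that it is the pinned allocation specifically, rather than ordinary host RAM, that is running out:

import torch

shape = (256, 4096, 3584)

# Plain (pageable) host allocation: needs only regular RAM.
pageable = torch.zeros(*shape, dtype=torch.bfloat16)
print("pageable allocation OK:", tuple(pageable.shape))

# Pinned allocation: goes through the CUDA driver; under WSL2 the amount of
# pinnable memory is typically much smaller than total RAM.
try:
    pinned = torch.zeros(*shape, dtype=torch.bfloat16, pin_memory=True)
    print("pinned allocation OK:", pinned.is_pinned())
except RuntimeError as e:
    print("pinned allocation failed:", e)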