NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Weird Bug when QAT training with HfArgumentParser #51

Open ShadowTeamCN opened 3 months ago

ShadowTeamCN commented 3 months ago

Take the following code as a simple example:

# Imports added for completeness; ModelArguments, DataArguments,
# TrainingArguments, LoraArguments, LazySupervisedDataset and collate_fn
# come from my training script.
import random

import jsonlines
import torch
import tqdm
import transformers
import modelopt.torch.quantization as mtq
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Commenting out this parser block makes the quantization below succeed.
parser = transformers.HfArgumentParser(
    (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
)
(
    _model_args,
    _data_args,
    _training_args,
    _lora_args,
) = parser.parse_args_into_dataclasses()

path = '/home/tione/notebook/PretrainModelStore/Qwen1.5-7B-Chat/'
model = AutoModelForCausalLM.from_pretrained(
    path,
    attn_implementation='flash_attention_2',
    torch_dtype=torch.float16,
    device_map=None,
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
PAD_TOKEN_ID = tokenizer.pad_token_id

# Sample roughly 1% of the corpus as calibration data.
calib_data = []
with jsonlines.open('cmdr_.jsonl') as reader:
    for item in tqdm.tqdm(reader):
        if random.random() < 0.01:
            calib_data.append(item)

calib_dataset = LazySupervisedDataset(calib_data, tokenizer, 2048, False)
calib_dataloader = DataLoader(calib_dataset, collate_fn=collate_fn, batch_size=1)

def calibrate_loop():
    # Forward passes only; modelopt collects activation statistics here.
    for data in calib_dataloader:
        model(**data)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, calibrate_loop)

When the HfArgumentParser block is uncommented, mtq.quantize raises an exception (the traceback was posted as an embedded image in the original issue). When the HfArgumentParser block is commented out, the quantization succeeds. My transformers version is 4.42.4 and modelopt is 0.15.0. Since the quantization code has no relationship with HfArgumentParser, I cannot figure out why this happens.

realAsma commented 3 months ago

This seems unrelated to quantization or modelopt. To test this hypothesis, can you try calling calibrate_loop() before the mtq.quantize step? I think you will see the same error when running calibrate_loop without any quantization.
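
For concreteness, a minimal sketch of that check, reusing the model, calibrate_loop, and mtq names from the snippet above:

# Run the calibration loop on its own first; if it already raises here,
# the failure is independent of modelopt.
calibrate_loop()

# Only reached if the plain forward passes succeed.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, calibrate_loop)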

ShadowTeamCN commented 3 months ago

> This seems unrelated to quantization or modelopt. To test this hypothesis, can you try calling calibrate_loop() before the mtq.quantize step? I think you will see the same error when running calibrate_loop without any quantization.

Yes, you are right; this truly has no relationship with modelopt, so I will close this issue. That said, it is still very weird.

ShadowTeamCN commented 3 months ago

> This seems unrelated to quantization or modelopt. To test this hypothesis, can you try calling calibrate_loop() before the mtq.quantize step? I think you will see the same error when running calibrate_loop without any quantization.
>
> Yes, you are right; this truly has no relationship with modelopt, so I will close this issue. That said, it is still very weird.

https://github.com/huggingface/transformers/issues/32021

I've found that the issue above is similar to mine, and I'm also using torchrun with a DeepSpeed ZeRO-3 config to launch my script. I'm wondering whether the official QAT (Quantization Aware Training) pipeline is compatible with DeepSpeed ZeRO-3.

realAsma commented 3 months ago

@ShadowTeamCN We have not yet tested the QAT example with the DeepSpeed backend. I will test it once I get a chance; however, that might be next week because of some other commitments.

In the meantime, can you please check whether QAT works for you with the accelerate backend?

ShadowTeamCN commented 3 months ago

> @ShadowTeamCN We have not yet tested the QAT example with the DeepSpeed backend. I will test it once I get a chance; however, that might be next week because of some other commitments.
>
> In the meantime, can you please check whether QAT works for you with the accelerate backend?

Certainly. I attempted to use Accelerate with DeepSpeed ZeRO-3 but encountered the same issues. I then switched to ZeRO-2 and it succeeded. Once the training process is complete, I plan to test the rest of the QAT pipeline.
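
Purely as an illustration, not something from this thread: if Accelerate is configured in code rather than through an accelerate config file, the ZeRO stage can be selected via accelerate's DeepSpeedPlugin; the values below are placeholders.

from accelerate import Accelerator, DeepSpeedPlugin

# Placeholder settings; ZeRO stage 2 keeps full parameters on each rank,
# avoiding the late weight materialization that ZeRO-3 performs.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)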

ShadowTeamCN commented 3 months ago

> @ShadowTeamCN We have not yet tested the QAT example with the DeepSpeed backend. I will test it once I get a chance; however, that might be next week because of some other commitments.
>
> In the meantime, can you please check whether QAT works for you with the accelerate backend?

And when I switch to ZeRO-2, I cannot successfully train a 70B-sized model due to restricted resources.

RalphMao commented 3 months ago

Hi @ShadowTeamCN,

This is because DeepSpeed initializes the weights at a late stage, right before training. The solutions we currently have are:

Option 1: Insert the calibration code in the trainer, after the DeepSpeed initialization and before checkpoint loading.

Option 2: Implement a post-completion hook for accelerator._prepare_deepspeed if you don't want to modify the HF transformers code. I will share more details later on.
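
A rough sketch of what Option 2 could look like; this is not an official API. The wrapper below monkey-patches the private Accelerator._prepare_deepspeed method, and run_calibration is a hypothetical hook named only for illustration.

from accelerate import Accelerator

# Keep a reference to the original (private) method.
_orig_prepare_deepspeed = Accelerator._prepare_deepspeed

def _prepare_deepspeed_with_calibration(self, *args):
    # Let DeepSpeed build and partition the engine first so the real
    # weights exist before calibration touches them.
    result = _orig_prepare_deepspeed(self, *args)
    # run_calibration() is a hypothetical hook you would provide; it could
    # call mtq.quantize(model, mtq.INT4_AWQ_CFG, calibrate_loop) here,
    # after DeepSpeed initialization and before training starts.
    # run_calibration()
    return result

Accelerator._prepare_deepspeed = _prepare_deepspeed_with_calibration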

ShadowTeamCN commented 3 months ago

> Hi @ShadowTeamCN,
>
> This is because DeepSpeed initializes the weights at a late stage, right before training. The solutions we currently have are:
>
> Option 1: Insert the calibration code in the trainer, after the DeepSpeed initialization and before checkpoint loading.
>
> Option 2: Implement a post-completion hook for accelerator._prepare_deepspeed if you don't want to modify the HF transformers code. I will share more details later on.

Thank you for the reply; both options work for me. I also found another approach myself: calling trainer.evaluate before the calibration loop, because evaluate initializes DeepSpeed. In the end, though, I still failed due to OOM. I don't know how much additional memory QAT needs, or whether the OOM was simply caused by incorrect usage on my side.
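
For reference, a minimal sketch of that workaround as I understand it, assuming trainer is the already-constructed HF Trainer wrapping the same model, and reusing calibrate_loop and mtq from the first snippet:

# Running evaluate() first forces the DeepSpeed engine to be built, so the
# weights are materialized before the calibration forward passes run.
trainer.evaluate()
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, calibrate_loop)
trainer.train()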

RalphMao commented 3 months ago

Using DeepSpeed together with modelopt causes a memory leak under some circumstances; we are still investigating.

RalphMao commented 3 months ago

@ShadowTeamCN Adding gc.collect() after every iteration fixes the memory leak.
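
A minimal sketch of applying that workaround in an HF Trainer run; the callback class name is made up, while TrainerCallback, on_step_end, and add_callback are standard transformers APIs.

import gc

from transformers import TrainerCallback

class GcAfterStepCallback(TrainerCallback):
    # Force a garbage-collection pass after every optimizer step to work
    # around the DeepSpeed + modelopt memory leak described above.
    def on_step_end(self, args, state, control, **kwargs):
        gc.collect()

# trainer.add_callback(GcAfterStepCallback())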