PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Paddle/PaddleNLP llama 7B pretrain has a memory leak #68336

Open shang-mt opened 5 hours ago

shang-mt commented 5 hours ago

Describe the Bug

paddlepaddle-gpu 2.6.0.post117
paddlenlp: https://github.com/ZHUI/PaddleNLP, branch: sci/benckmark, commit id 20fe363530c0e3868414f65ec394124ffac6b9b2

Running a 4-GPU llama 7B pretrain on A100 with the versions above, there is a memory leak.

[screenshot attachment: host memory usage over time]

The configuration file llama/pretrain-llama_13b-pp4tp2sd2_stage1.json is as follows:

```json
{
  "model_name_or_path": "facebook/llama-7b",
  "tokenizer_name_or_path": "facebook/llama-7b",
  "input_dir": "/workspace",
  "output_dir": "/root/llama-7b",
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 256,
  "per_device_eval_batch_size": 64,
  "tensor_parallel_degree": 2,
  "pipeline_parallel_degree": 2,
  "pipeline_parallel_config": "disable_partial_send_recv",
  "sharding_parallel_degree": -1,
  "virtual_pp_degree": 1,
  "sharding": "stage1",
  "sequence_parallel": 1,
  "adam_beta1": 0.9,
  "adam_beta2": 0.95,
  "use_flash_attention": true,
  "use_fused_rms_norm": true,
  "use_fused_rope": true,
  "max_seq_length": 2048,
  "learning_rate": 1e-04,
  "initializer_range": 0.002,
  "min_learning_rate": 1e-05,
  "warmup_steps": 2000,
  "logging_steps": 1,
  "max_steps": 200000,
  "save_steps": 2000,
  "eval_steps": 2000,
  "weight_decay": 0.1,
  "max_grad_norm": 1.0,
  "amp_master_grad": 1,
  "fp16": true,
  "fp16_opt_level": "O2",
  "dataloader_num_workers": 1,
  "continue_training": 0,
  "do_train": true,
  "do_eval": true,
  "do_predict": true,
  "disable_tqdm": true,
  "recompute": false,
  "distributed_dataloader": 0,
  "recompute_granularity": "full",
  "save_total_limit": 2,
  "eval_accumulation_steps": 16
}
```
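One way to check whether host memory really keeps growing across steps is to log the training process's RSS periodically. Below is a minimal sketch, assuming `psutil` is installed; the hook point into the training loop and the helper name are hypothetical, not part of the PaddleNLP trainer:

```python
# Minimal sketch, not part of PaddleNLP: periodically log the training process's
# resident set size (RSS) to confirm whether host memory keeps growing.
import psutil

def log_host_rss(step: int) -> None:
    rss_gib = psutil.Process().memory_info().rss / 1024**3
    print(f"step {step}: host RSS = {rss_gib:.2f} GiB")

# Illustrative usage inside a training loop (hypothetical hook point):
# for step in range(max_steps):
#     train_step()
#     if step % logging_steps == 0:
#         log_host_rss(step)
```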

Additional Supplementary Information

No response

ZHUI commented 5 hours ago

Hi, the pretraining data is read via mmap, so a gradual increase in memory usage is expected; it will not exceed the total size of the dataset.
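For reference, a minimal sketch of the behavior being described, not PaddleNLP code; the file path and dtype are placeholders for a tokenized pretraining file:

```python
# Minimal sketch (not PaddleNLP code) of why resident memory grows with mmap-backed data:
# np.memmap maps the file lazily, so memory usage rises as more of the file is touched,
# but those pages are reclaimable page cache and are bounded by the file size.
import numpy as np

# "train_data.bin" and the dtype are placeholders.
data = np.memmap("train_data.bin", dtype=np.uint16, mode="r")

total = 0
for start in range(0, len(data), 4096):
    # Each slice touched here pulls new pages of the file into memory.
    total += int(data[start:start + 2048].sum())
```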

shang-mt commented 4 hours ago

> Hi, the pretraining data is read via mmap, so a gradual increase in memory usage is expected; it will not exceed the total size of the dataset.

OK, we will keep testing for a while longer. On our own cards, the 1000 GB of host memory does get exhausted.
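To help tell reclaimable mmap page cache apart from a genuine leak, here is a minimal Linux-only sketch (not PaddleNLP code; the field names come from /proc/meminfo):

```python
# Minimal sketch (Linux only): read /proc/meminfo to see how much of the "used" memory
# is reclaimable page cache versus truly unavailable. If MemAvailable keeps shrinking
# while Cached stays flat, that points to a real leak rather than mmap page cache.
def meminfo_gib(*keys: str) -> dict:
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in keys:
                values[name] = int(rest.split()[0]) / 1024**2  # kB -> GiB
    return values

print(meminfo_gib("MemTotal", "MemAvailable", "Cached"))
```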