I just ran into the same problem. It turned out I needed to set the max_input_length parameter in the config file to something larger, e.g. 4096. The cause is that samples in the training and test sets are too long, so they get skipped during testing and validation. (Every sample in my data is longer than the preset 512.) Not sure whether you are hitting the same issue.
Shouldn't it just get truncated? So it actually isn't truncated? For example, with a 512-token limit I would expect only the first 512 tokens to be kept.
Take a look at the process_batch_eval function; it contains this line:
if len(input_ids) >= max_input_length: break
Oh, that's because even with truncation the input would still be incomplete, so the sample is simply skipped. If GPU memory really is the constraint, this could be changed to truncate instead.
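For reference, here is a minimal sketch of the two behaviours being discussed; this is not the repo's actual process_batch_eval, and the function name and truncate flag are made up for illustration:

```python
# Illustration only, not the actual process_batch_eval from finetune.py.
# Compares "skip over-long samples" (current behaviour) with "keep a truncated prefix".
def filter_eval_inputs(all_input_ids, max_input_length, truncate=False):
    kept = []
    for input_ids in all_input_ids:
        if len(input_ids) >= max_input_length:
            if not truncate:
                continue  # current behaviour: the over-long sample is dropped from eval
            input_ids = input_ids[:max_input_length]  # alternative: evaluate on the prefix
        kept.append(input_ids)
    return kept
```

As noted above, truncation keeps the sample but feeds the model an incomplete prompt, which is why the demo skips it instead.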
My max_input_length is set to 1024, but the dataset really does contain quite a few samples longer than 1024. I can't go higher because I run out of GPU memory: I have two 40G A100s, only 80G in total, which honestly doesn't feel like enough for fine-tuning. So far I have only changed three things in the official lora.yaml example: train_file, val_file, and max_input_length. After some experimenting I found that any larger max_input_length runs out of memory, so I left it at 1024.
Is there a way to increase max_input_length without increasing the memory requirement by changing other parameters in lora.yaml? Or should I use a different fine-tuning command?
Editing this config file doesn't really change the code. The direct way to increase max_input_length without increasing memory usage is to lower the LoRA rank, but the effect isn't dramatic either. As for 40G not being enough: are you running SFT?
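For reference, the knobs involved sit in lora.yaml and look roughly like the excerpt below; the field names follow the official example config, the values are only an illustration, and your copy may differ:

```yaml
# hypothetical excerpt of configs/lora.yaml
max_input_length: 2048   # the input length you want to support
max_output_length: 512
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 4                   # lower LoRA rank, fewer trainable params and optimizer states
  lora_alpha: 32
  lora_dropout: 0.1
```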
I'm just using the official command: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_hf.py data/AdvertiseGen/ THUDM/glm-4-9b configs/lora.yaml, so it shouldn't be SFT.
That's not right: why 8 cards? You said you have two 40G cards, so the per-node process count should only be 2. Also, try adding DeepSpeed, similar to the ds config file we provide for SFT.
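For two cards, the same launch command would presumably just change the process count, e.g.:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=2 finetune_hf.py data/AdvertiseGen/ THUDM/glm-4-9b configs/lora.yaml
```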
I meant that I'm not using SFT, not that I actually ran with 8 cards. Here is what it actually looks like when it runs (see screenshot):
Maybe try something like the ds config file we provide for SFT and see whether it succeeds: add deepspeed: configs/ds_zero_3.json at the end of the config file (lora.yaml), then replace configs/ds_zero_3.json with your own absolute path. With LoRA it generally shouldn't exceed 40G of memory.
Another approach is to first check whether a single card works: run the fine-tuning script with plain python on one card and see at roughly what point it blows up.
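Concretely, that would mean appending one line at the end of lora.yaml, roughly like this; the path below is a placeholder and should be replaced with your own absolute path:

```yaml
deepspeed: /abs/path/to/GLM-4/finetune_demo/configs/ds_zero_3.json
```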
This is the SFT run with the config set to max_input_length=1024 (my server clock is a bit off, that's not a big deal):
[2024-06-12 07:01:52,891] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=d6490eb, git-branch=HEAD
[2024-06-12 07:01:52,896] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.02422308921813965 seconds
[2024-06-12 07:01:52,922] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-06-12 07:01:52,922] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-06-12 07:01:52,933] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-06-12 07:01:52,933] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-06-12 07:01:52,933] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-06-12 07:01:52,933] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 3 optimizer
[2024-06-12 07:01:53,047] [INFO] [utils.py:779:see_memory_usage] Stage 3 initialize beginning
[2024-06-12 07:01:53,048] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 22.13 GB CA 18.0 GB Max_CA 23 GB
[2024-06-12 07:01:53,048] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
[2024-06-12 07:01:53,049] [INFO] [stage3.py:130:__init__] Reduce bucket size 16777216
[2024-06-12 07:01:53,049] [INFO] [stage3.py:131:__init__] Prefetch bucket size 15099494
[2024-06-12 07:01:53,157] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-06-12 07:01:53,157] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 17.51 GB CA 18.0 GB Max_CA 18 GB
[2024-06-12 07:01:53,158] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
Parameter Offload: Total persistent parameters: 516096 in 121 params
[2024-06-12 07:01:53,284] [INFO] [utils.py:779:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-06-12 07:01:53,284] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 17.51 GB CA 18.0 GB Max_CA 18 GB
[2024-06-12 07:01:53,284] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
[2024-06-12 07:01:53,395] [INFO] [utils.py:779:see_memory_usage] Before creating fp16 partitions
[2024-06-12 07:01:53,396] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 17.51 GB CA 18.0 GB Max_CA 18 GB
[2024-06-12 07:01:53,396] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
[2024-06-12 07:02:00,647] [INFO] [utils.py:779:see_memory_usage] After creating fp16 partitions: 5
[2024-06-12 07:02:00,648] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 17.51 GB CA 17.51 GB Max_CA 18 GB
[2024-06-12 07:02:00,648] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/codes/GLM-4/finetune_demo/finetune.py:410 in main │
│ │
│ 407 │ ) │
│ 408 │ │
│ 409 │ if auto_resume_from_checkpoint.upper() == "" or auto_resume_from_c │
│ ❱ 410 │ │ trainer.train() │
│ 411 │ else: │
│ 412 │ │ output_dir = ft_config.training_args.output_dir │
│ 413 │ │ dirlist = os.listdir(output_dir) │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1539 in train │
│ │
│ 1536 │ │ │ finally: │
│ 1537 │ │ │ │ hf_hub_utils.enable_progress_bars() │
│ 1538 │ │ else: │
│ ❱ 1539 │ │ │ return inner_training_loop( │
│ 1540 │ │ │ │ args=args, │
│ 1541 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1542 │ │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1690 in _inner_training_loop │
│ │
│ 1687 │ │ │ │ │ model, self.optimizer = self.accelerator.prepare( │
│ 1688 │ │ │ else: │
│ 1689 │ │ │ │ # to handle cases wherein we pass "DummyScheduler" su │
│ ❱ 1690 │ │ │ │ model, self.optimizer, self.lr_scheduler = self.accel │
│ 1691 │ │ │ │ │ self.model, self.optimizer, self.lr_scheduler │
│ 1692 │ │ │ │ ) │
│ 1693 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/accelerate/accelerato │
│ r.py:1284 in prepare │
│ │
│ 1281 │ │ │ elif self.device.type == "xpu" and is_xpu_available(): │
│ 1282 │ │ │ │ args = self._prepare_ipex(*args) │
│ 1283 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1284 │ │ │ result = self._prepare_deepspeed(*args) │
│ 1285 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1286 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 1287 │ │ else: │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/accelerate/accelerato │
│ r.py:1751 in _prepare_deepspeed │
│ │
│ 1748 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runt │
│ 1749 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1750 │ │ │ │
│ ❱ 1751 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize │
│ 1752 │ │ │ if optimizer is not None: │
│ 1753 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1754 │ │ │ if scheduler is not None: │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/__init__.py │
│ :181 in initialize │
│ │
│ 178 │ │ │ │ │ │ │ │ │ │ config=config, │
│ 179 │ │ │ │ │ │ │ │ │ │ config_class=config_class) │
│ 180 │ │ else: │
│ ❱ 181 │ │ │ engine = DeepSpeedEngine(args=args, │
│ 182 │ │ │ │ │ │ │ │ │ model=model, │
│ 183 │ │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 184 │ │ │ │ │ │ │ │ │ model_parameters=model_parameters │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:307 in __init__ │
│ │
│ 304 │ │ │ model_parameters = list(model_parameters) │
│ 305 │ │ │
│ 306 │ │ if has_optimizer: │
│ ❱ 307 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 308 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 309 │ │ │ self._report_progress(0) │
│ 310 │ │ elif self.zero_optimization(): │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:1258 in _configure_optimizer │
│ │
│ 1255 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_opt │
│ 1256 │ │ │
│ 1257 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1258 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_opt │
│ 1259 │ │ elif optimizer_wrapper == AMP: │
│ 1260 │ │ │ amp_params = self.amp_params() │
│ 1261 │ │ │ log_dist(f"Initializing AMP with these params: {amp_param │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:1582 in _configure_zero_optimizer │
│ │
│ 1579 │ │ │ │ │
│ 1580 │ │ │ │ log_dist(f'Creating {model_dtype} ZeRO stage {zero_st │
│ 1581 │ │ │ │ from deepspeed.runtime.zero.stage3 import DeepSpeedZe │
│ ❱ 1582 │ │ │ │ optimizer = DeepSpeedZeroOptimizer_Stage3( │
│ 1583 │ │ │ │ │ self.module, │
│ 1584 │ │ │ │ │ optimizer, │
│ 1585 │ │ │ │ │ timers=timers, │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:362 in __init__ │
│ │
│ 359 │ │ │
│ 360 │ │ print_rank_0(f'Largest partitioned param numel = {largest_par │
│ 361 │ │ │
│ ❱ 362 │ │ self._setup_for_real_optimizer() │
│ 363 │ │ self.grad_position = {} │
│ 364 │ │ self.set_grad_positions() │
│ 365 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:472 in _setup_for_real_optimizer │
│ │
│ 469 │ │
│ 470 │ def _setup_for_real_optimizer(self): │
│ 471 │ │ see_memory_usage("Before creating fp32 partitions", force=Tru │
│ ❱ 472 │ │ self._create_fp32_partitions() │
│ 473 │ │ see_memory_usage("After creating fp32 partitions", force=True │
│ 474 │ │ dist.barrier() │
│ 475 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:864 in _create_fp32_partitions │
│ │
│ 861 │ │ │ │ │ │ │ self.subgroup_to_device[i]).clone().float │
│ 862 │ │ │ │ │ else: │
│ 863 │ │ │ │ │ │ self.fp32_partitioned_groups_flat.append(self │
│ ❱ 864 │ │ │ │ │ │ │ self.device).clone().float().detach()) │
│ 865 │ │ │ │
│ 866 │ │ │ self.fp32_partitioned_groups_flat[i].requires_grad = True │
│ 867 │ │ │ ds_id_begin = str(self.fp16_partitioned_groups_flat_id[i] │
╰──────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.80 GiB. GPU 1 has a
total capacity of 39.39 GiB of which 1.37 GiB is free. Including non-PyTorch
memory, this process has 0 bytes memory in use. Of the allocated memory 28.92
GiB is allocated by PyTorch, and 6.53 MiB is reserved by PyTorch but
unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-06-12 07:02:00,769] [INFO] [utils.py:779:see_memory_usage] Before creating fp32 partitions
[2024-06-12 07:02:00,769] [INFO] [utils.py:780:see_memory_usage] MA 17.51 GB Max_MA 17.51 GB CA 17.51 GB Max_CA 18 GB
[2024-06-12 07:02:00,769] [INFO] [utils.py:787:see_memory_usage] CPU Virtual Memory: used = 14.95 GB, percent = 7.9%
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/codes/GLM-4/finetune_demo/finetune.py:410 in main │
│ │
│ 407 │ ) │
│ 408 │ │
│ 409 │ if auto_resume_from_checkpoint.upper() == "" or auto_resume_from_c │
│ ❱ 410 │ │ trainer.train() │
│ 411 │ else: │
│ 412 │ │ output_dir = ft_config.training_args.output_dir │
│ 413 │ │ dirlist = os.listdir(output_dir) │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1539 in train │
│ │
│ 1536 │ │ │ finally: │
│ 1537 │ │ │ │ hf_hub_utils.enable_progress_bars() │
│ 1538 │ │ else: │
│ ❱ 1539 │ │ │ return inner_training_loop( │
│ 1540 │ │ │ │ args=args, │
│ 1541 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1542 │ │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1690 in _inner_training_loop │
│ │
│ 1687 │ │ │ │ │ model, self.optimizer = self.accelerator.prepare( │
│ 1688 │ │ │ else: │
│ 1689 │ │ │ │ # to handle cases wherein we pass "DummyScheduler" su │
│ ❱ 1690 │ │ │ │ model, self.optimizer, self.lr_scheduler = self.accel │
│ 1691 │ │ │ │ │ self.model, self.optimizer, self.lr_scheduler │
│ 1692 │ │ │ │ ) │
│ 1693 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/accelerate/accelerato │
│ r.py:1284 in prepare │
│ │
│ 1281 │ │ │ elif self.device.type == "xpu" and is_xpu_available(): │
│ 1282 │ │ │ │ args = self._prepare_ipex(*args) │
│ 1283 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1284 │ │ │ result = self._prepare_deepspeed(*args) │
│ 1285 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1286 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 1287 │ │ else: │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/accelerate/accelerato │
│ r.py:1751 in _prepare_deepspeed │
│ │
│ 1748 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runt │
│ 1749 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1750 │ │ │ │
│ ❱ 1751 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize │
│ 1752 │ │ │ if optimizer is not None: │
│ 1753 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1754 │ │ │ if scheduler is not None: │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/__init__.py │
│ :181 in initialize │
│ │
│ 178 │ │ │ │ │ │ │ │ │ │ config=config, │
│ 179 │ │ │ │ │ │ │ │ │ │ config_class=config_class) │
│ 180 │ │ else: │
│ ❱ 181 │ │ │ engine = DeepSpeedEngine(args=args, │
│ 182 │ │ │ │ │ │ │ │ │ model=model, │
│ 183 │ │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 184 │ │ │ │ │ │ │ │ │ model_parameters=model_parameters │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:307 in __init__ │
│ │
│ 304 │ │ │ model_parameters = list(model_parameters) │
│ 305 │ │ │
│ 306 │ │ if has_optimizer: │
│ ❱ 307 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 308 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 309 │ │ │ self._report_progress(0) │
│ 310 │ │ elif self.zero_optimization(): │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:1258 in _configure_optimizer │
│ │
│ 1255 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_opt │
│ 1256 │ │ │
│ 1257 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1258 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_opt │
│ 1259 │ │ elif optimizer_wrapper == AMP: │
│ 1260 │ │ │ amp_params = self.amp_params() │
│ 1261 │ │ │ log_dist(f"Initializing AMP with these params: {amp_param │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/eng │
│ ine.py:1582 in _configure_zero_optimizer │
│ │
│ 1579 │ │ │ │ │
│ 1580 │ │ │ │ log_dist(f'Creating {model_dtype} ZeRO stage {zero_st │
│ 1581 │ │ │ │ from deepspeed.runtime.zero.stage3 import DeepSpeedZe │
│ ❱ 1582 │ │ │ │ optimizer = DeepSpeedZeroOptimizer_Stage3( │
│ 1583 │ │ │ │ │ self.module, │
│ 1584 │ │ │ │ │ optimizer, │
│ 1585 │ │ │ │ │ timers=timers, │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:362 in __init__ │
│ │
│ 359 │ │ │
│ 360 │ │ print_rank_0(f'Largest partitioned param numel = {largest_par │
│ 361 │ │ │
│ ❱ 362 │ │ self._setup_for_real_optimizer() │
│ 363 │ │ self.grad_position = {} │
│ 364 │ │ self.set_grad_positions() │
│ 365 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:472 in _setup_for_real_optimizer │
│ │
│ 469 │ │
│ 470 │ def _setup_for_real_optimizer(self): │
│ 471 │ │ see_memory_usage("Before creating fp32 partitions", force=Tru │
│ ❱ 472 │ │ self._create_fp32_partitions() │
│ 473 │ │ see_memory_usage("After creating fp32 partitions", force=True │
│ 474 │ │ dist.barrier() │
│ 475 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/deepspeed/runtime/zer │
│ o/stage3.py:864 in _create_fp32_partitions │
│ │
│ 861 │ │ │ │ │ │ │ self.subgroup_to_device[i]).clone().float │
│ 862 │ │ │ │ │ else: │
│ 863 │ │ │ │ │ │ self.fp32_partitioned_groups_flat.append(self │
│ ❱ 864 │ │ │ │ │ │ │ self.device).clone().float().detach()) │
│ 865 │ │ │ │
│ 866 │ │ │ self.fp32_partitioned_groups_flat[i].requires_grad = True │
│ 867 │ │ │ ds_id_begin = str(self.fp16_partitioned_groups_flat_id[i] │
╰──────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.80 GiB. GPU 0 has a
total capacity of 39.39 GiB of which 2.87 GiB is free. Including non-PyTorch
memory, this process has 0 bytes memory in use. Of the allocated memory 28.92
GiB is allocated by PyTorch, and 6.53 MiB is reserved by PyTorch but
unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-06-12 07:02:06,027] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 19475) of binary: /root/anaconda3/envs/py11/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/py11/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/codes/GLM-4/finetune_demo/finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-06-12_07:02:06
host : 1b0d510e2085
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 19476)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-12_07:02:06
host : 1b0d510e2085
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 19475)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
And here is the single-card LoRA run, also with length set to 1024:
0%| | 0/3000 [00:00<?, ?it/s]/root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/codes/GLM-4/finetune_demo/finetune.py:410 in main │
│ │
│ 407 │ ) │
│ 408 │ │
│ 409 │ if auto_resume_from_checkpoint.upper() == "" or auto_resume_from_c │
│ ❱ 410 │ │ trainer.train() │
│ 411 │ else: │
│ 412 │ │ output_dir = ft_config.training_args.output_dir │
│ 413 │ │ dirlist = os.listdir(output_dir) │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1539 in train │
│ │
│ 1536 │ │ │ finally: │
│ 1537 │ │ │ │ hf_hub_utils.enable_progress_bars() │
│ 1538 │ │ else: │
│ ❱ 1539 │ │ │ return inner_training_loop( │
│ 1540 │ │ │ │ args=args, │
│ 1541 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1542 │ │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:1869 in _inner_training_loop │
│ │
│ 1866 │ │ │ │ │ self.control = self.callback_handler.on_step_begi │
│ 1867 │ │ │ │ │
│ 1868 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1869 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1870 │ │ │ │ │
│ 1871 │ │ │ │ if ( │
│ 1872 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/transformers/trainer. │
│ py:2781 in training_step │
│ │
│ 2778 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2779 │ │ │ │ scaled_loss.backward() │
│ 2780 │ │ else: │
│ ❱ 2781 │ │ │ self.accelerator.backward(loss) │
│ 2782 │ │ │
│ 2783 │ │ return loss.detach() / self.args.gradient_accumulation_steps │
│ 2784 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/accelerate/accelerato │
│ r.py:2125 in backward │
│ │
│ 2122 │ │ elif learning_rate is not None and self.has_lomo_optimizer: │
│ 2123 │ │ │ self.lomo_backward(loss, learning_rate) │
│ 2124 │ │ else: │
│ ❱ 2125 │ │ │ loss.backward(**kwargs) │
│ 2126 │ │
│ 2127 │ def set_trigger(self): │
│ 2128 │ │ """ │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/_tensor.py:522 │
│ in backward │
│ │
│ 519 │ │ │ │ create_graph=create_graph, │
│ 520 │ │ │ │ inputs=inputs, │
│ 521 │ │ │ ) │
│ ❱ 522 │ │ torch.autograd.backward( │
│ 523 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 524 │ │ ) │
│ 525 │
│ │
│ /root/anaconda3/envs/py11/lib/python3.11/site-packages/torch/autograd/__init │
│ __.py:266 in backward │
│ │
│ 263 │ # The reason we repeat the same comment below is that │
│ 264 │ # some Python versions print out the first line of a multi-line fu │
│ 265 │ # calls in the traceback and some print out the last line │
│ ❱ 266 │ Variable._execution_engine.run_backward( # Calls into the C++ eng │
│ 267 │ │ tensors, │
│ 268 │ │ grad_tensors_, │
│ 269 │ │ retain_graph, │
╰──────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 0 has a
total capacity of 39.39 GiB of which 30.81 MiB is free. Including non-PyTorch
memory, this process has 0 bytes memory in use. Of the allocated memory 30.99
GiB is allocated by PyTorch, and 803.38 MiB is reserved by PyTorch but
unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See
documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
It looks like there isn't enough GPU memory in either case.
Does fine-tuning glm4 not support dataset files in .json format? I get this error during fine-tuning: NotImplementedError: Cannot load dataset in the '.json' format
Apparently not. I converted my files to jsonl; you should change yours as well.
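If the data is currently a single JSON array, a small script along these lines converts it to JSON Lines; the file names here are placeholders:

```python
import json

def json_array_to_jsonl(src: str, dst: str) -> None:
    """Write each element of a top-level JSON array as one JSON object per line."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

json_array_to_jsonl("train.json", "train.jsonl")
```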
Did you ever solve this? I have the same problem: two 40G A100s and LoRA won't run.
Please see the README for how to prepare the dataset.
80G is not enough for SFT.
System Info
Two A100 GPUs. Chat/inference runs without problems. I haven't modified any code; I'm only using my own dataset, written in the tool format, but training fails with an error during fine-tuning.
Who can help?
No response
Information
Reproduction
Expected behavior
Could you take a look at what the problem is?