InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

How do I fine-tune with 2 GPUs? #349

Closed zhanghui-china closed 9 months ago

zhanghui-china commented 9 months ago

No matter how I modify the example .py config file, fine-tuning only ever uses 1 GPU. For example, with batch_size set to 1 in internlm2_chat_7b_qlora_oasst1_e3_copy.py, running `xtuner train ./internlm2_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2` uses about 17 GB on one GPU (screenshots omitted).

With batch_size = 2: about 20 GB of GPU memory (screenshots omitted).

With batch_size = 3: about 23 GB of GPU memory (screenshots omitted).

Although that one GPU is nearly full, the other GPU never gets used.

How can I make fine-tuning fill the memory of both GPUs?

Also, after changing batch_size to 2 or 3, the estimated overall fine-tuning time doesn't seem to change noticeably. How can I significantly reduce the fine-tuning time?

LZHgrla commented 9 months ago
  1. For training with 2 GPUs on a single machine, distributed training (DIST) is usually all you need, i.e. NPROC_PER_NODE=2 xtuner train XXXXX

    # On multiple GPUs
    (DIST) NPROC_PER_NODE=${GPU_NUM} xtuner train internlm2_chat_7b_qlora_oasst1_e3 --deepspeed deepspeed_zero2
    (SLURM) srun ${SRUN_ARGS} xtuner train internlm2_chat_7b_qlora_oasst1_e3 --launcher slurm --deepspeed deepspeed_zero2
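
    For this issue's 2-GPU setup, plugging the copied config from the first post into the DIST form above gives:

        NPROC_PER_NODE=2 xtuner train ./internlm2_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2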
  2. Increasing the batch size does not speed things up much, probably because putting several samples into one mini-batch introduces many padding tokens that waste compute. Two ways to fix this:

    1. Use pack_to_max_length=True to avoid padding tokens (see the config sketch after this list). https://github.com/InternLM/xtuner/blob/60fabeb2ba3fb552b8f2e2925353b328068434e5/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_oasst1_e3.py#L30
    2. Replace the default DefaultSampler with xtuner.dataset.samplers.LengthGroupedSampler to reduce the number of padding tokens, i.e. apply the following change:

      -from mmengine.dataset import DefaultSampler
      +from xtuner.dataset.samplers import LengthGroupedSampler
      train_dataloader = dict(
        batch_size=batch_size,
        num_workers=dataloader_num_workers,
        dataset=train_dataset,
      -  sampler=dict(type=DefaultSampler, shuffle=True),
      +  sampler=dict(
      +       type=LengthGroupedSampler,
      +       length_property='length',
      +       per_device_batch_size=batch_size * accumulative_counts),
        collate_fn=dict(type=default_collate_fn))
      

    Note: pick only one of these two approaches; using both together brings no additional benefit.
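
For reference, a minimal sketch of what option 1 looks like inside the config file (variable names follow the linked upstream config; the max_length value and the commented dataset lines are assumptions, so check your own copy):

    # internlm2_chat_7b_qlora_oasst1_e3_copy.py (excerpt, sketch)
    max_length = 2048            # assumed value; packed sequences are built up to this length
    pack_to_max_length = True    # concatenate samples up to max_length so mini-batches contain no padding

    # further down, the dataset definition passes these through, roughly:
    # train_dataset = dict(type=process_hf_dataset, ..., max_length=max_length,
    #                      pack_to_max_length=pack_to_max_length)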

zhanghui-china commented 9 months ago

OK, thanks, I'll give it a try.

zhanghui-china commented 9 months ago
> [quotes the two suggestions from the previous comment]

Hi, pack_to_max_length=True can't simply be deleted; removing the setting causes an error (error screenshot omitted).
zhanghui-china commented 9 months ago
(screenshot omitted)
LZHgrla commented 9 months ago

@zhanghui-china You definitely shouldn't delete it, otherwise a variable will be missing; set it to True or False instead.

zhanghui-china commented 9 months ago

First I experimented with pack_to_max_length=False together with replacing the sampler with xtuner.dataset.samplers.LengthGroupedSampler, using batch_size=4.


With batch_size=4 it runs out of GPU memory.

With batch_size=3: about 23 GB of GPU memory.


The ETA keeps decreasing, so it's hard to pin down exactly (originally it was a little over 4 hours); I'm not sure yet whether this is faster or slower.

LZHgrla commented 9 months ago

Pick only one of the two approaches; combining them brings no additional benefit.

For the fastest training, using pack_to_max_length=True by itself is enough.

zhanghui-china commented 9 months ago

With pack_to_max_length=True and batch_size=3:

(screenshots of the run omitted)
zhanghui-china commented 9 months ago

Why does the former show 33334 iterations while the latter shows only 4040? That's about an 8x difference.

LZHgrla commented 9 months ago

> Why does the former show 33334 iterations while the latter shows only 4040? That's about an 8x difference.

Because multiple data samples are packed into a single sequence.
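
As a rough sanity check, using the iteration counts reported above (treating the ratio as the average number of samples packed into each max_length sequence is an approximation, not an exact accounting):

    # iteration counts per epoch reported in this thread (batch_size=3)
    iters_unpacked = 33334   # pack_to_max_length=False, LengthGroupedSampler
    iters_packed = 4040      # pack_to_max_length=True
    # each packed sequence holds roughly this many samples' worth of tokens on average
    print(iters_unpacked / iters_packed)   # ~8.25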

zhanghui-china commented 9 months ago

Indeed, the earlier pack_to_max_length=False run eventually settled at a little over 6 hours, slower than all three runs with True; with True, the time stays at a little over 4 hours whether batch_size is 1, 2, or 3.

zhanghui-china commented 9 months ago

    01/23 20:09:33 - mmengine - INFO - Epoch(train) [1][ 1160/33334] lr: 1.9999e-04 eta: 6:18:30 time: 0.5878 data_time: 0.0014 memory: 10658 loss: 2.4294
    01/23 20:09:39 - mmengine - INFO - Epoch(train) [1][ 1170/33334] lr: 1.9999e-04 eta: 6:17:51 time: 0.5887 data_time: 0.0014 memory: 10642 loss: 2.4394
    01/23 20:09:44 - mmengine - INFO - Epoch(train) [1][ 1180/33334] lr: 1.9999e-04 eta: 6:17:11 time: 0.5838 data_time: 0.0015 memory: 10631 loss: 2.4017
    01/23 20:09:50 - mmengine - INFO - Epoch(train) [1][ 1190/33334] lr: 1.9998e-04 eta: 6:16:31 time: 0.5819 data_time: 0.0013 memory: 10624 loss: 2.3772
    01/23 20:09:56 - mmengine - INFO - Epoch(train) [1][ 1200/33334] lr: 1.9998e-04 eta: 6:15:51 time: 0.5827 data_time: 0.0013 memory: 10605 loss: 2.4403
    01/23 20:10:02 - mmengine - INFO - Epoch(train) [1][ 1210/33334] lr: 1.9998e-04 eta: 6:15:10 time: 0.5721 data_time: 0.0012 memory: 10589 loss: 2.4141
    01/23 20:10:07 - mmengine - INFO - Epoch(train) [1][ 1220/33334] lr: 1.9998e-04 eta: 6:14:25 time: 0.5544 data_time: 0.0018 memory: 10577 loss: 2.3803
    01/23 20:10:13 - mmengine - INFO - Epoch(train) [1][ 1230/33334] lr: 1.9998e-04 eta: 6:13:37 time: 0.5457 data_time: 0.0013 memory: 10569 loss: 2.5105
    01/23 20:10:18 - mmengine - INFO - Epoch(train) [1][ 1240/33334] lr: 1.9997e-04 eta: 6:12:53 time: 0.5525 data_time: 0.0016 memory: 10561 loss: 2.2810
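
For anyone skimming the log, a brief reading of one of these lines (the field meanings follow mmengine's logging conventions; the values are copied from the log above):

    Epoch(train) [1][1160/33334]   # iteration 1160 of 33334 in epoch 1
    lr: 1.9999e-04                 # current learning rate
    eta: 6:18:30                   # estimated time remaining for the run
    time: 0.5878                   # seconds per training iteration
    data_time: 0.0014              # seconds per iteration spent loading data
    memory: 10658                  # GPU memory used, in MB
    loss: 2.4294                   # training loss at this iteration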

LZHgrla commented 9 months ago

@zhanghui-china A small suggestion: please try to post fewer screenshots and instead wrap text in three ` characters, otherwise it makes this issue harder for others to read~

For example:

    xxxxx
    xxxxx
zhanghui-china commented 9 months ago

Understood. Thanks a lot~~