InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

【Program hangs with no output.】 #626

Open Luo-Z13 opened 4 months ago

Luo-Z13 commented 4 months ago

I am running instruction tuning of llama3_llava on my own dataset with the command NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero3_offload --seed 1024. After the following output, the program stops producing output but is still running:

 - mmengine - INFO - Iter(train) [   10/23076]  lr: 1.3034e-07  eta: 3 days, 3:30:35  time: 11.7851  data_time: 0.0298  memory: 15547  loss: nan
 - mmengine - INFO - Iter(train) [   20/23076]  lr: 2.7506e-07  eta: 3 days, 5:46:56  time: 12.5050  data_time: 0.0199  memory: 9964  loss: nan

It has been stuck in this state for 2 hours. What could be the possible cause?

LZHgrla commented 4 months ago

@Luo-Z13 The total number of iterations is a bit strange. Did you modify the settings in config?

Luo-Z13 commented 4 months ago

> @Luo-Z13 The total number of iterations is a bit strange. Did you modify the settings in config?

My script:

NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
                                       --deepspeed deepspeed_zero3_offload --seed 1024

The training schedule:

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

Then I modified save_steps and changed the path-related settings to point to my own data and local paths. Besides that, there were no other changes.
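
For reference, edits of this kind look roughly as follows in the config (a sketch with placeholder paths; the variable names follow the stock xtuner LLaVA fine-tune config, and none of the values below are my actual settings):

# Data: point to my own instruction-tuning data and local images
data_root = './data/my_data/'                    # placeholder path
data_path = data_root + 'my_instruct_data.json'  # placeholder file name
image_folder = data_root + 'images'              # placeholder folder

# Model: local paths instead of the Hugging Face hub names
llm_name_or_path = '/path/to/Meta-Llama-3-8B-Instruct'               # placeholder
visual_encoder_name_or_path = '/path/to/clip-vit-large-patch14-336'  # placeholder

# Save: modified checkpoint interval
save_steps = 500
save_total_limit = 2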

LZHgrla commented 4 months ago

@Luo-Z13 How many GPUs are you using for training?

Luo-Z13 commented 4 months ago

> @Luo-Z13 How many GPUs are you using for training?

I am using 4 × A100 (40G).

Luo-Z13 commented 4 months ago

> @Luo-Z13 How many GPUs are you using for training?

Also, the pre-training of LLaVA-llama3 ran normally.

LZHgrla commented 4 months ago

@Luo-Z13

Under your configuration, the total dataset size is 4 × 4 × 23076 = 369216. However, the correct size of the llava fine-tuning dataset is ~650000. This size mismatch seems a bit unusual. Have you modified the training data?
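
For reference, a minimal sketch of the arithmetic behind this check, assuming each logged iteration consumes one per-GPU batch (so gradient accumulation does not change the iteration count):

# Samples implied by the logged iteration count, assuming each "Iter(train)"
# step consumes one batch of size 4 on each of the 4 GPUs.
num_gpus = 4        # 4 x A100 (40G), as stated above
batch_size = 4      # per_device, from the config
max_iters = 23076   # from the log: Iter(train) [   10/23076]

samples_per_epoch = num_gpus * batch_size * max_iters
print(samples_per_epoch)  # 369216, versus ~650000 for the original llava data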

Luo-Z13 commented 4 months ago

> @Luo-Z13
>
> Under your configuration, the total dataset size is 4 × 4 × 23076 = 369216. However, the correct size of the llava fine-tuning dataset is ~650000. This size mismatch seems a bit unusual. Have you modified the training data?

Hello, I'm using my own instruction-tuning data, so the total number of iterations is different. Do I need to check the format of my dataset?

LZHgrla commented 4 months ago

@Luo-Z13

Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.

Additionally, here are some other suggestions:

  1. Keep the global batch size at 128. In your case, consider setting the accumulative_counts to 8.
  2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)
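
A minimal sketch of how suggestion 1 works out for this setup (4 GPUs, per-device batch size 4; the numbers are the ones from this thread):

# Global batch size = num_gpus * per-device batch size * accumulative_counts.
num_gpus = 4
batch_size = 4               # per_device, unchanged
target_global_batch = 128    # the suggested global batch size

accumulative_counts = target_global_batch // (num_gpus * batch_size)
print(accumulative_counts)   # 8 -> set accumulative_counts = 8 in the config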

Luo-Z13 commented 4 months ago

> @Luo-Z13
>
> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting the accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you very much, I will try them.

Luo-Z13 commented 4 months ago

> @Luo-Z13
>
> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting the accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you for your suggestions. The loss is now normal, but there is a new problem: after training for a few batches, the iteration speed becomes very slow, as shown:

...
04/30 01:03:42 - mmengine - INFO - Iter(train) [  100/23072]  lr: 2.8656e-06  eta: 1 day, 13:54:11  time: 5.0159  data_time: 0.0117  memory: 9195  loss: 1.8088
04/30 01:04:31 - mmengine - INFO - Iter(train) [  110/23072]  lr: 3.1550e-06  eta: 1 day, 13:16:56  time: 4.8975  data_time: 0.0158  memory: 9167  loss: 1.3998
04/30 01:05:52 - mmengine - INFO - Iter(train) [  120/23072]  lr: 3.4444e-06  eta: 1 day, 14:26:13  time: 8.0493  data_time: 0.0101  memory: 9146  loss: 1.3203
04/30 01:06:42 - mmengine - INFO - Iter(train) [  130/23072]  lr: 3.7339e-06  eta: 1 day, 13:56:50  time: 5.0641  data_time: 0.0210  memory: 9125  loss: 1.2123
04/30 01:07:35 - mmengine - INFO - Iter(train) [  140/23072]  lr: 4.0233e-06  eta: 1 day, 13:37:29  time: 5.2818  data_time: 0.0184  memory: 9104  loss: 1.0494
04/30 03:07:10 - mmengine - INFO - Iter(train) [  150/23072]  lr: 4.3127e-06  eta: 14 days, 3:40:15  time: 717.5106  data_time: 0.0726  memory: 9090  loss: 0.8822
04/30 03:42:26 - mmengine - INFO - Iter(train) [  160/23072]  lr: 4.6022e-06  eta: 16 days, 18:28:43  time: 211.6158  data_time: 0.1037  memory: 9069  loss: 0.8258
04/30 06:08:50 - mmengine - INFO - Iter(train) [  170/23072]  lr: 4.8916e-06  eta: 29 days, 11:20:41  time: 878.3878  data_time: 0.1637  memory: 9055  loss: 0.7201
04/30 09:23:10 - mmengine - INFO - Iter(train) [  180/23072]  lr: 5.1810e-06  eta: 44 days, 23:40:10  time: 1165.9963  data_time: 0.1712  memory: 9041  loss: 0.7931

What could be the cause of this? @LZHgrla

LZHgrla commented 4 months ago

@Luo-Z13 It seems to be caused by fluctuations in machine performance. Can this issue be reliably reproduced, and which command did you use?