THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs
Apache License 2.0

CUDA running out of memory for a very small dataset (7 sample training data) #281

Closed theharshithh closed 2 months ago

theharshithh commented 5 months ago

Hello,

I tried the vision fine-tuning script for the glm-4v-9b model. The command I used was python3 finetune_demo/finetune_vision.py ./data THUDM/glm-4v-9b ./finetune_demo/configs/lora.yaml

I ran the fine-tuning on a sample dataset of 7 examples and it runs out of GPU memory (NVIDIA A100-SXM4-80GB * 8). The data is configured properly and the model is able to parse train.jsonl, test.jsonl, and val.jsonl.

GPU config: NVIDIA A100-SXM4-80GB * 8

Errors

  1. Unable to set the number of epochs (see the sketch after this list).
  2. Despite meeting the hardware requirements mentioned here, I am still getting CUDA out of memory for 7 data samples.
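
On point 1: in transformers, when Seq2SeqTrainingArguments.max_steps is set to a positive value it overrides num_train_epochs, and the trainer derives the reported epoch count from the step budget and the dataset size. A minimal sketch of that arithmetic, with values mirroring the "Num Epochs = 375" line in the log below (illustrative only, not the repo's code):

from transformers import Seq2SeqTrainingArguments

# Illustrative values only: with 7 training examples and a total train batch
# size of 2, one epoch is ceil(7 / 2) = 4 optimizer steps, so max_steps = 1500
# is reported as about 1500 / 4 = 375 epochs, as in the log below.
args = Seq2SeqTrainingArguments(
    output_dir="./output",
    max_steps=1500,         # a positive max_steps overrides num_train_epochs
    num_train_epochs=3,     # ignored because max_steps is set
    per_device_train_batch_size=2,
)
print(args.max_steps, args.num_train_epochs)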

The last logs before the CUDA error:

***** Running training *****
  Num examples = 7
  Num Epochs = 375
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 1,500
  Number of trainable parameters = 3,198,976
  0%

Steps to recreate the error:

  1. The dataset is structured in the following format: ``
  2. Run python3 finetune_demo/finetune_vision.py ./data THUDM/glm-4v-9b ./finetune_demo/configs/lora.yaml
  3. The model gets loaded from the HF Hub and the dataset is fed in.
  4. When training starts, GPU usage spikes sharply on the very first step, even with only 7 samples (see the memory-logging sketch after this list).
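
On point 4, one way to see exactly when the memory jump happens is to log peak CUDA memory per step. This is a rough sketch using standard torch/transformers APIs; MemoryLogger is a hypothetical helper, not part of the demo:

import torch
from transformers import TrainerCallback

class MemoryLogger(TrainerCallback):
    # Hypothetical helper: print peak CUDA memory after every optimizer step.
    def on_step_end(self, args, state, control, **kwargs):
        for i in range(torch.cuda.device_count()):
            peak_gib = torch.cuda.max_memory_allocated(i) / 1024 ** 3
            print(f"step {state.global_step} | cuda:{i} peak {peak_gib:.1f} GiB")

# Usage inside finetune_vision.py, assuming its `trainer` object:
# trainer.add_callback(MemoryLogger())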

Please provide a better guide to, and understanding of, vision fine-tuning.

PS:

In the GLM English docs, we have this: "Execute single machine single card run through the following code."

python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml  # For Chat Fine-tune
python finetune.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml  # For VQA Fine-tune

Are we sure that python finetune.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml is the right command for VQA fine-tuning?

theharshithh commented 5 months ago

@zRzRzRzRzRzRzR

zRzRzRzRzRzRzR commented 5 months ago

A batch size per device of 2 may cause the problem: one batch already uses about 75 GB of memory. This demo does not use TP; it runs with DP, and that may cause the problem.
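
In other words: with torch.nn.DataParallel every visible GPU keeps a full replica of the 9B model, and the effective train batch is the per-device batch multiplied by the number of visible GPUs. A rough sketch of the arithmetic (assumed numbers, for illustration only):

import torch

per_device_train_batch_size = 2      # value from the original lora.yaml
n_gpus = torch.cuda.device_count()   # 8 on this machine

# Under DataParallel (no TP, no sharding) each GPU holds a full model replica
# and processes its own per-device batch, so the total batch scales with GPUs.
effective_batch = per_device_train_batch_size * n_gpus
print(f"effective train batch size: {effective_batch}")   # 16 with 8 GPUs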

theharshithh commented 5 months ago

Okay, I will try it out.

theharshithh commented 5 months ago

Can you please share the complete lora.yaml config that might work with the given hardware requirements?

theharshithh commented 5 months ago

@zRzRzRzRzRzRzR In the lora.yaml file

the config we use is

per_device_train_batch_size: 1

and

per_device_eval_batch_size: 1

So are you suggesting we use a batch size of 2 per device, or something else?

Because this current YAML config always gives me CUDA out of memory.

zRzRzRzRzRzRzR commented 5 months ago

If you can't use the default configuration normally, then a larger batch size will definitely not work; this is already the minimum configuration. But in my tests it uses about 75 GB of video memory even on an A100. Are you sure that the maximum input length and output length are both 512?

theharshithh commented 5 months ago

We used a batch size per device of 1, not 2. We did use 2 at first but changed it in the lora.yaml.

Answering your question: yes, the input length is less than 512 tokens (less than 256, in fact).
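
For reference, one way to sanity-check the text token count against max_input_length is to run a sample prompt through the model's tokenizer. This is a sketch only; the prompt string is a placeholder, and the image tokens glm-4v adds on top of the text are not counted here:

from transformers import AutoTokenizer

# glm-4v-9b ships its tokenizer as remote code on the HF Hub.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

sample_prompt = "Describe the document shown in this image."  # placeholder text
n_tokens = len(tokenizer.encode(sample_prompt))
print(f"text tokens: {n_tokens}")  # should stay well below max_input_length = 512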

We tried with this lora.yaml

data_config:
  train_file: train.jsonl
  val_file: val.jsonl
  test_file: test.jsonl
  num_proc: 1

max_input_length: 512
max_output_length: 512
training_args:
  output_dir: ./output
  max_steps: 128
  learning_rate: 5e-4
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  dataloader_num_workers: 7 
  remove_unused_columns: false
  save_strategy: steps
  save_steps: 10
  log_level: info
  logging_strategy: steps
  logging_steps: 50
  evaluation_strategy: steps
  eval_steps: 50
  predict_with_generate: true
  fp16: true 
  gradient_accumulation_steps: 1
  generation_config:
    max_new_tokens: 256
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 1
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]

Here are the logs before CUDA out of memory:

***** Running training *****
  Num examples = 7
  Num Epochs = 128
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 128
  Number of trainable parameters = 799,744
  0%|                                                                             | 0/128 [00:00<?, ?it/s]

Here is the GPU snapshot at that time:

Screenshot 2024-07-03 at 8 08 05 PM

We mostly suspect it's a DP problem, as only 7 GPUs are being used because we have only 7 examples.

I'd love to get in touch with you. I am reachable at harshith@onfinance.in. Help is much appreciated.

theharshithh commented 5 months ago

Update 1:

I tried DeepSpeed with this command: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 ./finetune_demo/finetune_vision.py ./data/ THUDM/glm-4v-9b finetune_demo/configs/lora.yaml

All the GPUs were used. When we did DP and had 2-2-3 data examples in test.jsonl, train.jsonl, and val.jsonl, only 2 GPUs were being used, so we assumed DP was the problem. When we ran the DeepSpeed command, all GPUs were used efficiently, but we still ran out of memory; the CUDA error persists.
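
One note on the launch above: torchrun --nproc_per_node=8 starts 8 separate processes, one per GPU, and unless ZeRO-style sharding is actually enabled in the YAML, each rank still loads its own full copy of the 9B model, so per-GPU memory does not drop. A minimal sketch of what each spawned process sees (illustrative only):

import os
import torch

# torchrun --nproc_per_node=8 sets these variables for every spawned process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

torch.cuda.set_device(local_rank)
# Each of the 8 ranks loads the full glm-4v-9b weights onto its own GPU here;
# only parameter/optimizer sharding (e.g. ZeRO-3) would change that.
print(f"rank {local_rank} of {world_size} using cuda:{local_rank}")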

Logs before error:

Screenshot 2024-07-03 at 9 41 07 PM

Please check below for the GPU usage.

Screenshot 2024-07-03 at 9 41 39 PM

Help is much appreciated. We are using the same GPU config: NVIDIA A100-SXM4-80GB * 8.

theharshithh commented 4 months ago

@zRzRzRzRzRzRzR Hello. Your help would be much appreciated

zRzRzRzRzRzRzR commented 4 months ago

Emm, does it work with one GPU? (Set only one GPU visible and tune the model with LoRA.) DeepSpeed will not work with this demo in my tests, as I wrote in the README. If one GPU works, the problem is probably in DP. I have no other idea to check right now.
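
One way to force the single-GPU case is to pin one device before anything touches CUDA; a minimal sketch (equivalently, CUDA_VISIBLE_DEVICES=0 can be set in the shell when launching the demo):

import os

# Must run before torch / transformers initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # 1 -- the trainer now sees a single GPU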

theharshithh commented 4 months ago

Can you please provide us with a fine-tuning script and a lora.yaml file that successfully tunes the model?

Can you share the one that works on your side, i.e. the one you have tested and found to be working?

zRzRzRzRzRzRzR commented 4 months ago

Here is what I use:

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: configs/ds_zero_3.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]

and I only use 1 x A100

theharshithh commented 4 months ago

Got it.

The last time you tried tuning this, did the training run successfully?

I'll implement the same to debug.

zRzRzRzRzRzRzR commented 4 months ago

Yes, and I got the adapter weights. The last time was today, haha.

theharshithh commented 4 months ago

Also, did you change anything else in finetune_vision.py?

Just to be on the safe side.

zRzRzRzRzRzRzR commented 4 months ago

Changing this may work:

image

Change it to 500 (both).

theharshithh commented 4 months ago

Noted. I will try it today.

theharshithh commented 4 months ago

@zRzRzRzRzRzRzR Hey.

I am able to save the adapter_config.json: Screenshot 2024-07-10 at 6 19 07 PM

It fails after this point: Screenshot 2024-07-10 at 6 21 15 PM

Note: We changed the Seq2Seq training arguments so that we can set both max_steps: 30 and num_train_epochs: 3.

It's able to run 21/30 steps. It's failing under:

print('hitting test')
if test_dataset is not None:
    trainer.predict(test_dataset)
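
To rule the final prediction pass in or out as the source of the OOM, one option is to release cached training memory right before it, or to skip the pass entirely. This is a sketch around the failing block, assuming the trainer and test_dataset objects from finetune_vision.py; it is not the repo's code:

import gc
import torch

# Hypothetical guard placed just before the failing block in finetune_vision.py.
gc.collect()
torch.cuda.empty_cache()  # release cached training allocations before generation

if test_dataset is not None:
    trainer.predict(test_dataset)  # generation pass (predict_with_generate: true)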

Error Logs:

Screenshot 2024-07-10 at 7 30 34 PM

Can you please verify whether you have tried inference.py with the saved model adapter configs?
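
For context, loading a saved LoRA adapter back onto the base model for inference could look roughly like the sketch below (uses peft; the checkpoint path is a placeholder, and this is not the repo's inference.py):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "THUDM/glm-4v-9b"
adapter_dir = "./output/checkpoint-30"  # placeholder path to the saved adapter

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir).eval()
# For glm-4v, generation also needs an image fed through the model's processor;
# this snippet only checks that the adapter loads on top of the base weights.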

Patience much appreciated!

Here is my updated repo: Link

theharshithh commented 4 months ago

Also, please share the branch of finetune_vision.py that is working for you.

zRzRzRzRzRzRzR commented 4 months ago

Got this issue. Please install transformers == 4.40.2 and use the main branch (latest commit). I have not met this issue before.

theharshithh commented 4 months ago

Noted. Can you please let me know whether a complete vision fine-tune and inference run completes successfully for you?

Just to confirm whether there is any other error that might come up.
