Closed theharshithh closed 2 months ago
@zRzRzRzRzRzRzR
batch size per device = 2 may cause problems.
A batch size of 1 already uses 75 GB of memory. This demo does not use TP; it runs with DP, and that may cause problems.
Okay will try it out
Can you please share the complete lora yaml config that might work with the given hardware requirements?
@zRzRzRzRzRzRzR In the lora.yaml file, the config we use is per_device_train_batch_size: 1 and per_device_eval_batch_size: 1.
So are you suggesting we use 2 batches per device, or something else? Because the current YAML config always gives me CUDA out of memory.
If you can't run the default configuration normally, then a larger batch size will definitely not work; this is already the minimum configuration. In my tests, even on an A100 it uses about 75 GB of video memory. Are you sure the maximum input length and output length are both 512?
We used batch size per device: 1, not 2. We did use 2 at first, but changed it in lora.yaml.
Answering your question: yes, the input length is less than 512 tokens (in fact, less than 256 tokens).
We tried with this lora.yaml:
data_config:
  train_file: train.jsonl
  val_file: val.jsonl
  test_file: test.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  output_dir: ./output
  max_steps: 128
  learning_rate: 5e-4
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  dataloader_num_workers: 7
  remove_unused_columns: false
  save_strategy: steps
  save_steps: 10
  log_level: info
  logging_strategy: steps
  logging_steps: 50
  evaluation_strategy: steps
  eval_steps: 50
  predict_with_generate: true
  fp16: true
  gradient_accumulation_steps: 1
  generation_config:
    max_new_tokens: 256
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 1
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
***** Running training *****
Num examples = 7
Num Epochs = 128
Instantaneous batch size per device = 1
Training with DataParallel so batch size has been adjusted to: 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 128
Number of trainable parameters = 799,744
0%| | 0/128 [00:00<?, ?it/s]
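(For what it's worth, the "adjusted to: 8" line above is the Trainer multiplying the per-device batch by the number of visible GPUs when it falls back to DataParallel. A rough sketch of that arithmetic, as I understand transformers.TrainingArguments.train_batch_size and not taken from the demo code itself:)

import torch

per_device_train_batch_size = 1
n_gpu = torch.cuda.device_count()  # 8 on this machine
# Rough sketch (assumption) of what transformers' TrainingArguments.train_batch_size
# does when the model ends up wrapped in torch.nn.DataParallel:
effective_train_batch = per_device_train_batch_size * max(1, n_gpu)
print(effective_train_batch)  # -> 8, matching "adjusted to: 8" in the log above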
Here is the GPU snapshot from that time period.
We mostly suspect it's a DP problem, as only 7 GPUs are being used because we have 7 examples.
Would love to get in touch with you, bro. I am reachable at harshith@onfinance.in
Help is much appreciated.
Update 1:
I tried DeepSpeed with this command: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 ./finetune_demo/finetune_vision.py ./data/ THUDM/glm-4v-9b finetune_demo/configs/lora.yaml
All the GPU cores were used. When we ran with DP and had 2-2-3 data examples in test.jsonl, train.jsonl, and val.jsonl, only 2 cores were being used, so we assumed DP was the problem. When we ran the DeepSpeed command, all cores were used efficiently, but we still ran out of memory; the CUDA error persists.
Please check below for the GPU usage.
Help is much appreciated. Using the same GPU config - NVIDIA A100-SXM4-80GB * 8
@zRzRzRzRzRzRzR Hello. Your help would be much appreciated
Hmm, does it work with one GPU? (Set only one GPU and tune the model with LoRA.) DeepSpeed did not work with this demo in my tests, as I wrote in the README. If one GPU works, the problem may be in DP. I have no way to check right now.
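(A minimal, untested sketch of forcing the demo onto a single GPU: the simpler route is just exporting CUDA_VISIBLE_DEVICES=0 in the shell before the usual python3 finetune_demo/finetune_vision.py call; the wrapper below only illustrates the same idea in Python, with paths taken from this thread.)

import os
import sys
import runpy

# Hide all but one GPU before anything initializes CUDA, so the
# Transformers Trainer cannot wrap the model in DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Reuse the demo entry point with the same arguments as the CLI call.
sys.argv = [
    "finetune_vision.py",
    "./data",
    "THUDM/glm-4v-9b",
    "./finetune_demo/configs/lora.yaml",
]
runpy.run_path("finetune_demo/finetune_vision.py", run_name="__main__")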
Can you please provide us with a fine-tuning script and a LoRA YAML file that tunes the model?
Can you share the one that works on your side, i.e. the one you have tested and found to be working?
Here is what I use:
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: configs/ds_zero_3.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
and I only use 1 x A100
Got it.
The last time you tried tuning this, did the training run successfully?
I'll implement the same setup to debug.
Yes, and I got the adapter weights. The last time was today, haha.
Also, did you change anything else in finetune_vision.py? Just to be on the safe side.
Changing this may work (see the image: https://github.com/THUDM/GLM-4/assets/93239683/b2c607a0-18d2-4031-9079-2b0ac0d2db6c):
change both to 500.
Noted, will try today.
@zRzRzRzRzRzRzR Heyy.
I am able to save the adapter_config.json.
It is failing after this point.
Note: we changed the Seq2Seq training arguments so that we can set both max_steps: 30 and num_train_epochs: 3.
It is able to run 21/30 steps. It is failing in this block:
print('hitting test')
if test_dataset is not None:
    trainer.predict(test_dataset)
Error Logs:
Can you please verify whether you have tried inference.py with the saved model adapter configs?
Your patience is much appreciated!
Here is my updated repo: Link
Also, please share the branch of finetune_vision.py that is working for you.
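(In the meantime, a minimal sketch of a text-only smoke test for the saved adapter; this is not the repo's inference.py, the checkpoint path and prompt are placeholders, and it assumes transformers and peft are installed. A real VQA call would also pass an image through the model's chat template.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "THUDM/glm-4v-9b"
ADAPTER_DIR = "./output/checkpoint-30"  # hypothetical path to the saved LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
# Attach the LoRA weights produced by the fine-tune run.
model = PeftModel.from_pretrained(model, ADAPTER_DIR)
model.eval()

inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))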
For this issue, please install transformers == 4.40.2 and use the main branch (latest commit). I have not run into this issue before.
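(Just a suggestion: a quick sanity check that the training environment really imports the pinned version, in case an older install shadows it.)

import transformers

# The comment above pins transformers == 4.40.2; confirm that this is the
# version actually imported where training runs.
print(transformers.__version__)
assert transformers.__version__ == "4.40.2", transformers.__version__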
Noted. Can you please let me know if a complete vision fine-tune and inference run successfully for you?
Just to confirm whether any other error might come up.
Hello.
I tried the vision fine-tuning script for the glm-4v-9b model. The command I used was:
python3 finetune_demo/finetune_vision.py ./data THUDM/glm-4v-9b ./finetune_demo/configs/lora.yaml
I tried fine-tuning on a sample dataset of 7 examples and am running out of GPU memory (NVIDIA A100-SXM4-80GB * 8).
The data is configured properly and the model is able to parse train.jsonl, test.jsonl, and val.jsonl.
GPU config: NVIDIA A100-SXM4-80GB * 8
Errors
The last logs before the CUDA error:
***** Running training *****
Num examples = 7
Num Epochs = 375
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient Accumulation steps = 1
Total optimization steps = 1,500
Number of trainable parameters = 3,198,976
0%
Steps to recreate the error:
python3 finetune_demo/finetune_vision.py ./data THUDM/glm-4v-9b ./finetune_demo/configs/lora.yaml
Please provide a better guide and explanation for vision fine-tuning.
PS:
In the GLM English docs, we have this: "Execute single machine single card run through the following code."
python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml # For Chat Fine-tune
python finetune.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
Are we sure that
python finetune.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
is the right command?