AkshataABhat opened 1 month ago
Hi, @AkshataABhat, could you provide more details, such as the GPU device and training script?
@HAWLYQ GPU is NVIDIA A100-SXM4-40GB
Training script is:
#!/bin/bash
if [ -n "$MASTER_ADDR" ]; then
echo $MASTER_ADDR
echo $MASTER_PORT
echo $WORLD_SIZE
echo $RANK
else
MASTER_ADDR=127.0.0.1
MASTER_PORT=2$(($RANDOM % 10))$(($RANDOM % 10))15
WORLD_SIZE=1
RANK=0
fi
# Change for multinode config
NNODES=${WORLD_SIZE}
NODE_RANK=${RANK}
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
# GPUS_PER_NODE=1
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS
# change LOAD to your local path of DocOwl1.5-stage1
LOAD='mPLUG/DocOwl1.5-Omni'
# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES x gradient_accumulation_steps
DATA_FILE=train.jsonl
torchrun $DISTRIBUTED_ARGS mplug_docowl/train/train_docowl.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --vision2text_lr 2e-5 \
--deepspeed ./scripts/zero2.json \
--model_name_or_path $LOAD \
--version v1 \
--data_path $DATA_FILE \
--image_folder 'DocOwl1.5/answers/images' \
--image_size 448 \
--crop_anchors 'grid_9' \
--add_global_img True \
--add_textual_crop_indicator True \
--bf16 True \
--output_dir ./checkpoints/docowl1.5-lora \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 4 \
--learning_rate 1e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 3600 \
--gradient_checkpointing True \
--tune_vision2text True \
--freeze_vision_model True \
--freeze_backbone True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to tensorboard
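The batch-size comment in the script can be checked with a quick worked example using the values from this run (assuming a single node with one visible GPU, per the issue):

```python
# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES
#              x gradient_accumulation_steps
per_device_train_batch_size = 1
gpus_per_node = 1   # assumption: one A100-40G visible to nvidia-smi
nnodes = 1
gradient_accumulation_steps = 8

effective_batch_size = (per_device_train_batch_size * gpus_per_node
                        * nnodes * gradient_accumulation_steps)
print(effective_batch_size)  # 8
```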
Hi, @AkshataABhat, the training script seems OK~ I have tested the script with an A100-80G and am not sure whether it works well on an A100-40G~ We will check whether it works on a V100-32G, but due to the work schedule and limited machine resources, that won't be soon. Sorry for that~
@HAWLYQ Here, I am loading the model from Hugging Face instead of a local checkpoint:
LOAD='mPLUG/DocOwl1.5-Omni'
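One way to rule out Hub streaming as the culprit is to fetch the checkpoint once and point `LOAD` at the local directory. This is a hedged sketch using `huggingface_hub.snapshot_download` (the standard Hub download API); the helper name is hypothetical:

```python
def download_checkpoint(repo_id="mPLUG/DocOwl1.5-Omni"):
    # snapshot_download fetches all files of the repo into the local HF cache
    # and returns the directory path; set LOAD to this path so training reads
    # weights from disk instead of streaming them from the Hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)
```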
Also, in train_docowl.py, execution reaches the line below:
data_module = make_supervised_data_module(tokenizer=tokenizer,
                                          data_args=data_args)
About 35 GB of GPU memory is occupied by this step. After this, the trainer is never called:
trainer.train()
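One hedged way to check whether memory is the blocker is to log allocated vs. reserved GPU memory just before `trainer.train()`. This is a hypothetical diagnostic helper, not part of train_docowl.py:

```python
def report_gpu_memory(tag):
    # Returns a one-line summary of how much GPU memory PyTorch has
    # allocated (live tensors) vs. reserved (cached by the allocator).
    try:
        import torch
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 1e9
            reserved = torch.cuda.memory_reserved() / 1e9
            return "[%s] allocated=%.1f GB reserved=%.1f GB" % (tag, alloc, reserved)
    except ImportError:
        pass
    return "[%s] no CUDA device visible" % tag

print(report_gpu_memory("before trainer.train()"))
```

If `allocated` is already close to 40 GB here, the hang is likely an out-of-memory stall during optimizer/DeepSpeed setup rather than a data-loading problem.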
Please advise: is this a GPU (memory) issue, or would the script work if the checkpoint were available locally?
The training does not start: GPU memory is completely occupied but GPU utilization stays at 0%. Screenshot attached below. Please help.