X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

issue while finetuning DocOwl1.5-Omni on dataset #78

Open · AkshataABhat opened this issue 1 month ago

AkshataABhat commented 1 month ago

The training does not start. My GPU memory is completely occupied, but GPU utilization stays at 0%. Screenshot attached below. Please help.

[screenshot: nvidia-smi showing memory fully occupied with 0% GPU utilization]
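The stall pattern (full memory, 0% utilization) can be watched live with a standard nvidia-smi query; this is a generic one-liner, not something from this repo:

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv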

HAWLYQ commented 1 month ago

Hi, @AkshataABhat, could you provide more details, such as the GPU device and training script?

AkshataABhat commented 1 month ago

@HAWLYQ GPU is NVIDIA A100-SXM4-40GB

Training script is:

#!/bin/bash
if [ $MASTER_ADDR ];then
    echo $MASTER_ADDR
    echo $MASTER_PORT
    echo $WORLD_SIZE
    echo $RANK
else
    MASTER_ADDR=127.0.0.1
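    # the next line builds a pseudo-random port of the form 2XY15 (20015-29915) so concurrent runs don't collide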
    MASTER_PORT=2$(($RANDOM % 10))$(($RANDOM % 10))15
    WORLD_SIZE=1
    RANK=0
fi
# Change for multinode config
NNODES=${WORLD_SIZE}
NODE_RANK=${RANK}
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
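# note: nvidia-smi counts all physical GPUs and ignores CUDA_VISIBLE_DEVICES,
# so set GPUS_PER_NODE manually if only some devices should be used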
# GPUS_PER_NODE=1
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS

# change LOAD to your local path of DocOwl1.5-stage1
LOAD='mPLUG/DocOwl1.5-Omni'

# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES x gradient_accumulation_steps
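# e.g. with the settings below on a single GPU: 1 x 1 x 1 x 8 = effective batch size 8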
DATA_FILE=train.jsonl
torchrun $DISTRIBUTED_ARGS mplug_docowl/train/train_docowl.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --vision2text_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $LOAD \
    --version v1 \
    --data_path $DATA_FILE \
    --image_folder 'DocOwl1.5/answers/images' \
    --image_size 448 \
    --crop_anchors 'grid_9' \
    --add_global_img True \
    --add_textual_crop_indicator True \
    --bf16 True \
    --output_dir ./checkpoints/docowl1.5-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 4 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 3600 \
    --gradient_checkpointing True \
    --tune_vision2text True \
    --freeze_vision_model True \
    --freeze_backbone True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard
HAWLYQ commented 1 month ago

Hi, @AkshataABhat, the training script seems OK~ I have tested it on an A100-80G and am not sure whether it works well on an A100-40G~ We will check whether it works on a V100-32G, but due to the work schedule and limited machine resources, this won't happen soon, sorry about that~

AkshataABhat commented 1 month ago

@HAWLYQ here, I am loading the model from Hugging Face instead of a local checkpoint.

LOAD='mPLUG/DocOwl1.5-Omni'

Also, in train_docowl.py, the code executes up to the line below:

data_module = make_supervised_data_module(tokenizer=tokenizer,
                                              data_args=data_args)

About 35 GB of GPU memory is occupied by this step. After this, the trainer is never called:

trainer.train()

Please advise: is this a GPU issue, or would the script work if the checkpoints were available locally?
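One way to see where the worker is stuck would be to dump its Python stack with py-spy (a minimal sketch; py-spy is a generic profiler, not part of this repo, and <pid> is the hanging train_docowl.py process):

pip install py-spy
py-spy dump --pid <pid>

If the dump ends inside torch.distributed or NCCL initialization, rerunning with NCCL_DEBUG=INFO set should show where the rendezvous stalls.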