Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Question about Multi-gpu training. #242

Open qyx1121 opened 1 year ago

qyx1121 commented 1 year ago

Hi, I trained on eight 48 GB GPUs, using the following training script:

export PYTHONPATH=.

# alternatively: --pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=luodian/OTTER-LLaMA7B-INIT \
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \
--batch_size=32 \
--num_epochs=3 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--run_name=OTTER-LLaMA7B-densecaption \
--wandb_project=OTTER-LLaMA7B \
--workers=4 \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01

and I didn't change anything in "./pipeline/accelerate_configs/accelerate_config_fsdp.yaml". During training, I found that the utilization of every GPU except the last one was very low. Is this normal?

[screenshots: GPU utilization]
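For reference, the repo's actual accelerate_config_fsdp.yaml may differ, but a Hugging Face Accelerate FSDP config for a single 8-GPU machine typically looks roughly like the sketch below. The field names follow Accelerate's standard config format; the concrete values here are illustrative assumptions, not the repo's settings. One thing worth checking for this symptom is that num_processes matches the number of GPUs, since Accelerate launches one rank per process and GPUs without a rank will sit idle.

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP   # wrap transformer blocks as FSDP units
  fsdp_backward_prefetch_policy: BACKWARD_PRE     # prefetch next set of params during backward
  fsdp_offload_params: false                      # keep parameters on GPU
  fsdp_sharding_strategy: 1                       # 1 = FULL_SHARD across all ranks
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8        # one process per GPU; should equal the number of GPUs you expect to use
rdzv_backend: static
use_cpu: false

A quick way to confirm whether all eight ranks are actually doing work is to watch nvidia-smi in a second terminal (e.g. watch -n 1 nvidia-smi) while training steps run.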