Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Question about the pre-training process #313

Closed: yuezewang closed this issue 5 months ago

yuezewang commented 7 months ago

Hello, thanks for your great work! Could you share the complete process (commands, scripts, parameters, etc.) for pre-training from scratch? Looking forward to your reply.

Luodian commented 7 months ago

Hi, did you mean pretraining on mmc4/laion2b?

yuezewang commented 7 months ago

> Hi, did you mean pretraining on mmc4/laion2b?

Yes, pretraining on mmc4/laion2b. I would also like to know how to evaluate the pretrained model.

Luodian commented 7 months ago

I can provide one later this week; sorry, I've been busy with semester finals recently.

Luodian commented 7 months ago

Hi, here's our pretraining script for mmc4/laion2b:

export PYTHONPATH=.

accelerate launch --config_file=/home/luodian/projects/Otter/scripts/accelerate_config_fsdp.yaml \
pipeline/train/pretraining.py \
--pretrained_model_name_or_path=/home/luodian/projects/checkpoints/flamingo-mpt-30B-pretrain-mix-bf16 \
--dataset_resampled \
--batch_size_mmc4=16 \
--batch_size_laion=32 \
--num_epochs=3 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--mmc4_shards=/home/luodian/projects/data/ffv3_wds/000000{000..270}.tar \
--train_num_samples_mmc4=600000 \
--laion_shards=/home/luodian/projects/data/laion400m/tar/{0000..0810}.000.tar \
--train_num_samples_laion=1200000 \
--run_name=flamingo-mpt-30B-pretrain-mix-forcebf16-stage2 \
--wandb_project=flamingo-mpt-pretrain \
--external_save_dir=/home/luodian/projects/checkpoints \
--checkpointing_steps=10000 \
--save_hf_model \
--workers=16 \
--lr_scheduler=cosine \
--delete_previous_checkpoint \
--learning_rate=1e-4 \
--warmup_steps_ratio=0.005
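
The accelerate_config_fsdp.yaml referenced above isn't shown in this thread. As a rough, minimal sketch (not the repo's actual file), an FSDP config for accelerate looks roughly like the following; field names and accepted values vary with the accelerate version, and num_processes, num_machines, and the transformer layer class to wrap depend on your hardware and language-model backbone:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8            # one process per GPU; adjust to your setup
machine_rank: 0
main_training_function: main
fsdp_config:
  fsdp_sharding_strategy: 1               # 1 = FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  # fsdp_transformer_layer_cls_to_wrap: <decoder block class of your backbone>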

Here's the one for CC3M:

export PYTHONPATH=.

accelerate launch --config_file=/home/luodian/projects/Otter/scripts/accelerate_config_zero2.yaml \
pipeline/train/pretraining_cc3m.py \
--pretrained_model_name_or_path=/home/luodian/projects/checkpoints/flamingo-llama2-chat-13B-cc3m \
--dataset_resampled \
--batch_size_cc3m=128 \
--num_epochs=1 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--cc3m_shards=/home/luodian/projects/data/cc3m/tar/00{000..311}.tar \
--train_num_samples_cc3m=3000000 \
--run_name=flamingo-llama2-chat-13B-cc3m \
--wandb_project=flamingo-llama2-pretrain \
--external_save_dir=/home/luodian/projects/checkpoints \
--checkpointing_steps=10000 \
--save_hf_model \
--workers=48 \
--lr_scheduler=cosine \
--delete_previous_checkpoint \
--learning_rate=1e-4 \
--warmup_steps_ratio=0.005
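
Likewise, accelerate_config_zero2.yaml isn't included here; a minimal sketch of a DeepSpeed ZeRO-2 config for accelerate (again, only illustrative; adjust num_processes, mixed precision, and offload settings to your setup) could look like:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_machines: 1
num_processes: 8            # one process per GPU
deepspeed_config:
  zero_stage: 2                           # ZeRO-2: shard optimizer states and gradients
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false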

To prepare the dataset in MMC4 format, see https://github.com/allenai/mmc4. For LAION, use https://github.com/rom1504/img2dataset. CC3M is also prepared with img2dataset.
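
As a rough illustration of the img2dataset step (not an official command from this repo; the column names depend on how you export the CC3M TSV, the output path is a placeholder, and processes/threads should match your machine), downloading CC3M into webdataset shards looks roughly like:

img2dataset --url_list cc3m.tsv --input_format tsv \
  --url_col url --caption_col caption \
  --output_format webdataset --output_folder /path/to/cc3m/tar \
  --processes_count 16 --thread_count 64 --image_size 256

The resulting .tar shards can then be passed to --cc3m_shards (or --laion_shards for a LAION download) using brace expansion, as in the scripts above.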

This will involve a lot of effort, wishing you luck~