Closed · yuezewang closed this 5 months ago
Hi, did you mean pretraining on mmc4/laion2b?
Yes, pretraining on mmc4/laion2b. I'd also like to know how to evaluate the pretrained model.
I can provide one later this week; sorry, I've been busy with semester finals recently.
Hi, here's our pretraining script for mmc4/laion2b:
```bash
export PYTHONPATH=.
accelerate launch --config_file=/home/luodian/projects/Otter/scripts/accelerate_config_fsdp.yaml \
  pipeline/train/pretraining.py \
  --pretrained_model_name_or_path=/home/luodian/projects/checkpoints/flamingo-mpt-30B-pretrain-mix-bf16 \
  --dataset_resampled \
  --batch_size_mmc4=16 \
  --batch_size_laion=32 \
  --num_epochs=3 \
  --report_to_wandb \
  --wandb_entity=ntu-slab \
  --mmc4_shards=/home/luodian/projects/data/ffv3_wds/000000{000..270}.tar \
  --train_num_samples_mmc4=600000 \
  --laion_shards=/home/luodian/projects/data/laion400m/tar/{0000..0810}.000.tar \
  --train_num_samples_laion=1200000 \
  --run_name=flamingo-mpt-30B-pretrain-mix-forcebf16-stage2 \
  --wandb_project=flamingo-mpt-pretrain \
  --external_save_dir=/home/luodian/projects/checkpoints \
  --checkpointing_steps=10000 \
  --save_hf_model \
  --workers=16 \
  --lr_scheduler=cosine \
  --delete_previous_checkpoint \
  --learning_rate=1e-4 \
  --warmup_steps_ratio=0.005
```
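As a quick sanity check before launching, you can count how many shards each brace pattern expands to — bash expands `{000..270}` with zero-padding, the same range syntax webdataset accepts. The `.tar` names below are name lists only, taken from the patterns above; substitute your real data directories when checking on disk:

```shell
# Count shard names produced by the brace patterns used in the command above.
# These expansions are pure name lists; no files need to exist.
mmc4_shards=( 000000{000..270}.tar )
laion_shards=( {0000..0810}.000.tar )
echo "mmc4:  ${#mmc4_shards[@]} shards"   # 271
echo "laion: ${#laion_shards[@]} shards"  # 811
```

If a count doesn't match what you downloaded, the training job will stall or undersample, so it's worth verifying before a multi-day run.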
Here's CC3M's:
```bash
export PYTHONPATH=.
accelerate launch --config_file=/home/luodian/projects/Otter/scripts/accelerate_config_zero2.yaml \
  pipeline/train/pretraining_cc3m.py \
  --pretrained_model_name_or_path=/home/luodian/projects/checkpoints/flamingo-llama2-chat-13B-cc3m \
  --dataset_resampled \
  --batch_size_cc3m=128 \
  --num_epochs=1 \
  --report_to_wandb \
  --wandb_entity=ntu-slab \
  --cc3m_shards=/home/luodian/projects/data/cc3m/tar/00{000..311}.tar \
  --train_num_samples_cc3m=3000000 \
  --run_name=flamingo-llama2-chat-13B-cc3m \
  --wandb_project=flamingo-llama2-pretrain \
  --external_save_dir=/home/luodian/projects/checkpoints \
  --checkpointing_steps=10000 \
  --save_hf_model \
  --workers=48 \
  --lr_scheduler=cosine \
  --delete_previous_checkpoint \
  --learning_rate=1e-4 \
  --warmup_steps_ratio=0.005
```
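Both commands point `--config_file` at local accelerate configs that aren't shown in the thread. As a rough orientation, a minimal FSDP config might look like the sketch below — every field value here is an assumption, not the authors' actual `accelerate_config_fsdp.yaml`, which may differ in sharding strategy, wrap policy, and process count:

```shell
# Write a hypothetical minimal accelerate FSDP config; values are assumptions,
# not the settings from the Otter repo. Pass the resulting file to --config_file.
cat > accelerate_config_fsdp.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8
fsdp_config:
  fsdp_sharding_strategy: 1
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: FULL_STATE_DICT
EOF
```

You can also run `accelerate config` interactively to generate one tailored to your machine; the zero2 variant would use `distributed_type: DEEPSPEED` with a DeepSpeed ZeRO stage-2 section instead.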
To prepare the MMC4-format dataset, see: https://github.com/allenai/mmc4
For LAION, use: https://github.com/rom1504/img2dataset
CC3M is also prepared with img2dataset.
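For the LAION/CC3M side, img2dataset can write webdataset tar shards directly, which matches the `--laion_shards`/`--cc3m_shards` patterns above. A hedged sketch — the metadata file, column names, image size, and output path are placeholders you'd adapt to your own metadata dump (for LAION-400M the parquet columns are `URL` and `TEXT`):

```shell
# Illustrative img2dataset invocation producing webdataset tar shards.
# laion_metadata.parquet and the output path are placeholders.
img2dataset \
  --url_list laion_metadata.parquet \
  --input_format parquet \
  --url_col URL \
  --caption_col TEXT \
  --output_format webdataset \
  --output_folder /path/to/laion400m/tar \
  --image_size 256 \
  --processes_count 16 \
  --thread_count 64
```

The same tool works for CC3M by pointing `--url_list` at the CC3M TSV and adjusting `--input_format` and the column flags accordingly.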
This involves a lot of effort, so good luck~
Hello, thanks for your great work! I would like to know the complete process (commands, scripts, parameters, etc.) for pretraining from scratch. Looking forward to your reply.