facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

[data2vec] different configs in data2vec_vision and paper #4333

Open kobiso opened 2 years ago

kobiso commented 2 years ago

❓ Questions and Help

Hello :) I found some differences between the configs in the data2vec_vision README and the paper.

Q1: weight_decay

In the data2vec_vision README, the script to finetune the ViT-B model is as follows:

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
        --model beit_base_patch16_224 \
        --finetune $CHECKPOINT \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --batch_size 128 --lr 4e-3 --update_freq 1 \
        --warmup_epochs 10 --epochs 100 --layer_decay 0.65 --drop_path 0.2 --drop 0.0 \
        --weight_decay 0.0 --mixup 0.8 --cutmix 1.0 --enable_deepspeed --nb_classes 1000 \
        --target_layer -1 --world_size 8 --dist_url $dist_url 

However, if I set --weight_decay 0.0, the following error occurs:

Traceback (most recent call last):
  File "run_class_finetuning.py", line 713, in <module>
    main(opts, ds_init)
  File "run_class_finetuning.py", line 654, in main
    train_stats = train_one_epoch(
  File "/home/shared/workspace/nfs_mae/d2v/engine_for_finetuning.py", line 66, in train_one_epoch
    param_group["lr"] = lr_schedule_values[it] * param_group["lr_scale"]
KeyError: 'lr_scale'

If I set --weight_decay 0.05 as in BEiT, I can run the experiment. Can you clarify this?
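
In case it helps with debugging: the failing line assumes every optimizer parameter group carries an "lr_scale" entry, and that key is apparently missing for some groups when --weight_decay 0.0 is used. Below is a minimal standalone sketch of the defensive change that avoids the KeyError (my own guess at a workaround, not code from this repo; the schedule value is a stand-in):

# Standalone sketch: default the layer-wise LR scale to 1.0 when a param group
# has no "lr_scale" key, instead of indexing it directly and raising KeyError.
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.0)

lr_schedule_values = [4e-3]  # stand-in for the per-iteration LR schedule
it = 0
for param_group in optimizer.param_groups:
    # param_group.get("lr_scale", 1.0) instead of param_group["lr_scale"]
    param_group["lr"] = lr_schedule_values[it] * param_group.get("lr_scale", 1.0)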

Q2: warmup for finetuning ViT-B

In the paper, the warmup for finetuning ViT-B is 20 epochs:

we warm up the learning rate for 20 epochs to 0.004 for ViT-B and for 5 epochs to 0.004 for ViT-L, after which the learning rate follows the cosine schedule.

However, in the script above it is set to --warmup_epochs 10. Which one is correct?
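
For reference, the schedule the paper describes is a linear warmup to the peak learning rate followed by cosine decay. Here is a small self-contained sketch (not the repo's scheduler, just my reading of the paper text) showing how the warmup length changes the curve:

import math

def lr_at_epoch(epoch, peak_lr=4e-3, warmup_epochs=20, total_epochs=100, min_lr=0.0):
    # Linear warmup to peak_lr, then cosine decay to min_lr.
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# With --warmup_epochs 10 the peak LR is reached ten epochs earlier than with the
# paper's 20-epoch warmup, so most of training sees a different learning rate.
print(lr_at_epoch(9, warmup_epochs=10), lr_at_epoch(9, warmup_epochs=20))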

Q3: warmup for finetuning ViT-L

The script to finetune the ViT-L model in the README is as follows:

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_cyclical.py \
        --model beit_large_patch16_224 \
        --finetune $CHECKPOINT \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --batch_size 64 --lr 5e-3 --update_freq 1 \
        --warmup_epochs $WARMUP --epochs 50 --layer_decay 0.65 --drop_path 0.25 --drop 0.0 \
        --weight_decay 0.05 --mixup 0.8 --cutmix 1.0 --enable_deepspeed --nb_classes 1000 --seed 0 \
        --target_layer -1 --world_size 16 --dist_url $dist_url --attn_drop_rate 0.0

What is $WARMUP in the script? Is it 5 as mentioned in the paper?

Thanks in advance!

Yingdong-Hu commented 2 years ago

@alexeib I can't reproduce the classification accuracy (84.2%) reported in the paper using the following script:

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
        --model beit_base_patch16_224 \
        --finetune $CHECKPOINT \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} --log_dir ${OUTPUT_DIR} --batch_size 128 --lr 4e-3 --update_freq 1 \
        --warmup_epochs 10 --epochs 100 --layer_decay 0.65 --drop_path 0.2 --drop 0.0 \
        --weight_decay 0.0 --mixup 0.8 --cutmix 1.0 --enable_deepspeed --nb_classes 1000 \
        --target_layer -1 --world_size 8 --dist_url $dist_url 

The final accuracy I got is 83.96%.

I noticed that you provided finetuned checkpoints. As a sanity check, can you provide the commands to run evaluation using your ImageNet fine-tuned models?

kobiso commented 2 years ago

Update on my initial question: I could run run_class_finetuning.py with --weight_decay 0.0 if I used --enable_deepspeed. However, the resulting accuracy was 83.894%, which is lower than the paper's score (84.2%), as @Alxead mentioned. How can I reproduce the paper's score?

alexeib commented 2 years ago

@arbabu123