facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

DINOv2 performance vs DINO #272

Open rbareja25 opened 8 months ago

rbareja25 commented 8 months ago

Hi,

I am working on a custom medical imaging binary classification task. I am comparing the performance of DINO (ViT-Base) and DINOv2 (ViT-Large): after training both models, I evaluate them with the eval_linear code. With DINO (ViT-Base) trained for 100 epochs I get 85% classification accuracy on the test dataset, whereas with DINOv2 trained for 300 epochs (using the training_362499/teacher_checkpoint model) I get only 50% accuracy on the same test dataset.

I know it's hard to pinpoint the reason for this, but are there any ideas I could try, or anything I might be doing wrong? Any suggestions would be appreciated.

Thanks, Rohan

vladchimescu commented 8 months ago

Hi @rbareja25, I am experiencing similar issues, i.e. I am struggling to reproduce DINO results using DINOv2.

In our case we used exactly the same hyperparameters as in DINO. We also conducted an ablation study, switching off the KoLeo and iBOT losses, without any success.

@qasfb @patricklabatut Would you expect results comparable to DINO if you trained DINOv2 with ibot_loss_weight = 0, koleo_loss_weight = 0, and otherwise identical hyperparameters?

alexaatm commented 7 months ago

Hi @vladchimescu @rbareja25, I faced the same issue with segmentation based on DINO vs DINOv2 features. A closer look at the literature shows the authors are aware of such a discrepancy.

From the "Vision Transformers Need Registers" paper:

> When used to extract features, it delivers disappointing performance, only on par with supervised alternative backbones in this scenario. This suggests that DINOv2 behaves differently than DINO. The investigation described in this work notably exposes the presence of artefacts in the feature maps of DINOv2 that were not present in the first version of this model.
>
> […]
>
> We note that we have not been able to fully determine which aspects of the training led to the appearance of artifacts in DINOv2 but not in DINO, but Fig. 4 suggests that scaling the model size beyond ViT-L, and longer training length may be possible causes.

Hope it helps to clear up some confusion.

P.S. I haven't tried DINOv2 with registers yet; if anything changes, I'll report back here. :)
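
One practical way to check whether these artifacts show up in your own features is to look at the patch-token norm distribution, since the registers paper characterizes the artifacts as high-norm outlier tokens. Below is a minimal sketch, assuming the torch.hub entrypoint dinov2_vitl14 and the backbone's get_intermediate_layers method; the 3-sigma cutoff and the dummy input are purely illustrative, and you would load your own checkpoint when debugging a custom run:

import torch

# Pretrained DINOv2 backbone from torch.hub; swap in a locally trained
# checkpoint if you are debugging your own run.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()

# Dummy batch of 224x224 images; replace with real, properly normalized data.
x = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    # Returns a tuple of patch-token tensors, each of shape (B, num_patches, dim).
    (patch_tokens,) = model.get_intermediate_layers(x, n=1)

norms = patch_tokens.norm(dim=-1)           # (B, num_patches)
threshold = norms.mean() + 3 * norms.std()  # illustrative outlier cutoff
print("fraction of high-norm tokens:", (norms > threshold).float().mean().item())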

vladchimescu commented 7 months ago

Hi @alexaatm, thank you for pointing this out.

@qasfb @patricklabatut I believe this is an implementation issue. We set KoLeo and iBOT loss weights to zero and we could not reproduce the good performance of DINOv1, despite using the same hyperparameters.

We haven't tried ViTs with registers as we're training DINOv2 + ViT-S. The paper that you cite @alexaatm suggests that registers are needed for large ViTs trained on massive datasets.

qasfb commented 7 months ago

Hey! Can you say a bit more about your setup: how many GPUs, what batch size, and what config in general?

vladchimescu commented 7 months ago

@qasfb Sure! We are using NVIDIA A100 GPUs (80 GB GPU memory). For the ViT-S backbone, we only need 1 GPU, but we also tried FSDP with 2 GPUs.

We used the same batch size (256) as for DINO (Caron et al., 2021). We found that for our custom dataset, a batch size <= 256 was beneficial.

To reproduce our DINO baseline, we switched off iBOT and KoLeo losses and used the same student and teacher hyperparameters that we had previously used for DINO. I'm pasting the config file below:

MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 10000
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0
ibot:
  loss_weight: 0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 10000
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 256
  output_dir: .
  saveckp_freq: 10
  seed: 0
  num_workers: 12
  OFFICIAL_EPOCH_LENGTH: 2204
  cache_dataset: true
  centering: "centering" # or "sinkhorn_knopp"
student:
  arch: vit_small
  patch_size: 16
  drop_path_rate: 0
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: "mlp"
  block_chunks: 0
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
teacher:
  momentum_teacher: 0.9995
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.01
  teacher_temp: 0.04 # TODO this should be set to 0.04 (!)
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 200
  weight_decay: 0.04
  weight_decay_end: 0.4
  base_lr: 0.001  # learning rate for a batch size of 256
  lr: 0.  # will be set after applying scaling rule
  warmup_epochs: 20
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 3
  scaling_rule: multiple_of_256
  patch_embed_lr_mult: 0.2
  layerwise_decay: 0.9
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 96
evaluation:
  eval_period_iterations: 12500

qasfb commented 7 months ago

How did you pick the hyperparameters in this config? I see layerwise decay 0.9 and momentum teacher 0.9995, and I'm pretty sure these were not in DINO. Similarly, layer scale is not used in DINO either.

vladchimescu commented 7 months ago

@qasfb The teacher momentum hyperparameter is from the original DINO; we used the value that produced the best results for us.

Indeed, layer scale was not part of the original DINO. We also ran an experiment without layer scale, which didn't make a difference.

I am not sure what layerwise_decay does; I left it as it was in one of the config files I found in this repo. What would be the equivalent value of layerwise_decay for DINO (v1)?

Do you have any other pointers regarding what is different compared to vanilla DINO (after switching off iBOT and KoLeo)?

qasfb commented 7 months ago

To disable it: layerwise_decay = 1.0. Otherwise I don't know; I have not tried to reproduce DINO with this codebase, so I don't know for sure whether it would work well.
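
For context on what layerwise_decay does: it typically scales each block's learning rate by decay raised to the block's distance from the top of the network, so earlier layers train with a smaller lr. A rough sketch of that general scheme (an illustration, not necessarily this codebase's exact implementation), which also shows why 1.0 turns it off:

def layerwise_lr_multipliers(num_blocks, decay):
    # Index 0 is the earliest parameter group (e.g. patch embedding);
    # the topmost block gets a multiplier of decay ** 0 = 1.0.
    return [decay ** (num_blocks - i) for i in range(num_blocks + 1)]

# decay = 0.9 with 12 blocks: the earliest group gets ~0.28x of the lr.
print(layerwise_lr_multipliers(12, 0.9))
# decay = 1.0: every multiplier is 1.0, i.e. layer-wise decay is disabled.
print(layerwise_lr_multipliers(12, 1.0))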

rbareja25 commented 7 months ago

My config file is here:

MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0.1
ibot:
  loss_weight: 1.0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 8
  dataset_path: ImageNet:split=TRAIN
  output_dir: /home/rbareja/dinov2/dinov2path_32tumors_1000patches_300ep
  saveckp_freq: 20
  seed: 0
  num_workers: 10
  OFFICIAL_EPOCH_LENGTH: 1250
  cache_dataset: true
  centering: centering
student:
  arch: vit_large
  patch_size: 16
  drop_path_rate: 0.3
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: mlp
  block_chunks: 4
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
teacher:
  momentum_teacher: 0.992
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.04
  teacher_temp: 0.07
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 300
  weight_decay: 0.04
  weight_decay_end: 0.4
  base_lr: 0.004
  lr: 0.0007071067811865476
  warmup_epochs: 10
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 1
  scaling_rule: sqrt_wrt_1024
  patch_embed_lr_mult: 0.2
  layerwise_decay: 0.9
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 96
evaluation:
  eval_period_iterations: 12500

If we turn off the changes you mentioned, which would happen: 1) DINOv2 performance improves because we have turned off the hyperparameters that might be hurting it, or 2) the result is basically the same as DINO?
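
(Side note on the lr field in this config: the value appears consistent with the sqrt_wrt_1024 scaling rule applied to base_lr, i.e. lr = base_lr * sqrt(total_batch_size / 1024), assuming a total batch size of 32, e.g. 4 GPUs at batch_size_per_gpu = 8. A quick check:)

import math

base_lr = 0.004
total_batch_size = 32  # assumption: 4 GPUs x batch_size_per_gpu of 8

# sqrt_wrt_1024: scale the base lr by sqrt(total batch size / 1024)
lr = base_lr * math.sqrt(total_batch_size / 1024)
print(lr)  # ~0.000707107, matching the lr value in the config above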

ironb25 commented 4 months ago

Hi @vladchimescu - were you able to improve DINOv2 performance? Could you share any suggestions?

ironb25 commented 4 months ago

> How did you pick the hyperparameters in this config? I see layerwise decay 0.9 and momentum teacher 0.9995, and I'm pretty sure these were not in DINO. Similarly, layer scale is not used in DINO either.

I see that the momentum teacher hyperparameter is present in DINO.

(screenshot: momentum_teacher in the original DINO configuration)

risratna commented 4 months ago

Hey @rbareja25 @vladchimescu @qasfb ,

What is your ETA when using 1 GPU (A100 80 GB), and how many images are you using?

Also, @vladchimescu, you set your OFFICIAL_EPOCH_LENGTH to 2204 and batch size to 256; does this mean you have 2204 × 256 training samples?

I am trying to train on a custom dataset as well. I am getting an ETA of (eta: 142 days, 10:53:15) using 1 A100 80 GB GPU, which seems very high for 80K images. Here is my config:

MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0.1
ibot:
  loss_weight: 1.0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 64
  dataset_path: ChestX_ray14
  output_dir: /scratch/rnolas66/checkpoints/dinov2/vit-base-random
  saveckp_freq: 20
  seed: 0
  num_workers: 8
  OFFICIAL_EPOCH_LENGTH: 1250
  cache_dataset: true
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 14
  drop_path_rate: 0.4
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: swiglufused
  block_chunks: 4
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
  num_register_tokens: 0
  interpolate_antialias: false
  interpolate_offset: 0.1
teacher:
  momentum_teacher: 0.994
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.04
  teacher_temp: 0.07
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 500
  weight_decay: 0.04
  weight_decay_end: 0.2
  base_lr: 0.0002
  lr: 2.1650635094610966e-05
  warmup_epochs: 80
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 1
  scaling_rule: sqrt_wrt_1024
  patch_embed_lr_mult: 0.2
  layerwise_decay: 1.0
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 98
evaluation:
  eval_period_iterations: 12500

Do you know what I could be doing wrong here?
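
(One thing worth double-checking is how OFFICIAL_EPOCH_LENGTH, batch size, and the epoch count combine, since total training time scales with epochs x OFFICIAL_EPOCH_LENGTH optimizer steps. A rough back-of-envelope for the config above, assuming a single GPU as stated and roughly 80K images:)

batch_size_per_gpu = 64
num_gpus = 1                   # as stated above
official_epoch_length = 1250   # iterations per "epoch" in the config
epochs = 500
dataset_size = 80_000          # approximate, as stated above

images_per_epoch = official_epoch_length * batch_size_per_gpu * num_gpus
total_iterations = epochs * official_epoch_length
passes_over_dataset = total_iterations * batch_size_per_gpu * num_gpus / dataset_size

print(images_per_epoch)      # 80000: one "epoch" here is roughly one full pass over the data
print(total_iterations)      # 625000 optimizer steps in total, which is what drives the ETA
print(passes_over_dataset)   # 500.0 passes over the dataset

Reducing epochs or OFFICIAL_EPOCH_LENGTH shortens the ETA proportionally; whether 500 full passes over 80K images are needed is a separate question.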

ayushnangia commented 6 days ago

Any clue about what happened?