LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
MIT License

Enquiries about choice of SSL methods #3

Closed jinhong-ni closed 8 months ago

jinhong-ni commented 8 months ago

Dear Authors,

Thanks for open-sourcing your great work! I have a few questions regarding the choice of the SSL image encoder in your pipeline.

In the paper, the experiments were mainly conducted with contrastive learning methods such as MoCo v3. These methods undoubtedly offer a compact and informative representation space for conditioning. The denoising process operates on this representation space, and a decoder (pixel generator) is then trained separately to map representations back to pixel space.
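The three-stage pipeline described above (frozen SSL encoder, representation-space diffusion, separately trained pixel generator) can be sketched end to end. Everything below is an illustrative placeholder under stated assumptions, not RCG's actual API or model code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the three RCG stages; names and shapes are
# illustrative only, not taken from the repo.
REP_DIM = 256

def frozen_ssl_encoder(image):
    # Stand-in for a pre-trained, frozen encoder (e.g. a MoCo v3 ViT):
    # maps an image to a compact representation vector.
    return rng.standard_normal(REP_DIM)

def rep_diffusion_sample(num_steps=10):
    # Toy sampling loop over the representation space: start from noise
    # and iteratively "denoise" (here, just shrink toward zero as a
    # stand-in for one learned denoising step per iteration).
    z = rng.standard_normal(REP_DIM)
    for _ in range(num_steps):
        z = 0.9 * z
    return z

def pixel_generator(rep, image_shape=(32, 32, 3)):
    # Stand-in for the separately trained decoder conditioned on the
    # representation.
    return rng.standard_normal(image_shape) + rep.mean()

# Unconditional generation: sample a representation, then decode it.
z = rep_diffusion_sample()
img = pixel_generator(z)
```

The point of the sketch is the interface: the encoder is frozen, generation happens in representation space, and the pixel generator only ever sees representations, never labels.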

My concern arises from this aspect: masked image modeling (MIM) methods like MAE train a decoder jointly with the image encoder, so they seem a natural fit for representation conditioning and pixel reconstruction. Is there a particular reason you chose contrastive learning methods over MIM methods? What are the main benefits of contrastive learning over MIM here?

Thanks in advance for your answer. Please correct me if I have misunderstood any part of your work.

LTH14 commented 8 months ago

Thanks for your interest! Our choice of the pre-trained encoder is mainly based on its linear-probing performance, since the encoder is kept frozen during RCG's training. We also tried an image encoder trained with supervised learning (DeiT), which also yielded good results (5.51 FID and 211.7 IS, as shown in the paper). MIM-based methods such as MAE provide strong fine-tuning performance, but their linear-probing performance lags behind contrastive learning. Consistent with this, when we tried an MAE-based image encoder, the performance was not as good as with encoders pre-trained via contrastive learning.
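The selection criterion above is the standard linear-probing protocol: freeze the encoder, fit only a linear classifier on its features, and measure accuracy. A minimal sketch with synthetic data (the features, dimensions, and closed-form probe below are all illustrative assumptions, not the paper's actual evaluation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen-encoder features that weakly encode the
# label: 512 samples, 64-dim features, 4 classes.
n, d, classes = 512, 64, 4
labels = rng.integers(0, classes, size=n)
feats = rng.standard_normal((n, d))
feats[:, :classes] += 3.0 * np.eye(classes)[labels]  # inject class signal

# One-vs-all least-squares probe (closed form), a minimal stand-in for
# the SGD-trained linear layer used in practice. The encoder is never
# updated; only the linear map W is fit.
onehot = np.eye(classes)[labels]
W, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
correct = (feats @ W).argmax(axis=1) == labels
print(f"linear probe accuracy: {correct.mean():.2f}")
```

Under this protocol, feature quality alone determines accuracy, which is why it is a sensible criterion when the encoder will stay frozen downstream.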

jinhong-ni commented 8 months ago

Thanks for your prompt and constructive response! This clarifies most of my concerns. Just a couple of follow-up questions.

  1. When you experimented with MAE-based image encoders, did you try fine-tuning the MAE decoder, or did you only experiment with training a separate pixel generator?
  2. For linear probing, could you please share some insight into why MAE-based methods might lag behind contrastive learning ones?

LTH14 commented 8 months ago

  1. We did not fine-tune the MAE decoder, as it can only reconstruct, not generate (MAE's reconstruction process is deterministic). When ablating different encoders, our pixel generator is set to MAGE-B.
  2. As reported in the MAE and MoCo v3 papers, on ImageNet-1K a ViT-B trained with MAE achieves 68.0% linear-probing accuracy, while a ViT-B trained with MoCo v3 achieves 76.7%.

jinhong-ni commented 8 months ago

Thanks again for your detailed responses! These address my concerns.