NVlabs / RADIO

Official repository for "AM-RADIO: Reduce All Domains Into One"

Fine-tuning on downstream tasks directly #84

Open githubiubiu opened 2 weeks ago

githubiubiu commented 2 weeks ago

Hello! Thanks for this amazing work. I want to know how to use the RADIO model for fine-tuning on downstream tasks (not necessarily classification). For ViT-L/14, is it possible to load only the backbone parameters, including the multiple CLS tokens (similar to loading ImageNet pre-trained weights), or is it necessary to also load the DINO/CLIP heads? My downstream task is similar to defect detection. Thank you very much for your reply!

githubiubiu commented 2 weeks ago

Also, when I replaced the existing DINOv2 ViT-L/14 (pretrained weights only) with RADIO ViT-L/14, accuracy dropped. I am not sure whether this is caused by incorrect use of RADIO, which bothers me.

gheinrich commented 2 weeks ago

Hello, yes, RADIO is very much designed to be used in downstream applications. We usually keep the backbone frozen and train a task-specific head on top of the shared backbone features. Have you seen the semantic segmentation example? RADIO has also been integrated as a backbone in Probe3D.

Can I check with you that your inputs into RADIO are RGB values in the [0,1] range?
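For reference, here's a minimal usage sketch along the lines of the repo README (the `radio_v2.5-l` version string is one of the published checkpoints; adjust it to whichever model you're using). The key point is that inputs should be RGB in [0, 1], and the model's own input_conditioner handles normalization internally:

```python
import torch

# Load RADIO from TorchHub (version string per the repo README).
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)
model.cuda().eval()

# RADIO expects RGB values in [0, 1]; its input_conditioner normalizes
# internally, so don't apply your own mean/std normalization first.
x = torch.rand(1, 3, 512, 512, device='cuda')

with torch.no_grad():
    summary, spatial_features = model(x)
```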

githubiubiu commented 2 weeks ago

Thanks for your quick reply. I followed your instructions to freeze the entire backbone and only train the head, but it didn't work for my task. Adding LoRA to the backbone gave me better results, but still inferior to DINOv2. As for data preprocessing, I first normalized the image myself (mean/std) and replaced the input_conditioner with nn.Identity().
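Concretely, my preprocessing looks roughly like this (a sketch; the ImageNet statistics are the ones I used, and RADIO's own conditioner may use different ones):

```python
import torch
from torch import nn

model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)
model.eval()

# ImageNet statistics I used; RADIO's conditioner may expect different ones.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

x_unit = torch.rand(1, 3, 512, 512)      # RGB in [0, 1]
x_pre = (x_unit - mean) / std            # pre-normalized input

model.input_conditioner = nn.Identity()  # bypass the built-in normalization
with torch.no_grad():
    summary, feats = model(x_pre)

# Caveat: this only reproduces the intended preprocessing if (mean, std)
# match the conditioner's own statistics; a mismatch silently hurts features.
```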

mranzinger commented 1 week ago

One thing that comes to mind is that RADIOv2.5-L is a ViT-L/16 model, not a /14 model. Have you ensured that you're handling that difference in patch size properly? For example, running DINOv2 at 448px is equivalent to running RADIOv2.5-L at 512px, since either model then processes the same number of tokens, +/- some negligible compute.

githubiubiu commented 5 days ago

Yes, I noticed this difference at first, but I found that in the paper ViT-L is ViT-L/14 rather than ViT-L/16. Since the code makes it easy to interpolate the patch embedding from 16 to 14, I used the interpolated patch size (16 -> 14). Could this be the key to the performance degradation? I will try it experimentally. Thank you for your reply.

mranzinger commented 5 days ago

Yeah, it's very possible that interpolating to patch size 14 is causing enough of an issue to degrade results. The choice between 14 and 16 is tricky for our models. I personally prefer 16 because it's a friendlier number for compute. From a modeling standpoint, this choice mostly affects what we call "effective resolution", which is essentially the number of patch rows and columns. So a ViT-L/14 at resolution 448 is roughly identical to a ViT-L/16 at resolution 512: both have an effective resolution of 32x32.
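To make the arithmetic concrete (a throwaway helper, not repo code):

```python
def effective_resolution(input_px: int, patch_size: int) -> int:
    """Number of patch rows/columns a ViT produces at a given input size."""
    return input_px // patch_size

assert effective_resolution(448, 14) == 32  # DINOv2 ViT-L/14 @ 448px
assert effective_resolution(512, 16) == 32  # RADIOv2.5-L (ViT-L/16) @ 512px
```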

Because you're using DINO-L/14, you'll want to account for the effective resolution when comparing against RADIOv2.5-L/16 by scaling the input resolution by 16/14 before handing it to RADIO, as in the sketch below. In doing so, you'll get an identical number of output patches from the two models, and each input patch will cover exactly the same image content, making the comparison fair.
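Something like this (a hypothetical helper, not repo code; rounding to a multiple of the patch size keeps the token grid integral):

```python
def radio_resolution(dino_px: int, dino_patch: int = 14,
                     radio_patch: int = 16) -> int:
    """Map a DINOv2/14 input size to the RADIO/16 size with the same
    effective resolution (patch grid), rounded to a patch multiple."""
    scaled = round(dino_px * radio_patch / dino_patch)
    return (scaled // radio_patch) * radio_patch

print(radio_resolution(448))  # 512 -> both models see a 32x32 patch grid
```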