Jiawei-Yang / Denoising-ViT

This is the official code release for our work, Denoising Vision Transformers.
MIT License

How can I generate your visualizations? #6

Open mranzinger opened 8 months ago

mranzinger commented 8 months ago

Hello and thank you for your excellent work!

My group recently released AM-RADIO, and I'd really like to run the same set of experiments to see if distilling from multiple ViTs with different training regimes amplifies, suppresses, or changes the artifacts. So, the first step would be to generate your Figure 1 Original images, and then to explore training the denoiser on top of it.

Could you point me to where/how to run your visualization scripts? I poked around in a couple places and couldn't find the magic command.

Thanks!

Jiawei-Yang commented 8 months ago

Hi there,

Thanks for your interest in our work! I had noticed your RADIO work before and I like it very much!

Currently, we don't have a dedicated script for visualization. The original feature maps of Figure 1 are cropped from our visualization logs, similar to this image: https://github.com/Jiawei-Yang/Denoising-ViT/raw/main/demo/demo_outputs/dinov2_base_cat.jpg

To generate this visualization you can refer to sample_scripts/stage1_denoising.sh, but you will have to modify the checkpoint-loading part, which is at https://github.com/Jiawei-Yang/Denoising-ViT/blob/adeff838169152a6e55bd8e3d7f1f1befe006ff2/DenoisingViT/vit_wrapper.py#L104

Another reference for visualizing PCA maps is at: https://github.com/Jiawei-Yang/Denoising-ViT/blob/adeff838169152a6e55bd8e3d7f1f1befe006ff2/denoise_single_image.py#L105

The last reference will likely be the most useful one; it is largely independent of the rest of the denoising code, so it can easily be copy-pasted into your own codebase.

Best, Jiawei

mranzinger commented 8 months ago

Thank you Jiawei! I will try to give this a go this week.

mranzinger commented 7 months ago

I finally got around to implementing this based on your code. I ran it on the following models:

- DFN CLIP at 378px
- DINOv2 at 224, 378, 518px
- RADIOv1 at 378px
- RADIOv2 at 432, 512, 1024px

Results in subsequent messages.

mranzinger commented 7 months ago

DFN CLIP at 378px

*(16 visualization images attached)*

mranzinger commented 7 months ago

DINOv2-g-reg at 224px

*(16 visualization images attached)*

mranzinger commented 7 months ago

DINOv2-g-reg at 378px

*(16 visualization images attached)*

mranzinger commented 7 months ago

DINOv2-g-reg at 518px

*(16 visualization images attached)*

mranzinger commented 7 months ago

RADIOv1 at 378px

*(16 visualization images attached)*

mranzinger commented 7 months ago

RADIOv2 at 432px

*(16 visualization images attached)*

mranzinger commented 7 months ago

RADIOv2 at 512 and 1024px

Looks like GitHub isn't allowing me to upload more. All of the visualizations can be found here: https://drive.google.com/drive/folders/1xsmcT515n78LALV0mm1kA4hGT63zZM12?usp=sharing

I think the visualizations at 1024px are rather fascinating: it appears that RADIOv2 switches to "SAM mode" at that resolution, paying close attention to contours, and object parts are more clearly encoded.

Jiawei-Yang commented 7 months ago

Amazing visualizations! Thanks for providing these!

Re SAM --- Yes! At 1024px, the patterns I found were exactly the SAM patterns. I visualized SAM at the very beginning of the project, at 518px resolution. Here is what I got at that time: *(image attached)*

But we didn't include SAM in our final released codebase and paper because it's not a standard ViT and requires more hacks to the timm package to make it compatible with other functionalities.


RADIOv1 seems to be very noisy, while v2 is much cleaner. I will take a detailed look through them in a week. BTW, the PCA visualizations look over-blurred to me. I guess you first upsample the features and then do PCA? I think doing PCA at low resolution and then upsampling the color map will give you crisper results.

mranzinger commented 7 months ago

> BTW, the PCA visualizations look over-blurred to me. I guess you first upsample the features and then do PCA? I think doing PCA at low resolution and then upsampling the color map will give you crisper results.

So I would take an image and interpolate it to the model resolution (e.g. 378, 432, etc.). From the model, we get $(H/p,W/p)$ spatial features, with $p$ being the patch size and $H$, $W$ the interpolated resolution. I then computed the PCA on the $(H/p,W/p)$ maps (i.e., I didn't upsample first). Finally, I upsample the PCA maps back to the model's input resolution (i.e., upsample by a factor of $p$).
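In code, the PCA step I'm describing is roughly the following sketch (`pca_feature_map` is an illustrative helper, not code from the repo; scikit-learn is assumed for the PCA):

```python
import torch
from sklearn.decomposition import PCA

def pca_feature_map(feats: torch.Tensor, n_components: int = 3) -> torch.Tensor:
    """Project an (h, w, C) patch-feature grid to (h, w, 3) for display.

    `feats` is the low-resolution (H/p, W/p, C) feature map; the PCA is
    computed at this resolution, before any upsampling.
    """
    h, w, c = feats.shape
    flat = feats.reshape(-1, c).cpu().numpy()          # (h*w, C)
    proj = PCA(n_components=n_components).fit_transform(flat)
    proj = torch.from_numpy(proj).reshape(h, w, n_components)
    # Normalize each PCA channel to [0, 1] so it can be shown as RGB.
    lo = proj.amin(dim=(0, 1))
    hi = proj.amax(dim=(0, 1))
    return (proj - lo) / (hi - lo + 1e-8)
```

The resulting (h, w, 3) color map is then upsampled by a factor of $p$ back to the input resolution.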

Is that the process you're recommending, or is there a better algorithm?

Jiawei-Yang commented 7 months ago

Ah, then upsampling the color map using nearest interpolation becomes the key? Bilinear interpolation would result in a blurry map?
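Something like this sketch, i.e. nearest vs. bilinear on the low-resolution color map (illustrative only; `upsample_color_map` is a hypothetical helper assuming a PyTorch `(h, w, 3)` tensor):

```python
import torch
import torch.nn.functional as F

def upsample_color_map(color_map: torch.Tensor, patch_size: int,
                       mode: str = "nearest") -> torch.Tensor:
    """Upsample an (h, w, 3) PCA color map back to pixel resolution.

    Nearest-neighbor keeps each patch a crisp solid block; bilinear
    smears neighboring patch colors together, which reads as blur.
    """
    x = color_map.permute(2, 0, 1).unsqueeze(0)        # (1, 3, h, w) for interpolate
    x = F.interpolate(x, scale_factor=patch_size, mode=mode)
    return x.squeeze(0).permute(1, 2, 0)               # (H, W, 3)
```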

mranzinger commented 7 months ago

Yep. I suppose so. I'll have to spend some more time with this.

mranzinger commented 7 months ago

Some more fun with these visualizations. Top-left is the original image, top-right is RADIO's backbone representation, bottom-left is RADIO's SAM head, and bottom-right is SAM.

I'm learning quite a bit through your work (thanks!). In particular, I think RADIOv2 has less noise than the other encoders; however, when looking at images with large regions of roughly uniform color (e.g. the gymnast), the position-encoding noise becomes quite apparent. So next up is to get your denoiser working on RADIO to see how much further the output features can be refined.

*(16 visualization images attached)*