Closed echo840 closed 3 months ago
Hello, in the experiments we published in our paper, we used an image pre-processor that resizes the longest edge to 432, adjusting the shortest edge to keep the original image aspect ratio, followed by a crop along the shortest edge to the nearest multiple of the patch size. This should be mostly equivalent to expand2square followed by a resize to 432x432, only without the padding along the shortest dimension. This requires support for variable-size, non-square images.
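The resize-then-crop arithmetic described above can be sketched as follows. This is a minimal illustration, not the pre-processor used in the paper: the function name is hypothetical, and the patch size of 16 is an assumption taken from the multiple-of-16 requirement mentioned later in this thread.

```python
def resized_and_cropped_size(width, height, longest_edge=432, patch_size=16):
    """Return the (width, height) after resizing the longest edge to
    `longest_edge` (keeping the aspect ratio) and cropping each side
    down to a multiple of `patch_size` (assumed to be 16)."""
    scale = longest_edge / max(width, height)
    new_w = max(patch_size, round(width * scale))
    new_h = max(patch_size, round(height * scale))
    # Drop the remainder so both sides are multiples of the patch size.
    return new_w - new_w % patch_size, new_h - new_h % patch_size


# An 800x600 image is resized to 432x324, then cropped to 432x320.
print(resized_and_cropped_size(800, 600))
```

With expand2square, the same image would instead be padded to a square before resizing, so the shortest side would carry padding rather than losing a few pixel rows.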
Are you using image_aspect_ratio == 'pad'? I suspect that otherwise we might end up cropping actual pixels at the edges along the longest dimension.
Thank you for your response! Yes, during finetuning, we used image_aspect_ratio == 'pad'. I'm now trying the experiment according to your instructions. Thank you very much!
Hello, RADIOv2 still scores lower than SigLIP. I would like to know whether I have missed any operations in the feature extraction code below. Do I need to extract features from the second-to-last layer of the vision tower, as LLaVA does? Have I overlooked a normalization operation? Or do I need to add the summary token?
# Feature extract
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, CLIPImageProcessor

class RadioVisionTower(nn.Module):
    # NOTE: self.device and self.dtype are assumed to be defined as
    # properties elsewhere in the class (not shown in this snippet).
    def __init__(self, vision_tower, args, delay_load=False):
        super().__init__()
        self.is_loaded = False
        self.vision_tower_name = vision_tower
        if not delay_load:
            self.load_model()
        else:
            self.cfg_only = AutoConfig.from_pretrained(self.vision_tower_name, trust_remote_code=True)

    def load_model(self):
        self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
        self.image_processor.do_resize = True
        self.image_processor.crop_size = self.image_processor.size
        self.vision_tower = AutoModel.from_pretrained(self.vision_tower_name, trust_remote_code=True)
        self.vision_tower.requires_grad_(False)
        self.is_loaded = True

    @torch.no_grad()
    def forward(self, images):
        if type(images) is list:
            image_features = []
            for image in images:
                # The model returns a (summary, spatial_features) tuple, so unpack
                # it first and cast only the spatial features.
                _, image_feature = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0))
                image_features.append(image_feature.to(self.dtype))
        else:
            _, image_features = self.vision_tower(images.to(device=self.device, dtype=self.dtype))
            image_features = image_features.to(self.dtype)
        return image_features
Hello, I have not worked with the HuggingFace model in LLaVA; however, you should equivalently be able to use the TorchHub model. In my LLaVA integration I used standard normalization instead of the built-in input conditioner (i.e. I make a call to vision_tower.make_preprocessor_external()).
This is my code (pardon the untidiness):
from argparse import Namespace
import os
import torch
import torch.nn as nn
from typing import Any, Dict
import warnings
from transformers import CLIPVisionConfig
from transformers import CLIPImageProcessor, SamImageProcessor
from PIL import Image
import numpy as np
class RADIOVisionTower(nn.Module):
    """
    Vision Tower for the RADIO model.

    Args:
        vision_tower (str): Vision tower name. This is passed on
            the command line with the `--vision_tower` argument.
            The string is expected in the pattern of:
            `radio:<image_size>:<checkpoint_or_version>:<extra_config>`.
            Where <extra_config> is a comma-separated list of key=value pairs.
            <image_size> is the image resolution.
            <checkpoint> is a TorchHub version or path to a checkpoint.
        args (Namespace): Arguments.
        delay_load (bool): Delay loading the model.
    """

    def __init__(self, vision_tower, args, delay_load=False):
        """Initialization Routine."""
        super().__init__()

        self.vision_tower_name = vision_tower[len("radio:"):]
        config_items = self.vision_tower_name.split(":")
        self.image_sizes = [int(x) for x in config_items[0].split(",")]
        if len(self.image_sizes) == 0:
            raise ValueError("Expected more than zero images sizes!")
        self.image_size = self.image_sizes[0]
        self.do_center_crop = args.mm_im_crop

        self.vision_tower_checkpoint = config_items[1]

        extra_config = {}
        if len(config_items) > 2:
            # Parse extra config items. These are provided as a comma-separated list
            # of key=value pairs.
            extra_config_items = config_items[2].split(",")
            for item in extra_config_items:
                key, value = item.split("=")
                extra_config[key] = value

        self.adaptor_name = extra_config.get("adaptor", "backbone")
        self.fuse_adaptor_with_backbone = eval(extra_config.get("fuse_adaptor_with_backbone", "False"))
        self.skip_layer_norm = eval(extra_config.get("skip_layer_norm", "False"))

        self.is_loaded = False

        if not delay_load:
            self.load_model()
        else:
            # FIXME: This is a hack to avoid having to load the config from the checkpoint.
            hidden_size = self.get_hidden_size()
            patch_size = 16

            self.cfg_only = CLIPVisionConfig(
                **{
                    "hidden_size": hidden_size,
                    "image_size": self.image_size,
                    "model_type": "radio_vision_model",
                    "num_attention_heads": None,
                    "num_channels": 3,
                    "num_hidden_layers": None,
                    "patch_size": patch_size,
                }
            )
    def get_hidden_size(self):
        if self.adaptor_name == "openai_clip":
            hidden_size = 1024
        elif self.adaptor_name == "clip":
            hidden_size = 1280
        elif self.adaptor_name == "rtx-translate":
            hidden_size = 2048
        elif self.adaptor_name == "backbone":
            hidden_size = 1280
        else:
            raise ValueError(f"Unknown adaptor name: {self.adaptor_name}")

        if self.fuse_adaptor_with_backbone:
            hidden_size += 1280

        return hidden_size

    @property
    def hidden_size(self):
        return self.get_hidden_size()
    def load_model(self):
        crop_size = {'height': self.image_size, 'width': self.image_size}

        if self.do_center_crop:
            self.image_processor = CLIPImageProcessor(
                size={"shortest_edge": self.image_size},
                crop_size=crop_size,
                do_center_crop=self.do_center_crop,
                do_normalize=True,
            )
        else:
            self.image_processor = SamImageProcessor(
                size={"longest_edge": self.image_size},
                pad_size={'height': self.image_size, 'width': self.image_size},
                do_pad=False,
                do_normalize=True,
            )
            # Add a crop_size attribute to the image processor, since the
            # train.py script needs this to generate fake images of zeros
            # with the right size, when the sample does not have an
            # associated image.
            self.image_processor.crop_size = crop_size

        # For compatibility with CLIP Image Processor: the data loader uses width/height to
        # create dummy blank images for samples that don't have an image.
        self.image_processor.crop_size = {"width": self.image_size, "height": self.image_size}

        checkpoint_path_or_version = self.vision_tower_checkpoint

        # NOTE: do a lazy import of Timm to avoid issues with
        # DeepSpeed's ZeRO-3.
        from timm.models.vision_transformer import VisionTransformer

        self.vision_tower = torch.hub.load('NVlabs/RADIO',
                                           'radio_model',
                                           version=checkpoint_path_or_version,
                                           progress=True,
                                           adaptor_names=self.adaptor_name if self.adaptor_name != "backbone" else None)

        if isinstance(self.vision_tower.model, VisionTransformer):
            hidden_size = self.vision_tower.model.embed_dim
        else:
            raise ValueError(f"Unknown model type: {self.vision_tower}")

        # Override hidden size for OpenAI CLIP.
        hidden_size = self.get_hidden_size()

        if hasattr(self.vision_tower.model, "patch_generator"):
            patch_gen = self.vision_tower.model.patch_generator
            # Cropped Positional Embedding (CPE) case.
            patch_size = patch_gen.patch_size
        else:
            # Standard ViT case.
            patch_size = self.vision_tower.model.patch_embed.patch_size[0]

        self.vision_tower.config = CLIPVisionConfig(
            **{
                "hidden_size": hidden_size,
                "image_size": self.image_size,
                "model_type": "radio_vision_model",
                "num_attention_heads": None,
                "num_channels": 3,
                "num_hidden_layers": None,
                "patch_size": patch_size,
            }
        )

        self.vision_tower.make_preprocessor_external()
        self.vision_tower.eval()
        self.vision_tower.requires_grad_(False)

        self.is_loaded = True
        self._to_dtype = None

        if self.skip_layer_norm:
            self.vision_tower.model.norm = torch.nn.Identity()
    def to(self, *args, **kwargs):
        # Prevent casting the RADIO model's weights
        kwargs = dict(kwargs)
        self._to_dtype = kwargs.pop('dtype', None)
        super().to(*args, **kwargs)
        # Return self so that chained calls behave like nn.Module.to().
        return self

    def train(self, mode=True):
        """Intercept call."""
        # Drop a warning if mode is True.
        if mode:
            warnings.warn("RADIOEncoder is always in eval mode.")
        # Return self so that chained calls behave like nn.Module.train().
        return self
    @torch.no_grad()
    def get_features(self, x: torch.Tensor):
        output = self.vision_tower(x)
        if isinstance(output, dict):
            _, features = output[self.adaptor_name]
            if self.fuse_adaptor_with_backbone:
                _, backbone_features = output["backbone"]
                features = torch.cat([features, backbone_features], dim=2)
        else:
            _, features = output
        return features
    @torch.no_grad()
    def forward(self, images: torch.Tensor):
        """Main forward pass."""
        input_shape = images.shape

        x = images

        # Add a batch dimension if necessary.
        if len(input_shape) == 3:
            x = x.unsqueeze(0)

        # Convert the input to the model's dtype (we assume
        # that the model only has one dtype for all parameters).
        param0 = next(self.vision_tower.parameters())
        x = x.to(dtype=param0.dtype, device=param0.device)

        patch_size = self.vision_tower.config.patch_size

        if self.do_center_crop:
            # Crop the input to a multiple of patch size.
            _, _, H, W = x.shape
            H = H - (H % patch_size)
            W = W - (W % patch_size)
            x = x[:, :, :H, :W]
        else:
            # Pad to nearest multiple of patch size
            _, _, H, W = x.shape
            H = H + (patch_size - (H % patch_size)) % patch_size
            W = W + (patch_size - (W % patch_size)) % patch_size
            x = nn.functional.pad(x, (0, W - x.shape[3], 0, H - x.shape[2]), mode="constant", value=0)

        features = self.get_features(x)  # B, T, C

        B, _, H, W = x.shape
        _, _, C = features.shape

        # Remove the batch dimension if we added it.
        if len(input_shape) == 3:
            features = features.squeeze(0)

        # Cast back to the input's dtype.
        features = features.to(images.dtype)

        assert features.shape[-1] == self.get_hidden_size()

        return features
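The crop and pad branches of forward() above differ only in how they round each spatial dimension to a multiple of the patch size. The following sketch isolates that arithmetic so it can be checked without loading the model; the helper names are hypothetical and the default patch size of 16 is taken from the delay_load branch of the code above.

```python
def crop_to_multiple(size, patch_size=16):
    """Round `size` down to a multiple of `patch_size` (the crop branch)."""
    return size - (size % patch_size)


def pad_to_multiple(size, patch_size=16):
    """Round `size` up to a multiple of `patch_size` (the pad branch).
    The outer modulo keeps already-aligned sizes unchanged."""
    return size + (patch_size - (size % patch_size)) % patch_size


for size in (432, 433, 447):
    print(size, crop_to_multiple(size), pad_to_multiple(size))
```

Cropping discards up to patch_size - 1 rows or columns of real pixels, while padding adds zero-valued rows or columns at the bottom and right instead.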
Thank you! I'm also curious about the settings of "extra_config" and "config_items". Should the following parameters be set to True or False?
self.adaptor_name = extra_config.get("adaptor", "backbone")
self.fuse_adaptor_with_backbone = eval(extra_config.get("fuse_adaptor_with_backbone", "False"))
self.skip_layer_norm = eval(extra_config.get("skip_layer_norm", "False"))
Hi, in my standard configuration the adaptor is backbone, and fuse_adaptor_with_backbone and skip_layer_norm are both False.
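To make the defaults concrete, here is a minimal sketch of how a `--vision_tower` string in the `radio:<image_size>:<checkpoint_or_version>:<extra_config>` pattern resolves to these settings. It mirrors the parsing in `__init__` above but is not the author's code: the function name is hypothetical, the version string `radio_v2.1` is only an illustrative value, and booleans are parsed by string comparison rather than with `eval`.

```python
def parse_radio_vision_tower(spec):
    """Parse a 'radio:<image_size>:<checkpoint_or_version>:<extra_config>' string."""
    if not spec.startswith("radio:"):
        raise ValueError(f"Expected a 'radio:' prefix, got: {spec}")
    items = spec[len("radio:"):].split(":")
    image_sizes = [int(x) for x in items[0].split(",")]
    extra = {}
    if len(items) > 2:
        # Extra config is a comma-separated list of key=value pairs.
        for item in items[2].split(","):
            key, value = item.split("=")
            extra[key] = value

    def as_bool(value):
        # Assumption: only the literal strings "True"/"False" are used.
        return value.strip().lower() == "true"

    return {
        "image_size": image_sizes[0],
        "checkpoint": items[1],
        "adaptor": extra.get("adaptor", "backbone"),
        "fuse_adaptor_with_backbone": as_bool(extra.get("fuse_adaptor_with_backbone", "False")),
        "skip_layer_norm": as_bool(extra.get("skip_layer_norm", "False")),
    }


# With no extra config, the standard configuration described above is used.
print(parse_radio_vision_tower("radio:432:radio_v2.1"))
```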
Thank you for your prompt response and your great work!
Hello, have you been able to get RADIO to perform well in your VLM setup?
I'm sorry, but to be honest, I can't achieve better results than SigLIP under the same settings. SigLIP has a resolution of 384, while RADIO's resolution is dynamic (with the maximum size set to 1280). To save time, we use Qwen2 0.5B as the LLM, and we also add some OCR data such as DocVQA and TextVQA. Otherwise, the experiments use the same settings.
Hello, our RADIOv2.5 very much improves VLM metrics, see the release notes at the root of this repo. Would you like to try it?
Hello, the results of RADIOv2.5 are indeed quite impressive. I'm curious if RADIOv2.5 supports dynamic resolution. Given that different tasks and images may require different resolution settings, recent VLMs, such as the one detailed in this paper, have adopted the strategy of splitting the original image to ensure the input image resolution is close to the original, achieving very good results. Since RADIOv2.5 naturally possesses the ability to support arbitrary resolutions, I'm wondering if it's possible to use only a single instance of RADIOv2.5 to support dynamic resolution. Additionally, does RADIOv2.5 have a maximum resolution limit of 768? If it can support larger resolutions, using RADIOv2.5 as a visual encoder might yield better performance on document images (DocVQA). Thank you for your great work!
Yes, the RADIOv2.5 family of models supports dynamic resolution. Indeed, using RADIO it wouldn't be necessary to tile. It also isn't necessary that the image is square. The only requirement is that each dimension is a multiple of 16.
If you check out the tech report, you'll see that the model does well all the way up to 2048px. It can go even higher, although we haven't spent much time assessing it.
Thank you for your reply. I tried the dynamic resolution setting on RADIOv2.1, but found the performance to be poor. I'm unsure whether this has improved in RADIOv2.5.
Yes, fixing "mode switching" is one of the major improvements in the latest release. It was probably the primary reason you were seeing weird results with dynamic resolution and RADIOv2.1. Definitely give it a try if you have the chance. The gist is that RADIOv2.1 and earlier behaved differently at resolutions below ~720px than above it: the representations would change dramatically around that threshold.
This no longer happens with the new models, and we demonstrate how increasing the resolution from 432 up to 768 dramatically improves our LLaVA metrics. Depending on the language model, you could go even higher for even better results (particularly for OCR tasks).
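When weighing higher resolutions against the language model's context budget, it helps to see how the number of visual tokens grows. The sketch below assumes one token per 16x16 patch (summary tokens excluded) and that every patch token is passed to the language model, as in a LLaVA-style setup; the function name is hypothetical.

```python
def num_patch_tokens(height, width, patch_size=16):
    """Number of patch tokens for an input whose sides are multiples of patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("Each dimension must be a multiple of the patch size.")
    return (height // patch_size) * (width // patch_size)


for side in (432, 768, 2048):
    print(side, num_patch_tokens(side, side))

# Non-square inputs are counted the same way.
print(num_patch_tokens(320, 432))
```

Under these assumptions, going from 432 to 768 raises the count from 729 to 2304 tokens per image, which is why the achievable resolution depends on the language model.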
Thank you for your response. I am very willing to give RADIOv2.5 a try.
Is there any good news?
Hello, thank you for your great work! We are currently exploring the use of RADIO as a vision encoder for vision-language models. In our setup, we use SigLIP and RADIOv2 as the vision encoders, with Phi-2 as the language model. The results obtained are as follows:
Both runs use the same data and configuration; the only difference is the vision encoder. Is it normal to observe worse performance with RADIOv2 than with SigLIP?
Could you give me some suggestions?