The image_processor and visual_encoder for the LLaVA model can be replaced in these lines:
https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L60-L63
https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L83-L85
The model will automatically build these modules from their type and the arguments that follow. Alternatively, you can modify the code in LLaVAModel to build your own model: https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/model/llava.py#L35-L36
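For illustration, here is a minimal sketch (not the actual xtuner/mmengine build code) of how such a config dict becomes an object: the type entry is treated as the callable and the remaining keys are passed as its keyword arguments.
from transformers import CLIPVisionModel

# Sketch only: the real construction is handled by the xtuner/mmengine builder.
cfg = dict(
    type=CLIPVisionModel.from_pretrained,
    pretrained_model_name_or_path='openai/clip-vit-large-patch14-336')
builder = cfg.pop('type')        # the callable used to construct the module
visual_encoder = builder(**cfg)  # the remaining keys are passed as kwargs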
You can also modify ProjectorConfig and ProjectorModel to get your own architecture. Then you should expand the LLM's embed and lm_head layers; you can do this after these lines.
Also, please do not forget to fine-tune the embed and lm_head layers if you add and use custom tokens.
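As a rough sketch (plain Hugging Face APIs, not xtuner-specific, and the added token name is only a placeholder), expanding the embed and lm_head after adding custom tokens could look like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('../internlm-chat-7b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('../internlm-chat-7b', trust_remote_code=True)

# Add a placeholder custom token and grow the embedding matrix to match.
num_added = tokenizer.add_tokens(['<MY_TOKEN>'], special_tokens=True)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))  # expands embed_tokens and lm_head

# Keep the expanded layers trainable during fine-tuning.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True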
Great, thanks! I will get back to this issue if I have other questions.
Hello, I tried to change the vision model from CLIP to DINOv2, but I get this error:
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
This error occurred at line 365 of modeling_dinov2.py (link), i.e.,
embedding_output = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
Here, pixel_values is torch.float32, while the weights of the CNN in self.embeddings are torch.bfloat16.
An interesting fact is that when I train the original internlm-llava model (with the OpenAI CLIP), line 841 of modeling_clip.py (link) contains
hidden_states = self.embeddings(pixel_values)
Here, pixel_values is also torch.float32 and the weights of the CNN in self.embeddings are also torch.bfloat16, but I never get the error! (You can see this in the figure below.)
Do you have any clue about this problem?
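For what it is worth, the same mismatch can be reproduced with a tiny standalone snippet (my own minimal guess at a repro, not the actual call stack):
import torch
import torch.nn as nn

# bfloat16 conv weights fed a float32 input trigger the same kind of dtype error
conv = nn.Conv2d(3, 16, kernel_size=14, stride=14).to(torch.bfloat16)
pixel_values = torch.rand(1, 3, 224, 224)  # float32 by default
conv(pixel_values)                         # RuntimeError: Input type (float) and bias type (c10::BFloat16) ...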
How I modified the code:
I only modified llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py: I replaced CLIPImageProcessor with AutoImageProcessor and CLIPVisionModel with Dinov2Model from Hugging Face Transformers.
The full modified config:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig,
                          AutoImageProcessor, Dinov2Model,
                          CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = '../internlm-chat-7b'
visual_encoder_name_or_path = '../dinov2-large'
# Data
data_root = './'
data_path = data_root + 'blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'images'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 12 # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=llm_name_or_path,
    trust_remote_code=True,
    padding_side='right')

image_processor = dict(
    type=AutoImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True)
model = dict(
    type=LLaVAModel,
    freeze_llm=True,
    freeze_visual_encoder=True,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    visual_encoder=dict(
        type=Dinov2Model.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
    type=LLaVADataset,
    data_path=data_path,
    image_folder=image_folder,
    tokenizer=tokenizer,
    image_processor=image_processor,
    dataset_map_fn=llava_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    max_length=max_length,
    pad_image_to_square=False)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=llava_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        T_max=max_epochs,
        convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        image_processor=image_processor,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        evaluation_images=evaluation_images,
        system=SYSTEM,
        prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
Note that it is similar to issue 289. I use the command xtuner train $CFG --deepspeed deepspeed_zero2
@StarCycle Hi!
I find that CLIP's embedding layer automatically changes the dtype of pixel_values, in https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/clip/modeling_clip.py#L181-L182
So, maybe we can manually change the dtype of pixel_values in the DINOv2 model in the same way.
If it works after that, I think we can post a PR to transformers.
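For reference, the CLIP cast at the linked lines looks roughly like this (paraphrased, so double-check against the linked source):
target_dtype = self.patch_embedding.weight.dtype
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))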
@LZHgrla It does work!
I modified https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L119 to the following code:
target_dtype = self.patch_embeddings.projection.weight.dtype
embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))
I also modified https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L106 to:
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.to(dtype=torch.float32),
    scale_factor=(float(height / math.sqrt(num_positions)), float(width / math.sqrt(num_positions))),
    mode="bicubic",
    align_corners=False,
).to(dtype=target_dtype)
Now it works correctly on a single 4090 or on two 4090s. The GPU utilization is also fine.
Before posting a PR to huggingface, I have a stupid question... CLIPVisionModel and Dinov2Model are stored in torch.float32, and in the config file the torch_dtype of the LLM is torch.float16. Why are the parameters of the vision model torch.bfloat16?
After I transform the images to torch.bfloat16, the training works. Again, there should be an error, since the torch_dtype of the LLM is torch.float16.
Does xtuner automatically handle dtype conversion somewhere?
Sorry, the modification in https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L106 should be:
target_dtype = patch_pos_embed.dtype
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.to(dtype=torch.float32),
    scale_factor=(float(height / math.sqrt(num_positions)), float(width / math.sqrt(num_positions))),
    mode="bicubic",
    align_corners=False,
).to(dtype=target_dtype)
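A quick sanity check of the patch (a rough sketch; it assumes the locally modified modeling_dinov2.py and the local ../dinov2-large weights):
import torch
from transformers import Dinov2Model

model = Dinov2Model.from_pretrained('../dinov2-large').to(torch.bfloat16)
pixel_values = torch.rand(1, 3, 224, 224)  # float32, like the image processor output
with torch.no_grad():
    out = model(pixel_values)              # no dtype error once the casts are in place
print(out.last_hidden_state.dtype)         # torch.bfloat16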
@StarCycle Yes, xtuner will automatically handle the dtype. More precisely, the dtype depends on the optimizer.
The default optimizer wrapper is AmpOptimWrapper, so all models should be in float32 by default.
However, we use deepspeed to accelerate the training of llava, and the default deepspeed optimizer is a bf16/fp16 optimizer: https://github.com/InternLM/xtuner/blob/ceeb9be1191c7128169571b58e0ee221ea21c60f/xtuner/configs/deepspeed/deepspeed_zero2.json#L11-L17
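In other words (my rough understanding of the effect, not the exact deepspeed code path), the bf16 engine ends up holding the module's floating-point weights in bfloat16 regardless of the dtype they were loaded in, roughly equivalent to:
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)           # parameters are created in float32
print(layer.weight.dtype)         # torch.float32
layer = layer.to(torch.bfloat16)  # roughly what the bf16 optimizer does to module weights
print(layer.weight.dtype)         # torch.bfloat16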
Thank you! I have submitted a PR to the huggingface team.
I want to modify the code of internlm-llava-7b, such as:
How can I do this? I cannot find the source code of llama or llava in the xtuner repo.