InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

How to modify the code of internlm-llava-7b? #297

Closed StarCycle closed 9 months ago

StarCycle commented 10 months ago

I want to modify the code of internlm-llava-7b, such as:

  1. Change the vision encoder from CLIP ViT to DINO v2. They share the same architecture (ViT) but have different weights.
  2. Modify the code of projector, e.g., adding layers to the projector.
  3. Add custom token to the llama tokenizer.

How can I do these? I cannot find the source code of llama or llava in xtuner repo.

LZHgrla commented 9 months ago
  1. XTuner builds the image_processor and visual_encoder for the LLaVA model, and you can replace them in these lines: https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L60-L63 https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L83-L85 The model will automatically build these modules from the given type and the arguments that follow it.

Or, you can modify the code in LLaVAModel to build your own model: https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/model/llava.py#L35-L36

  2. Here is the code of the projector; you can modify ProjectorConfig and ProjectorModel to get your own architecture.

https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/model/llava.py#L40-L45

  3. To replace the tokenizer, you should first modify the config to use the new tokenizer, as in https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L54-L58

Then, you should expand the LLM's embed and lm_head layers; you can do this after these lines (a sketch is shown below):

https://github.com/InternLM/xtuner/blob/9664051a963623b7d78bc3dc2db65eb1ee73482b/xtuner/model/llava.py#L33-38
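
For example, a minimal sketch of adding tokens and expanding the embeddings with the standard Hugging Face APIs (the token strings and the standalone setup here are just placeholders, not xtuner's actual code):

from transformers import AutoModelForCausalLM, AutoTokenizer

llm_name_or_path = 'internlm/internlm-chat-7b'
tokenizer = AutoTokenizer.from_pretrained(llm_name_or_path, trust_remote_code=True)
llm = AutoModelForCausalLM.from_pretrained(llm_name_or_path, trust_remote_code=True)

# register the new (placeholder) tokens with the tokenizer
num_added = tokenizer.add_tokens(['<my_token_0>', '<my_token_1>'], special_tokens=True)

# expand the input embeddings (and the output head, if untied) to the new vocab size
if num_added > 0:
    llm.resize_token_embeddings(len(tokenizer))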

LZHgrla commented 9 months ago

Also, please do not forget to fine-tune the embed and lm_head layers if you add and use custom tokens.
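
For example, something along these lines after the model is built, assuming llm is the language model module; the parameter names are model-specific (e.g. 'embed_tokens'/'lm_head' for LLaMA-style models), so treat this as a sketch:

# make sure the embedding and output head stay trainable so the new tokens can be learned
for name, param in llm.named_parameters():
    if any(key in name for key in ('embed_tokens', 'lm_head')):
        param.requires_grad = True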

StarCycle commented 9 months ago

Thanks a lot! I will get back to this issue if I have other questions.

StarCycle commented 9 months ago

Hello, I tried to change the vision model from CLIP to DINOv2, but I get this error:

RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

This error occurred at line 365 of modeling_dinov2.py (link), i.e.,

embedding_output = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)

Here, pixel_values is torch.float32, while the weights of the convolution in self.embeddings are torch.bfloat16.
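
The mismatch can be reproduced in isolation with something like this (a minimal sketch, not xtuner or transformers code):

import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.bfloat16)  # bfloat16 weights, like the DINOv2 patch embedding
pixel_values = torch.randn(1, 3, 32, 32)                        # float32 input, like the raw pixel values
conv(pixel_values)  # raises a dtype-mismatch RuntimeError like the one above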

An interesting fact is that when I train the original internlm-llava model (with the OpenAI CLIP encoder), line 841 of modeling_clip.py (link) is

hidden_states = self.embeddings(pixel_values)

The pixel_values are also torch.float32 and the weights of the convolution in self.embeddings are also torch.bfloat16, but I never get the error! (You can see this in the figure below.)

[screenshot: original CLIP]

Do you have any clue about this problem?

How I modified the code:

I only modified llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py: I replaced CLIPImageProcessor with AutoImageProcessor and CLIPVisionModel with Dinov2Model from Hugging Face Transformers.

The full modified config:

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, 
                          AutoImageProcessor, Dinov2Model, 
                          CLIPImageProcessor, CLIPVisionModel)

from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
llm_name_or_path = '../internlm-chat-7b'
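# vision encoder switched from the original CLIP ViT checkpoint to a local DINOv2-large checkpoint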
visual_encoder_name_or_path = '../dinov2-large'

# Data
data_root = './'
data_path = data_root + 'blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'images'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = int(2048 - (336 / 14)**2)

# Scheduler & Optimizer
batch_size = 12  # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']

#######################################################################
#            PART 2  Model & Tokenizer & Image Processor              #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=llm_name_or_path,
    trust_remote_code=True,
    padding_side='right')

image_processor = dict(
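    # AutoImageProcessor resolves to the image processor registered for the DINOv2 checkpoint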
    type=AutoImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True)

model = dict(
    type=LLaVAModel,
    freeze_llm=True,
    freeze_visual_encoder=True,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
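    # visual encoder is now built from Dinov2Model instead of the original CLIPVisionModel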
    visual_encoder=dict(
        type=Dinov2Model.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
llava_dataset = dict(
    type=LLaVADataset,
    data_path=data_path,
    image_folder=image_folder,
    tokenizer=tokenizer,
    image_processor=image_processor,
    dataset_map_fn=llava_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    max_length=max_length,
    pad_image_to_square=False)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=llava_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        T_max=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        image_processor=image_processor,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        evaluation_images=evaluation_images,
        system=SYSTEM,
        prompt_template=prompt_template)
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 100 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

StarCycle commented 9 months ago

Note that this is similar to issue 289. I use the command xtuner train $CFG --deepspeed deepspeed_zero2.

LZHgrla commented 9 months ago

@StarCycle Hi!

I found that CLIP's embedding layer automatically casts pixel_values to the weight dtype, in https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/clip/modeling_clip.py#L181-L182
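
For reference, those lines look roughly like this (quoting from memory, so please check the linked source):

target_dtype = self.patch_embedding.weight.dtype
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))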

So, maybe we can manually cast the dtype of pixel_values in the DINOv2 model in the same way.

If it works after that, I think we can post a PR to transformers.

StarCycle commented 9 months ago

@LZHgrla It does work!

I modified https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L119 to the following code:

target_dtype = self.patch_embeddings.projection.weight.dtype
embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))

I also modified https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L106 to:

patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.to(dtype=torch.float32),
    scale_factor=(float(height / math.sqrt(num_positions)), float(width / math.sqrt(num_positions))),
    mode="bicubic",
    align_corners=False,
).to(dtype=target_dtype)

Now it works correctly on a single 4090 or on two 4090s:

[screenshot: training log]

The GPU utilization is also fine:

[screenshot: GPU utilization]

Before posting a PR to Hugging Face, I have a stupid question... CLIPVisionModel and Dinov2Model are stored in torch.float32, and in the config file the torch_dtype of the LLM is torch.float16. Why are the parameters of the vision model torch.bfloat16?

After I cast the images to torch.bfloat16, the training works. Again, I would expect an error, since the torch_dtype of the LLM is torch.float16.

Does xtuner automatically handle dtype transformation somewhere?

StarCycle commented 9 months ago

Sorry, the modification at https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L106 should be:

target_dtype = patch_pos_embed.dtype
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.to(dtype=torch.float32),
    scale_factor=(float(height / math.sqrt(num_positions)), float(width / math.sqrt(num_positions))),
    mode="bicubic",
    align_corners=False,
).to(dtype=target_dtype)

LZHgrla commented 9 months ago

@StarCycle Yes, xtuner will automatically handle the dtype. More precisely, the dtype depends on the optimizer.

The default optimizer wrapper is AmpOptimWrapper

https://github.com/InternLM/xtuner/blob/ceeb9be1191c7128169571b58e0ee221ea21c60f/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py#L113-L120

So, the whole model should be in float32 by default.

However, we use DeepSpeed to accelerate the training of LLaVA, and the default DeepSpeed optimizer is a bf16/fp16 optimizer: https://github.com/InternLM/xtuner/blob/ceeb9be1191c7128169571b58e0ee221ea21c60f/xtuner/configs/deepspeed/deepspeed_zero2.json#L11-L17
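
If you want to verify which dtype each submodule actually ends up with at runtime, a quick sketch (assuming the built LLaVAModel exposes its parts as model.llm and model.visual_encoder, matching the config keys):

from collections import Counter

import torch

def dtype_summary(module: torch.nn.Module) -> dict:
    # count parameters per dtype to see what AMP/DeepSpeed actually cast
    counts = Counter()
    for param in module.parameters():
        counts[param.dtype] += param.numel()
    return dict(counts)

print(dtype_summary(model.visual_encoder))
print(dtype_summary(model.llm))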

StarCycle commented 9 months ago

Thank you! I submitted a PR to the Hugging Face team.