OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

Some issues encountered while finetuning the model #223

Closed · HuangZiy closed this issue 10 months ago

HuangZiy commented 10 months ago

I'm new to this field and currently following the tutorial to finetune, and I have some questions about the results. I'd appreciate some guidance from the experts.

Background:

Classifying roughly 40-50 toy characters

Planned approach:

Encode the image together with ~50 text labels, normalize, and take the label with the maximum score to identify which toy it is (see the inference sketch at the end of this section)

Training dataset:

30-50 images per label × 48 labels ≈ 1700 images (the image backgrounds are essentially identical)

Configuration and parameters:

Single GPU with 32 GB of VRAM
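
For reference, here is a minimal sketch of the inference I have in mind, following the quickstart API in this repo's README; the label strings, image path and checkpoint path are placeholders, and loading my finetuned weights this way assumes the saved state_dict keys carry the usual "module." prefix from DDP training:

import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build ViT-B-16 + RoBERTa-wwm-ext-base-chinese and its preprocessing pipeline.
model, preprocess = load_from_name("ViT-B-16", device=device, download_root="./")

# Load my finetuned weights (placeholder path). Assumption: the training script saves a
# dict with a "state_dict" whose keys are prefixed with "module." by DistributedDataParallel.
ckpt = torch.load("epoch_latest.pt", map_location="cpu")
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in ckpt["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()

labels = ["玩具角色A", "玩具角色B", "玩具角色C"]  # placeholders for my ~50 label texts
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("toy.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    # get_similarity normalizes both feature sets and applies logit_scale internally.
    logits_per_image, _ = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

best = probs.argmax().item()
print(labels[best], round(probs[best].item() * 100, 2))  # the "normalized max" I look at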

#!/usr/bin/env bash

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

# Number of GPUs per GPU worker
GPUS_PER_NODE=1 
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0 

export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

# data options
train_data=${DATAPATH}/datasets/**/lmdb/train
val_data=${DATAPATH}/datasets/**/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your custom ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""

# output options
output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_1gpu_22
save_step_frequency=999999 # disable it
save_epoch_frequency=100
log_interval=1
report_training_batch_acc="--report-training-batch-acc"
# report_training_batch_acc=""

# training hyper-params
context_length=52
warmup=100
batch_size=150
valid_batch_size=150
accum_freq=1
lr=15e-5
wd=0.001
max_epochs=800 # or you can alternatively specify --max-steps
valid_step_interval=999999
valid_epoch_interval=999999
vision_model=ViT-B-16
text_model=RoBERTa-wwm-ext-base-chinese
use_augment="--use-augment"
# use_augment=""

python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
          --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --accum-freq=${accum_freq} \
          --lr=${lr} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model} \
          --grad-checkpointing

Training log: the model seems to have converged in this range:

2023-10-23,12:10:54 | INFO | Rank 0 | Global Steps: 7969/9600 | Train Epoch: 665 [150/1800 (8%)] | Loss: 1.408282 | Image2Text Acc: 29.33 | Text2Image Acc: 26.00 | Data Time: 12.606s | Batch Time: 13.296s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:10:56 | INFO | Rank 0 | Global Steps: 7970/9600 | Train Epoch: 665 [300/1800 (17%)] | Loss: 1.338785 | Image2Text Acc: 32.00 | Text2Image Acc: 30.67 | Data Time: 1.709s | Batch Time: 2.382s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:10:57 | INFO | Rank 0 | Global Steps: 7971/9600 | Train Epoch: 665 [450/1800 (25%)] | Loss: 1.400602 | Image2Text Acc: 26.67 | Text2Image Acc: 26.00 | Data Time: 0.056s | Batch Time: 0.729s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:10:57 | INFO | Rank 0 | Global Steps: 7972/9600 | Train Epoch: 665 [600/1800 (33%)] | Loss: 1.333308 | Image2Text Acc: 28.00 | Text2Image Acc: 29.33 | Data Time: 0.053s | Batch Time: 0.726s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:05 | INFO | Rank 0 | Global Steps: 7973/9600 | Train Epoch: 665 [750/1800 (42%)] | Loss: 1.274247 | Image2Text Acc: 34.67 | Text2Image Acc: 32.67 | Data Time: 7.118s | Batch Time: 7.790s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:08 | INFO | Rank 0 | Global Steps: 7974/9600 | Train Epoch: 665 [900/1800 (50%)] | Loss: 1.272779 | Image2Text Acc: 32.67 | Text2Image Acc: 28.67 | Data Time: 2.264s | Batch Time: 2.937s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:09 | INFO | Rank 0 | Global Steps: 7975/9600 | Train Epoch: 665 [1050/1800 (58%)] | Loss: 1.371383 | Image2Text Acc: 32.00 | Text2Image Acc: 32.00 | Data Time: 0.053s | Batch Time: 0.725s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:10 | INFO | Rank 0 | Global Steps: 7976/9600 | Train Epoch: 665 [1200/1800 (67%)] | Loss: 1.302166 | Image2Text Acc: 35.33 | Text2Image Acc: 26.00 | Data Time: 0.052s | Batch Time: 0.725s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:21 | INFO | Rank 0 | Global Steps: 7977/9600 | Train Epoch: 665 [1350/1800 (75%)] | Loss: 1.261887 | Image2Text Acc: 32.00 | Text2Image Acc: 34.00 | Data Time: 11.000s | Batch Time: 11.673s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:22 | INFO | Rank 0 | Global Steps: 7978/9600 | Train Epoch: 665 [1500/1800 (83%)] | Loss: 1.318105 | Image2Text Acc: 30.67 | Text2Image Acc: 31.33 | Data Time: 0.056s | Batch Time: 0.728s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:23 | INFO | Rank 0 | Global Steps: 7979/9600 | Train Epoch: 665 [1650/1800 (92%)] | Loss: 1.307291 | Image2Text Acc: 34.00 | Text2Image Acc: 33.33 | Data Time: 0.055s | Batch Time: 0.733s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:24 | INFO | Rank 0 | Global Steps: 7980/9600 | Train Epoch: 665 [1800/1800 (100%)] | Loss: 1.278893 | Image2Text Acc: 30.67 | Text2Image Acc: 32.00 | Data Time: 0.051s | Batch Time: 0.723s | LR: 0.000011 | logit_scale: 2.761 | Global Batch Size: 150
2023-10-23,12:11:24 | INFO | Rank 0 | train LMDB file contains 1769 images and 1769 pairs.
2023-10-23,12:11:24 | INFO | Rank 0 | val LMDB file contains 611 images and 611 pairs.
2023-10-23,12:11:39 | INFO | Rank 0 | Saved checkpoint /root/autodl-tmp/data/experiments/muge_finetune_vit-b-16_roberta-base_bs128_1gpu_22/checkpoints/epoch_latest.pt (epoch 665 @ 7980 steps) (writing took 14.912322759628296 seconds)

Since I did not enable validation, I have no validation loss figures.

Current status

Finally, I manually tested a few samples from the training set; the normalized maximum image-text score was basically above 99. But on other photos, some were recognized correctly, some were not, and some were even misidentified (the normalized maximum score for an incorrect text also reached 99). Generalization seems rather poor.
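
Rather than spot-checking a handful of images by hand, I could measure top-1 accuracy over a held-out folder. A rough sketch building on the inference code above; the test/<label>/*.jpg layout is just an assumed way of organizing the held-out photos:

import os
from PIL import Image
import torch

# Assumes model, preprocess, labels, text and device were set up as in the inference
# sketch above, and that held-out photos live under test/<label_text>/ (hypothetical layout).
test_root = "test"
correct = total = 0

with torch.no_grad():
    for idx, label in enumerate(labels):
        label_dir = os.path.join(test_root, label)
        if not os.path.isdir(label_dir):
            continue
        for fname in os.listdir(label_dir):
            image = preprocess(Image.open(os.path.join(label_dir, fname)).convert("RGB"))
            image = image.unsqueeze(0).to(device)
            logits_per_image, _ = model.get_similarity(image, text)
            pred = logits_per_image.squeeze(0).argmax().item()
            correct += int(pred == idx)
            total += 1

print(f"held-out top-1 accuracy: {correct}/{total} = {correct / max(total, 1):.2%}")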

How can I improve this?

  1. Is the model currently underfitting or overfitting?
  2. Does the dataset need to be improved?
  3. Do the hyperparameters need tuning?
  4. How should I adjust things to get good performance (strong generalization, accurate classification)?
  5. Or does my overall approach need to change?

These are the problems I've run into. I'd really appreciate any guidance. Many thanks.

RobinHan24 commented 10 months ago

When you built the dataset, was the text exactly the same for every image of a given toy character?
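
If the captions are identical, a rough back-of-the-envelope estimate (using only the batch size and label count quoted above) suggests the ~26-35% in-batch accuracy in the log may be close to its ceiling rather than a sign that training failed to converge:

# Back-of-the-envelope estimate, using the numbers quoted in the post above.
num_captions = 48     # distinct toy characters, i.e. distinct caption strings
batch_size = 150      # global batch size from the training log

# If every image of a character carries the exact same caption, each caption appears
# roughly batch_size / num_captions times per batch, and the contrastive target is
# ambiguous among those identical copies, so in-batch accuracy tops out near their inverse.
copies_per_caption = batch_size / num_captions      # ~3.1
ceiling = 1.0 / copies_per_caption                  # ~0.32

print(f"~{copies_per_caption:.1f} copies per caption -> in-batch acc ceiling ~{ceiling:.0%}")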

HuangZiy commented 10 months ago

Yes. The problems described above were resolved after I switched to a larger model.