OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
MIT License

torch.load(args.resume, map_location="cpu") raises _pickle.UnpicklingError: invalid load key, '\xf7' when loading clip_cn_rn50.pt #21

Closed ZhaoyingAC closed 1 year ago

ZhaoyingAC commented 1 year ago

torch==1.9.0 torchvision==0.10.0 lmdb==1.3.0 cuda version 10.2
The above is my environment. Fine-tuning works with the default clip_cn_vit-b-16.pt, but it fails when I switch to clip_cn_rn50.pt. Below are the changes I made in the launch script: checkpoint=clip_cn_rn50.pt vision_model=RN50 text_model=RBT3-chinese

yangapku commented 1 year ago

Hi, could you share the complete script file? Ideally, please include the full log as well.

ZhaoyingAC commented 1 year ago


Script:

#!/usr/bin/env

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0 

export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

# data options
train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_rn50.pt # or specify your customed ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""

# output options
output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"
# report_training_batch_acc=""

# training hyper-params
context_length=52
warmup=100
batch_size=128
valid_batch_size=128
lr=5e-5
wd=0.001
max_epochs=3
valid_step_interval=150
valid_epoch_interval=1
vision_model=RN50
text_model=RBT3-chinese
use_augment="--use-augment"
# use_augment=""

python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
          --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --lr=${lr} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model}

Log:

cd Chinese-CLIP && sh run_scripts/muge_finetune_vit-b-16_rbt-base.sh ../../pretrainedModel/
anaconda3/envs/py38/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : cn_clip/training/main.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : localhost:8514
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_7g4dmrxl/none_xivgw557
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
anaconda3/envs/py38/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=localhost
  master_port=8514
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_7g4dmrxl/none_xivgw557/attempt_0/0/error.json
Loading vision model config from Chinese-CLIP/cn_clip/clip/model_configs/RN50.json
Loading text model config from Chinese-CLIP/cn_clip/clip/model_configs/RBT3-chinese.json
2022-11-29,19:12:54 | INFO | Rank 0 | train LMDB file contains 129380 images and 250314 pairs.
2022-11-29,19:12:54 | INFO | Rank 0 | val LMDB file contains 29806 images and 30588 pairs.
2022-11-29,19:12:54 | INFO | Rank 0 | Params:
2022-11-29,19:12:54 | INFO | Rank 0 |   aggregate: True
2022-11-29,19:12:54 | INFO | Rank 0 |   batch_size: 128
2022-11-29,19:12:54 | INFO | Rank 0 |   bert_weight_path: None
2022-11-29,19:12:54 | INFO | Rank 0 |   beta1: 0.9
2022-11-29,19:12:54 | INFO | Rank 0 |   beta2: 0.999
2022-11-29,19:12:54 | INFO | Rank 0 |   checkpoint_path: ../../pretrainedModel//experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints
2022-11-29,19:12:54 | INFO | Rank 0 |   clip_weight_path: None
2022-11-29,19:12:54 | INFO | Rank 0 |   context_length: 52
2022-11-29,19:12:54 | INFO | Rank 0 |   debug: False
2022-11-29,19:12:54 | INFO | Rank 0 |   device: cuda:0
2022-11-29,19:12:54 | INFO | Rank 0 |   eps: 1e-08
2022-11-29,19:12:54 | INFO | Rank 0 |   freeze_vision: False
2022-11-29,19:12:54 | INFO | Rank 0 |   grad_checkpointing: False
2022-11-29,19:12:54 | INFO | Rank 0 |   local_device_rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 |   local_rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 |   log_interval: 1
2022-11-29,19:12:54 | INFO | Rank 0 |   log_level: 20
2022-11-29,19:12:54 | INFO | Rank 0 |   log_path: ../../pretrainedModel//experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/out_2022-11-29-19-12-49.log
2022-11-29,19:12:54 | INFO | Rank 0 |   logs: ../../pretrainedModel//experiments/
2022-11-29,19:12:54 | INFO | Rank 0 |   lr: 5e-05
2022-11-29,19:12:54 | INFO | Rank 0 |   max_epochs: 3
2022-11-29,19:12:54 | INFO | Rank 0 |   max_steps: 5868
2022-11-29,19:12:54 | INFO | Rank 0 |   name: muge_finetune_vit-b-16_roberta-base_bs128_8gpu
2022-11-29,19:12:54 | INFO | Rank 0 |   num_workers: 4
2022-11-29,19:12:54 | INFO | Rank 0 |   precision: amp
2022-11-29,19:12:54 | INFO | Rank 0 |   rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 |   report_training_batch_acc: True
2022-11-29,19:12:54 | INFO | Rank 0 |   reset_data_offset: True
2022-11-29,19:12:54 | INFO | Rank 0 |   reset_optimizer: True
2022-11-29,19:12:54 | INFO | Rank 0 |   resume: ../../pretrainedModel//pretrained_weights/clip_cn_rn50.pt
2022-11-29,19:12:54 | INFO | Rank 0 |   save_epoch_frequency: 1
2022-11-29,19:12:54 | INFO | Rank 0 |   save_step_frequency: 999999
2022-11-29,19:12:54 | INFO | Rank 0 |   seed: 123
2022-11-29,19:12:54 | INFO | Rank 0 |   skip_aggregate: False
2022-11-29,19:12:54 | INFO | Rank 0 |   skip_scheduler: False
2022-11-29,19:12:54 | INFO | Rank 0 |   text_model: RBT3-chinese
2022-11-29,19:12:54 | INFO | Rank 0 |   train_data: ../../pretrainedModel//datasets/MUGE/lmdb/train
2022-11-29,19:12:54 | INFO | Rank 0 |   use_augment: True
2022-11-29,19:12:54 | INFO | Rank 0 |   use_bn_sync: False
2022-11-29,19:12:54 | INFO | Rank 0 |   val_data: ../../pretrainedModel//datasets/MUGE/lmdb/valid
2022-11-29,19:12:54 | INFO | Rank 0 |   valid_batch_size: 128
2022-11-29,19:12:54 | INFO | Rank 0 |   valid_epoch_interval: 1
2022-11-29,19:12:54 | INFO | Rank 0 |   valid_step_interval: 150
2022-11-29,19:12:54 | INFO | Rank 0 |   vision_model: RN50
2022-11-29,19:12:54 | INFO | Rank 0 |   warmup: 100
2022-11-29,19:12:54 | INFO | Rank 0 |   wd: 0.001
2022-11-29,19:12:54 | INFO | Rank 0 |   world_size: 1
2022-11-29,19:12:54 | INFO | Rank 0 | Use GPU: 0 for training
2022-11-29,19:12:54 | INFO | Rank 0 | => begin to load checkpoint '../../pretrainedModel//pretrained_weights/clip_cn_rn50.pt'
Traceback (most recent call last):
  File "Chinese-CLIP/cn_clip/training/main.py", line 279, in <module>
    main()
  File "Chinese-CLIP/cn_clip/training/main.py", line 199, in main
    checkpoint = torch.load(args.resume, map_location="cpu")
  File "anaconda3/envs/py38/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "anaconda3/envs/py38/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
_pickle.UnpicklingError: invalid load key, '\xf7'.
Exception in thread ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34447) of binary: anaconda3/envs/py38/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=localhost
  master_port=8514
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_7g4dmrxl/none_xivgw557/attempt_1/0/error.json
yangapku commented 1 year ago

@zhaoyinghe Got it, we'll take a look. In the meantime, please verify that the checkpoint downloaded completely: run md5sum clip_cn_rn50.pt and check that the MD5 is b84a90869f59f0bf2e5cd53e1b7ce533. If it differs, the download is likely incomplete; please re-download the checkpoint and try again.
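If `md5sum` is unavailable (e.g. on Windows), the same verification can be done from Python. A minimal sketch, streaming the file in chunks so a large checkpoint is not read into memory at once; the expected hash is the value quoted in this thread:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (path is illustrative):
# assert file_md5("clip_cn_rn50.pt") == "b84a90869f59f0bf2e5cd53e1b7ce533"
```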

ZhaoyingAC commented 1 year ago


Hi, the MD5 is indeed different. I retried the download several times and always get "9cd36409e2a01a026f1e5812b219ee11". However, the four checkpoints provided in the docs (all except huge-size) were downloaded and used the same way, and only the ResNet-50 one fails. Could you download from the ResNet-50 checkpoint link in the README on your side and check what its MD5 is?

yangapku commented 1 year ago

b84a90869f59f0bf2e5cd53e1b7ce533 is exactly the MD5 of the file downloaded from the link in our README.

yangapku commented 1 year ago

Here is a backup link for you to try as well: https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/checkpoints/clip_cn_rn50.pt

ZhaoyingAC commented 1 year ago

clip_cn_rn50.pt

The MD5s from both links now match and training runs fine. The earlier .pt file apparently had a problem, even though its size was the same. Thanks for the help!