Closed: ZhaoyingAC closed this issue 1 year ago
Hello, could you provide the complete script file? Ideally, please also include the full log.
Script:
#!/usr/bin/env bash
# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-worker training, these options should be set manually on each worker.
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/
DATAPATH=${1}
# data options
train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled
# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_rn50.pt # or specify your custom ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""
# output options
output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"
# report_training_batch_acc=""
# training hyper-params
context_length=52
warmup=100
batch_size=128
valid_batch_size=128
lr=5e-5
wd=0.001
max_epochs=3
valid_step_interval=150
valid_epoch_interval=1
vision_model=RN50
text_model=RBT3-chinese
use_augment="--use-augment"
# use_augment=""
python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
--master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
--train-data=${train_data} \
--val-data=${val_data} \
--resume=${resume} \
${reset_data_offset} \
${reset_optimizer} \
--logs=${output_base_dir} \
--name=${name} \
--save-step-frequency=${save_step_frequency} \
--save-epoch-frequency=${save_epoch_frequency} \
--log-interval=${log_interval} \
${report_training_batch_acc} \
--context-length=${context_length} \
--warmup=${warmup} \
--batch-size=${batch_size} \
--valid-batch-size=${valid_batch_size} \
--valid-step-interval=${valid_step_interval} \
--valid-epoch-interval=${valid_epoch_interval} \
--lr=${lr} \
--wd=${wd} \
--max-epochs=${max_epochs} \
--vision-model=${vision_model} \
${use_augment} \
--text-model=${text_model}
Log:
cd Chinese-CLIP && sh run_scripts/muge_finetune_vit-b-16_rbt-base.sh ../../pretrainedModel/
anaconda3/envs/py38/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : cn_clip/training/main.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : localhost:8514
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_7g4dmrxl/none_xivgw557
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
anaconda3/envs/py38/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=localhost
master_port=8514
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_7g4dmrxl/none_xivgw557/attempt_0/0/error.json
Loading vision model config from Chinese-CLIP/cn_clip/clip/model_configs/RN50.json
Loading text model config from Chinese-CLIP/cn_clip/clip/model_configs/RBT3-chinese.json
2022-11-29,19:12:54 | INFO | Rank 0 | train LMDB file contains 129380 images and 250314 pairs.
2022-11-29,19:12:54 | INFO | Rank 0 | val LMDB file contains 29806 images and 30588 pairs.
2022-11-29,19:12:54 | INFO | Rank 0 | Params:
2022-11-29,19:12:54 | INFO | Rank 0 | aggregate: True
2022-11-29,19:12:54 | INFO | Rank 0 | batch_size: 128
2022-11-29,19:12:54 | INFO | Rank 0 | bert_weight_path: None
2022-11-29,19:12:54 | INFO | Rank 0 | beta1: 0.9
2022-11-29,19:12:54 | INFO | Rank 0 | beta2: 0.999
2022-11-29,19:12:54 | INFO | Rank 0 | checkpoint_path: ../../pretrainedModel//experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints
2022-11-29,19:12:54 | INFO | Rank 0 | clip_weight_path: None
2022-11-29,19:12:54 | INFO | Rank 0 | context_length: 52
2022-11-29,19:12:54 | INFO | Rank 0 | debug: False
2022-11-29,19:12:54 | INFO | Rank 0 | device: cuda:0
2022-11-29,19:12:54 | INFO | Rank 0 | eps: 1e-08
2022-11-29,19:12:54 | INFO | Rank 0 | freeze_vision: False
2022-11-29,19:12:54 | INFO | Rank 0 | grad_checkpointing: False
2022-11-29,19:12:54 | INFO | Rank 0 | local_device_rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 | local_rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 | log_interval: 1
2022-11-29,19:12:54 | INFO | Rank 0 | log_level: 20
2022-11-29,19:12:54 | INFO | Rank 0 | log_path: ../../pretrainedModel//experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/out_2022-11-29-19-12-49.log
2022-11-29,19:12:54 | INFO | Rank 0 | logs: ../../pretrainedModel//experiments/
2022-11-29,19:12:54 | INFO | Rank 0 | lr: 5e-05
2022-11-29,19:12:54 | INFO | Rank 0 | max_epochs: 3
2022-11-29,19:12:54 | INFO | Rank 0 | max_steps: 5868
2022-11-29,19:12:54 | INFO | Rank 0 | name: muge_finetune_vit-b-16_roberta-base_bs128_8gpu
2022-11-29,19:12:54 | INFO | Rank 0 | num_workers: 4
2022-11-29,19:12:54 | INFO | Rank 0 | precision: amp
2022-11-29,19:12:54 | INFO | Rank 0 | rank: 0
2022-11-29,19:12:54 | INFO | Rank 0 | report_training_batch_acc: True
2022-11-29,19:12:54 | INFO | Rank 0 | reset_data_offset: True
2022-11-29,19:12:54 | INFO | Rank 0 | reset_optimizer: True
2022-11-29,19:12:54 | INFO | Rank 0 | resume: ../../pretrainedModel//pretrained_weights/clip_cn_rn50.pt
2022-11-29,19:12:54 | INFO | Rank 0 | save_epoch_frequency: 1
2022-11-29,19:12:54 | INFO | Rank 0 | save_step_frequency: 999999
2022-11-29,19:12:54 | INFO | Rank 0 | seed: 123
2022-11-29,19:12:54 | INFO | Rank 0 | skip_aggregate: False
2022-11-29,19:12:54 | INFO | Rank 0 | skip_scheduler: False
2022-11-29,19:12:54 | INFO | Rank 0 | text_model: RBT3-chinese
2022-11-29,19:12:54 | INFO | Rank 0 | train_data: ../../pretrainedModel//datasets/MUGE/lmdb/train
2022-11-29,19:12:54 | INFO | Rank 0 | use_augment: True
2022-11-29,19:12:54 | INFO | Rank 0 | use_bn_sync: False
2022-11-29,19:12:54 | INFO | Rank 0 | val_data: ../../pretrainedModel//datasets/MUGE/lmdb/valid
2022-11-29,19:12:54 | INFO | Rank 0 | valid_batch_size: 128
2022-11-29,19:12:54 | INFO | Rank 0 | valid_epoch_interval: 1
2022-11-29,19:12:54 | INFO | Rank 0 | valid_step_interval: 150
2022-11-29,19:12:54 | INFO | Rank 0 | vision_model: RN50
2022-11-29,19:12:54 | INFO | Rank 0 | warmup: 100
2022-11-29,19:12:54 | INFO | Rank 0 | wd: 0.001
2022-11-29,19:12:54 | INFO | Rank 0 | world_size: 1
2022-11-29,19:12:54 | INFO | Rank 0 | Use GPU: 0 for training
2022-11-29,19:12:54 | INFO | Rank 0 | => begin to load checkpoint '../../pretrainedModel//pretrained_weights/clip_cn_rn50.pt'
Traceback (most recent call last):
File "Chinese-CLIP/cn_clip/training/main.py", line 279, in <module>
main()
File "Chinese-CLIP/cn_clip/training/main.py", line 199, in main
checkpoint = torch.load(args.resume, map_location="cpu")
File "anaconda3/envs/py38/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "anaconda3/envs/py38/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
_pickle.UnpicklingError: invalid load key, '\xf7'.
Exception in thread
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34447) of binary: anaconda3/envs/py38/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=localhost
master_port=8514
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_7g4dmrxl/none_xivgw557/attempt_1/0/error.json
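For reference, the UnpicklingError ("invalid load key, '\xf7'") in the traceback above usually means the file on disk is not a valid torch.save archive, e.g. a truncated download or an error page saved under the .pt name. A minimal shell sanity check, assuming the checkpoint sits in the current directory:

# Checkpoints written by torch.save (PyTorch >= 1.6) are ZIP archives,
# so a healthy file begins with the two bytes "PK".
head -c 2 clip_cn_rn50.pt; echo
# Any other leading bytes (here the loader saw '\xf7') point to a
# corrupted or incomplete download.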
@zhaoyinghe Got it, we'll take a look. In the meantime, please also verify that the checkpoint was downloaded completely: run md5sum clip_cn_rn50.pt and check whether its MD5 is b84a90869f59f0bf2e5cd53e1b7ce533. If it is not, the download was most likely incomplete; please re-download the checkpoint and try again.
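A minimal sketch of that check as a shell snippet, assuming clip_cn_rn50.pt is in the current working directory:

# Compare the actual MD5 of the downloaded file against the expected one.
expected="b84a90869f59f0bf2e5cd53e1b7ce533"
actual=$(md5sum clip_cn_rn50.pt | awk '{print $1}')
if [ "${actual}" = "${expected}" ]; then
    echo "checksum OK"
else
    echo "checksum mismatch: got ${actual}; re-download the checkpoint"
fi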
Hello, the MD5 is indeed different. I retried the download several times and always get "9cd36409e2a01a026f1e5812b219ee11". However, the four model checkpoints linked in the doc (all except the huge-size one) were downloaded and used in exactly the same way, and only the ResNet-50 one fails. Could you download from the ResNet-50 checkpoint link in the README on your side and check what its MD5 is?
b84a90869f59f0bf2e5cd53e1b7ce533 is exactly what we get for clip_cn_rn50.pt downloaded from the link in our README.
The MD5s from both links now match and training runs. The earlier .pt apparently had a problem even though its size was the same. Thanks for the help!
torch==1.9.0, torchvision==0.10.0, lmdb==1.3.0, CUDA 10.2: that is my environment. Fine-tuning works with the default clip_cn_vit-b-16.pt, but fails after switching to clip_cn_rn50.pt. The changes to the launch script are: checkpoint=clip_cn_rn50.pt, vision_model=RN50, text_model=RBT3-chinese (shown below).
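For reference, these are the corresponding lines in the launch script above; note that the script stores the checkpoint path in the resume variable rather than a checkpoint variable:

# restore options: point resume at the RN50 checkpoint
resume=${DATAPATH}/pretrained_weights/clip_cn_rn50.pt
# model options: switch to the RN50 / RBT3-chinese pairing
vision_model=RN50
text_model=RBT3-chinese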