OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

AttributeError: 'NoneType' object has no attribute 'get' #258

Open 5zjk5 opened 5 months ago

5zjk5 commented 5 months ago

Can anyone tell me why this won't run? Versions:

torch 1.13.1
cuda 11.7
torchvision 0.16.0
torchaudio 2.1.0

Running it gives this error:

/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
    worker_ids = self._start_workers(worker_group)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
    self._pcontext = start_processes(
                     ^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 207, in start_processes
    redirs = to_map(redirects, nprocs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.virtualenvs/zjk_Chinese-CLIP-master/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
    map[i] = val_or_map.get(i, Std.NONE)
             ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'
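
A note on the versions listed above: torchvision 0.16.0 and torchaudio 2.1.0 are built against torch 2.1.x, not torch 1.13.1, and a mismatched pair can fail in odd ways like this — the torchvision "undefined symbol" warning that still appears after the torch 2.0.1 reinstall below points the same way. A sketch of one consistent set of pins, assuming CUDA 11.8 wheels are wanted:

# Hedged sketch: version pins matching torchvision 0.16.0 / torchaudio 2.1.0
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118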

Here is the config file:

#!/usr/bin/env

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0 

export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/

DATAPATH=${1}

# data options
train_data=${DATAPATH}/datasets/MUGE/lmdb/train
val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, the validation will be automatically disabled

# restore options
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your custom ckpt path to resume
reset_data_offset="--reset-data-offset"
reset_optimizer="--reset-optimizer"
# reset_optimizer=""

# output options
output_base_dir=${DATAPATH}/experiments/
name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
save_step_frequency=999999 # disable it
save_epoch_frequency=1
log_interval=1
report_training_batch_acc="--report-training-batch-acc"
# report_training_batch_acc=""

# training hyper-params
context_length=52
warmup=100
batch_size=128
valid_batch_size=128
accum_freq=1
lr=5e-5
wd=0.001
max_epochs=3 # or you can alternatively specify --max-steps
valid_step_interval=150
valid_epoch_interval=1
vision_model=ViT-B-16
text_model=RoBERTa-wwm-ext-base-chinese
use_augment="--use-augment"
# use_augment=""

python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
          --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --accum-freq=${accum_freq} \
          --lr=${lr} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model}
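
Since the FutureWarning above says torch.distributed.launch is deprecated in favor of torchrun, an equivalent launch would look like the sketch below (torchrun applies --use_env behavior by default, so that flag is dropped; all other flags are taken unchanged from the command above):

# Sketch: torchrun equivalent of the deprecated torch.distributed.launch call
torchrun --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
         --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
         cn_clip/training/main.py \
         --train-data=${train_data} --val-data=${val_data} --resume=${resume} \
         ${reset_data_offset} ${reset_optimizer} \
         --logs=${output_base_dir} --name=${name} \
         --save-step-frequency=${save_step_frequency} --save-epoch-frequency=${save_epoch_frequency} \
         --log-interval=${log_interval} ${report_training_batch_acc} \
         --context-length=${context_length} --warmup=${warmup} \
         --batch-size=${batch_size} --valid-batch-size=${valid_batch_size} \
         --valid-step-interval=${valid_step_interval} --valid-epoch-interval=${valid_epoch_interval} \
         --accum-freq=${accum_freq} --lr=${lr} --wd=${wd} \
         --max-epochs=${max_epochs} --vision-model=${vision_model} \
         ${use_augment} --text-model=${text_model}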
5zjk5 commented 5 months ago

Without changing the config file, I uninstalled torch and reinstalled it with pip install torch. Now it seems to run and produces logs, but the failure has changed to an out-of-memory error. After the reinstall, the torch version is 2.0.1.

/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/home/user/anaconda3/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/user/anaconda3/lib/python3.11/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda9SetDeviceEi'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Loading vision model config from /home/user/zjk/chinese_clip/cn_clip/clip/model_configs/ViT-B-16.json
Loading text model config from /home/user/zjk/chinese_clip/cn_clip/clip/model_configs/RoBERTa-wwm-ext-base-chinese.json
2024-02-27,11:27:53 | INFO | Rank 0 | train LMDB file contains 129380 images and 250314 pairs.
2024-02-27,11:27:53 | INFO | Rank 0 | val LMDB file contains 29806 images and 30588 pairs.
2024-02-27,11:27:53 | INFO | Rank 0 | Params:
2024-02-27,11:27:53 | INFO | Rank 0 |   accum_freq: 1
2024-02-27,11:27:53 | INFO | Rank 0 |   aggregate: True
2024-02-27,11:27:53 | INFO | Rank 0 |   batch_size: 128
2024-02-27,11:27:53 | INFO | Rank 0 |   bert_weight_path: None
2024-02-27,11:27:53 | INFO | Rank 0 |   beta1: 0.9
2024-02-27,11:27:53 | INFO | Rank 0 |   beta2: 0.98
2024-02-27,11:27:53 | INFO | Rank 0 |   checkpoint_path: datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints
2024-02-27,11:27:53 | INFO | Rank 0 |   clip_weight_path: None
2024-02-27,11:27:53 | INFO | Rank 0 |   context_length: 52
2024-02-27,11:27:53 | INFO | Rank 0 |   debug: False
2024-02-27,11:27:53 | INFO | Rank 0 |   device: cuda:0
2024-02-27,11:27:53 | INFO | Rank 0 |   distllation: False
2024-02-27,11:27:53 | INFO | Rank 0 |   eps: 1e-06
2024-02-27,11:27:53 | INFO | Rank 0 |   freeze_vision: False
2024-02-27,11:27:53 | INFO | Rank 0 |   gather_with_grad: False
2024-02-27,11:27:53 | INFO | Rank 0 |   grad_checkpointing: False
2024-02-27,11:27:53 | INFO | Rank 0 |   kd_loss_weight: 0.5
2024-02-27,11:27:53 | INFO | Rank 0 |   local_device_rank: 0
2024-02-27,11:27:53 | INFO | Rank 0 |   log_interval: 1
2024-02-27,11:27:53 | INFO | Rank 0 |   log_level: 20
2024-02-27,11:27:53 | INFO | Rank 0 |   log_path: datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/out_2024-02-27-03-27-48.log
2024-02-27,11:27:53 | INFO | Rank 0 |   logs: datapath/experiments/
2024-02-27,11:27:53 | INFO | Rank 0 |   lr: 5e-05
2024-02-27,11:27:53 | INFO | Rank 0 |   mask_ratio: 0
2024-02-27,11:27:53 | INFO | Rank 0 |   max_epochs: 3
2024-02-27,11:27:53 | INFO | Rank 0 |   max_steps: 5868
2024-02-27,11:27:53 | INFO | Rank 0 |   name: muge_finetune_vit-b-16_roberta-base_bs128_8gpu
2024-02-27,11:27:53 | INFO | Rank 0 |   num_workers: 4
2024-02-27,11:27:53 | INFO | Rank 0 |   precision: amp
2024-02-27,11:27:53 | INFO | Rank 0 |   rank: 0
2024-02-27,11:27:53 | INFO | Rank 0 |   report_training_batch_acc: True
2024-02-27,11:27:53 | INFO | Rank 0 |   reset_data_offset: True
2024-02-27,11:27:53 | INFO | Rank 0 |   reset_optimizer: True
2024-02-27,11:27:53 | INFO | Rank 0 |   resume: datapath/pretrained_weights/clip_cn_vit-b-16.pt
2024-02-27,11:27:53 | INFO | Rank 0 |   save_epoch_frequency: 1
2024-02-27,11:27:53 | INFO | Rank 0 |   save_step_frequency: 999999
2024-02-27,11:27:53 | INFO | Rank 0 |   seed: 123
2024-02-27,11:27:53 | INFO | Rank 0 |   skip_aggregate: False
2024-02-27,11:27:53 | INFO | Rank 0 |   skip_scheduler: False
2024-02-27,11:27:53 | INFO | Rank 0 |   teacher_model_name: None
2024-02-27,11:27:53 | INFO | Rank 0 |   text_model: RoBERTa-wwm-ext-base-chinese
2024-02-27,11:27:53 | INFO | Rank 0 |   train_data: datapath/datasets/MUGE/lmdb/train
2024-02-27,11:27:53 | INFO | Rank 0 |   use_augment: True
2024-02-27,11:27:53 | INFO | Rank 0 |   use_bn_sync: False
2024-02-27,11:27:53 | INFO | Rank 0 |   use_flash_attention: False
2024-02-27,11:27:53 | INFO | Rank 0 |   val_data: datapath/datasets/MUGE/lmdb/valid
2024-02-27,11:27:53 | INFO | Rank 0 |   valid_batch_size: 128
2024-02-27,11:27:53 | INFO | Rank 0 |   valid_epoch_interval: 1
2024-02-27,11:27:53 | INFO | Rank 0 |   valid_num_workers: 1
2024-02-27,11:27:53 | INFO | Rank 0 |   valid_step_interval: 150
2024-02-27,11:27:53 | INFO | Rank 0 |   vision_model: ViT-B-16
2024-02-27,11:27:53 | INFO | Rank 0 |   warmup: 100
2024-02-27,11:27:53 | INFO | Rank 0 |   wd: 0.001
2024-02-27,11:27:53 | INFO | Rank 0 |   world_size: 1
2024-02-27,11:27:53 | INFO | Rank 0 | Use GPU: 0 for training
2024-02-27,11:27:53 | INFO | Rank 0 | => begin to load checkpoint 'datapath/pretrained_weights/clip_cn_vit-b-16.pt'
2024-02-27,11:27:54 | INFO | Rank 0 | => loaded checkpoint 'datapath/pretrained_weights/clip_cn_vit-b-16.pt' (epoch 15 @ 0 steps)
Traceback (most recent call last):
  File "/home/user/zjk/chinese_clip/cn_clip/training/main.py", line 350, in <module>
    main()
  File "/home/user/zjk/chinese_clip/cn_clip/training/main.py", line 298, in main
    num_steps_this_epoch = train(model, data, epoch, optimizer, scaler, scheduler, args, steps)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/training/train.py", line 194, in train
    total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/training/train.py", line 23, in get_loss
    image_features, text_features, logit_scale = model(images, texts, args.mask_ratio)
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 409, in forward
    image_features = self.encode_image(image, mask_ratio)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 394, in encode_image
    return self.visual(image.type(self.dtype), mask_ratio)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 279, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 227, in forward
    return self.resblocks(x)
           ^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 209, in forward
    x = x + self.attention(self.ln_1(x))
                           ^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 176, in forward
    ret = super().forward(x.type(torch.float32))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
           ^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 31.75 GiB total capacity; 4.28 GiB already allocated; 18.50 MiB free; 4.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception in thread Thread-1 (_monitor):
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20794) of binary: /home/user/.virtualenvs/zjk_Chinese-CLIP-master/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-27_11:28:01
  host      : need09
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 20794)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
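
The OOM message itself suggests one knob: setting max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation. A sketch with an illustrative value (as the later comments show, the actual cause here was a busy GPU, so this alone would not have fixed it):

# Hedged sketch: allocator setting suggested by the OOM message; 128 is an example value
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128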
5zjk5 commented 5 months ago

On the out-of-memory error: checking on the Linux box with nvidia-smi, the GPUs all looked idle, and shrinking batch_size to 32 didn't help either. I came across a suggestion that a previously run program may not have released its memory, leaving the space occupied, so I tried freeing memory and modified the config file:

#!/usr/bin/env

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=6   ################### was 0 originally

Now there's no error, but it hangs right at the start. (Likely because RANK here is the worker/node rank rather than a GPU index, the launcher ends up waiting for workers that don't exist.)

5zjk5 commented 5 months ago
#!/usr/bin/env

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-gpu workers training, these options should be manually set for each worker. 
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}

# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0

I changed this back first. In the end, running nvitop showed that GPU 0 was in use by someone else, which is why the out-of-memory error kept appearing. Looking at the training code (cn_clip/training/main.py), the GPU is selected with int(os.environ["LOCAL_RANK"]), so I just swapped in a free GPU and it ran.
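
A less invasive route than editing the code, as a sketch: restrict which devices the job can see with CUDA_VISIBLE_DEVICES, so that LOCAL_RANK 0 maps onto the free GPU (index 5 here is an assumption; pick whichever nvitop shows as idle):

# Hedged sketch: make only the idle GPU visible; inside the job it becomes cuda:0
export CUDA_VISIBLE_DEVICES=5
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}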

5zjk5 commented 5 months ago

Training is done; now on to feature extraction with the provided script. The model settings there take a little understanding, so I've added comments for resume and vision-model. If anything is wrong, corrections are welcome.

#!/usr/bin/env

cd /home/wanggang/zjk/chinese_clip/
export CUDA_VISIBLE_DEVICES=5
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip

split=valid # compute features for the valid or the test split
# the finetuned model checkpoint to load
resume=datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints/epoch_latest.pt

# vision-model specifies which base model the finetuned checkpoint was trained from
python -u cn_clip/eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="datapath/datasets/MUGE/lmdb/${split}/imgs" \
    --text-data="datapath/datasets/MUGE/${split}_texts.jsonl" \
    --img-batch-size=64 \
    --text-batch-size=64 \
    --context-length=52 \
    --resume=${resume} \
    --vision-model=ViT-B-16 \
    --text-model=RoBERTa-wwm-ext-base-chinese
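
For what it's worth, extract_features.py should write the features as JSONL files under the dataset directory; the path below follows the repo's documented naming convention but is an assumption worth double-checking. A minimal sketch for inspecting the output:

# Hedged sketch: peek at the first extracted image feature record (path assumed from repo docs)
head -n 1 datapath/datasets/MUGE/valid_imgs.img_feat.jsonl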
xfxssr commented 5 months ago

Excellent, thanks! I had the same problem, and after reading your replies I got it running too.