5zjk5 opened this issue 9 months ago
Without changing the config file, I uninstalled torch and reinstalled it with pip. After the reinstall it seems to run and produces logs, but it now fails with an out-of-memory error instead. After the reinstall, the torch version is 2.0.1.
/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/home/user/anaconda3/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/user/anaconda3/lib/python3.11/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda9SetDeviceEi'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
Loading vision model config from /home/user/zjk/chinese_clip/cn_clip/clip/model_configs/ViT-B-16.json
Loading text model config from /home/user/zjk/chinese_clip/cn_clip/clip/model_configs/RoBERTa-wwm-ext-base-chinese.json
2024-02-27,11:27:53 | INFO | Rank 0 | train LMDB file contains 129380 images and 250314 pairs.
2024-02-27,11:27:53 | INFO | Rank 0 | val LMDB file contains 29806 images and 30588 pairs.
2024-02-27,11:27:53 | INFO | Rank 0 | Params:
2024-02-27,11:27:53 | INFO | Rank 0 | accum_freq: 1
2024-02-27,11:27:53 | INFO | Rank 0 | aggregate: True
2024-02-27,11:27:53 | INFO | Rank 0 | batch_size: 128
2024-02-27,11:27:53 | INFO | Rank 0 | bert_weight_path: None
2024-02-27,11:27:53 | INFO | Rank 0 | beta1: 0.9
2024-02-27,11:27:53 | INFO | Rank 0 | beta2: 0.98
2024-02-27,11:27:53 | INFO | Rank 0 | checkpoint_path: datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints
2024-02-27,11:27:53 | INFO | Rank 0 | clip_weight_path: None
2024-02-27,11:27:53 | INFO | Rank 0 | context_length: 52
2024-02-27,11:27:53 | INFO | Rank 0 | debug: False
2024-02-27,11:27:53 | INFO | Rank 0 | device: cuda:0
2024-02-27,11:27:53 | INFO | Rank 0 | distllation: False
2024-02-27,11:27:53 | INFO | Rank 0 | eps: 1e-06
2024-02-27,11:27:53 | INFO | Rank 0 | freeze_vision: False
2024-02-27,11:27:53 | INFO | Rank 0 | gather_with_grad: False
2024-02-27,11:27:53 | INFO | Rank 0 | grad_checkpointing: False
2024-02-27,11:27:53 | INFO | Rank 0 | kd_loss_weight: 0.5
2024-02-27,11:27:53 | INFO | Rank 0 | local_device_rank: 0
2024-02-27,11:27:53 | INFO | Rank 0 | log_interval: 1
2024-02-27,11:27:53 | INFO | Rank 0 | log_level: 20
2024-02-27,11:27:53 | INFO | Rank 0 | log_path: datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/out_2024-02-27-03-27-48.log
2024-02-27,11:27:53 | INFO | Rank 0 | logs: datapath/experiments/
2024-02-27,11:27:53 | INFO | Rank 0 | lr: 5e-05
2024-02-27,11:27:53 | INFO | Rank 0 | mask_ratio: 0
2024-02-27,11:27:53 | INFO | Rank 0 | max_epochs: 3
2024-02-27,11:27:53 | INFO | Rank 0 | max_steps: 5868
2024-02-27,11:27:53 | INFO | Rank 0 | name: muge_finetune_vit-b-16_roberta-base_bs128_8gpu
2024-02-27,11:27:53 | INFO | Rank 0 | num_workers: 4
2024-02-27,11:27:53 | INFO | Rank 0 | precision: amp
2024-02-27,11:27:53 | INFO | Rank 0 | rank: 0
2024-02-27,11:27:53 | INFO | Rank 0 | report_training_batch_acc: True
2024-02-27,11:27:53 | INFO | Rank 0 | reset_data_offset: True
2024-02-27,11:27:53 | INFO | Rank 0 | reset_optimizer: True
2024-02-27,11:27:53 | INFO | Rank 0 | resume: datapath/pretrained_weights/clip_cn_vit-b-16.pt
2024-02-27,11:27:53 | INFO | Rank 0 | save_epoch_frequency: 1
2024-02-27,11:27:53 | INFO | Rank 0 | save_step_frequency: 999999
2024-02-27,11:27:53 | INFO | Rank 0 | seed: 123
2024-02-27,11:27:53 | INFO | Rank 0 | skip_aggregate: False
2024-02-27,11:27:53 | INFO | Rank 0 | skip_scheduler: False
2024-02-27,11:27:53 | INFO | Rank 0 | teacher_model_name: None
2024-02-27,11:27:53 | INFO | Rank 0 | text_model: RoBERTa-wwm-ext-base-chinese
2024-02-27,11:27:53 | INFO | Rank 0 | train_data: datapath/datasets/MUGE/lmdb/train
2024-02-27,11:27:53 | INFO | Rank 0 | use_augment: True
2024-02-27,11:27:53 | INFO | Rank 0 | use_bn_sync: False
2024-02-27,11:27:53 | INFO | Rank 0 | use_flash_attention: False
2024-02-27,11:27:53 | INFO | Rank 0 | val_data: datapath/datasets/MUGE/lmdb/valid
2024-02-27,11:27:53 | INFO | Rank 0 | valid_batch_size: 128
2024-02-27,11:27:53 | INFO | Rank 0 | valid_epoch_interval: 1
2024-02-27,11:27:53 | INFO | Rank 0 | valid_num_workers: 1
2024-02-27,11:27:53 | INFO | Rank 0 | valid_step_interval: 150
2024-02-27,11:27:53 | INFO | Rank 0 | vision_model: ViT-B-16
2024-02-27,11:27:53 | INFO | Rank 0 | warmup: 100
2024-02-27,11:27:53 | INFO | Rank 0 | wd: 0.001
2024-02-27,11:27:53 | INFO | Rank 0 | world_size: 1
2024-02-27,11:27:53 | INFO | Rank 0 | Use GPU: 0 for training
2024-02-27,11:27:53 | INFO | Rank 0 | => begin to load checkpoint 'datapath/pretrained_weights/clip_cn_vit-b-16.pt'
2024-02-27,11:27:54 | INFO | Rank 0 | => loaded checkpoint 'datapath/pretrained_weights/clip_cn_vit-b-16.pt' (epoch 15 @ 0 steps)
Traceback (most recent call last):
File "/home/user/zjk/chinese_clip/cn_clip/training/main.py", line 350, in <module>
main()
File "/home/user/zjk/chinese_clip/cn_clip/training/main.py", line 298, in main
num_steps_this_epoch = train(model, data, epoch, optimizer, scaler, scheduler, args, steps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/training/train.py", line 194, in train
total_loss, acc = get_loss(model, images, texts, loss_img, loss_txt, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/training/train.py", line 23, in get_loss
image_features, text_features, logit_scale = model(images, texts, args.mask_ratio)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 409, in forward
image_features = self.encode_image(image, mask_ratio)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 394, in encode_image
return self.visual(image.type(self.dtype), mask_ratio)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 279, in forward
x = self.transformer(x)
^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 227, in forward
return self.resblocks(x)
^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 209, in forward
x = x + self.attention(self.ln_1(x))
^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/zjk/chinese_clip/cn_clip/clip/model.py", line 176, in forward
ret = super().forward(x.type(torch.float32))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 190, in forward
return F.layer_norm(
^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 2515, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 31.75 GiB total capacity; 4.28 GiB already allocated; 18.50 MiB free; 4.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception in thread Thread-1 (_monitor):
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20794) of binary: /home/user/.virtualenvs/zjk_Chinese-CLIP-master/bin/python3
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-27_11:28:01
host : need09
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 20794)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Out-of-memory error. Checking on the Linux machine with nvidia-smi, the resources were all idle, and reducing batch_size to 32 didn't help either. I came across one explanation: a previously run program may not have released its memory, leaving the GPU memory occupied, so that memory needs to be freed. Config file modification:
#!/usr/bin/env
# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-gpu workers training, these options should be manually set for each worker.
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=6 ################### originally 0
Now there is no error, but it hangs right at the start, presumably because with WORKER_CNT=1 a worker rank of 6 leaves the distributed setup waiting for peers that never join.
#!/usr/bin/env
# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-gpu workers training, these options should be manually set for each worker.
# After setting the options, please run the script on each worker.
# Command: bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
# Number of GPUs per GPU worker
GPUS_PER_NODE=1
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=localhost
# The port for communication
export MASTER_PORT=8514
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0
I changed this back first. In the end, using the nvitop command, I saw that GPU 0 was being used by someone else, which is why it kept reporting out-of-memory errors. Looking at the training code, the place in /train/main that sets the GPU uses int(os.environ["LOCAL_RANK"]) to choose which GPU to use; I simply replaced that with the index of an idle GPU and it ran.
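For reference, here is a minimal sketch of that idea; the variable names and the hard-coded index are illustrative only, not the exact contents of cn_clip/training/main.py.

```python
import os
import torch

# The launcher exports LOCAL_RANK for each process; as described above,
# main.py turns it into a CUDA device index (exact code paraphrased here).
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Quick-and-dirty fix used in this issue: hard-code a GPU that nvitop shows
# as idle. The index 5 is only an example, not something from the repo.
free_gpu = 5
torch.cuda.set_device(free_gpu)
device = torch.device("cuda", free_gpu)
print(f"launcher requested GPU {local_rank}, actually using GPU {free_gpu}")
```

An alternative that avoids editing the code is to set CUDA_VISIBLE_DEVICES to the idle card before launching (as the feature-extraction script below does with CUDA_VISIBLE_DEVICES=5): the chosen physical GPU then appears as logical device 0, so LOCAL_RANK=0 maps onto it automatically.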
The model has been trained; now for feature extraction using the provided script. The places where the model is specified need a bit of explanation: I have added comments for resume and vision-model. If anything is wrong, corrections are welcome.
#!/usr/bin/env
cd /home/wanggang/zjk/chinese_clip/
export CUDA_VISIBLE_DEVICES=5
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip
split=valid # compute features for the valid or test split
# path to the fine-tuned model
resume=datapath/experiments/muge_finetune_vit-b-16_roberta-base_bs128_8gpu/checkpoints/epoch_latest.pt
# vision-model specifies which base architecture the fine-tuned model was trained from
python -u cn_clip/eval/extract_features.py \
--extract-image-feats \
--extract-text-feats \
--image-data="datapath/datasets/MUGE/lmdb/${split}/imgs" \
--text-data="datapath/datasets/MUGE/${split}_texts.jsonl" \
--img-batch-size=64 \
--text-batch-size=64 \
--context-length=52 \
--resume=${resume} \
--vision-model=ViT-B-16 \
--text-model=RoBERTa-wwm-ext-base-chinese
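As a quick sanity check once the script finishes, the extracted features can be inspected. Note that the output file name and record fields below are assumptions based on my understanding of the repo's defaults (features written as JSONL next to the dataset); verify them against the script's actual log output.

```python
import json

# Assumed default output location for the valid split's text features;
# adjust the path if extract_features.py reports a different one.
feat_file = "datapath/datasets/MUGE/valid_texts.txt_feat.jsonl"

with open(feat_file, "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record.keys())                   # expected: an id field plus a "feature" list
print(len(record.get("feature", [])))  # should be 512 for ViT-B-16 / RoBERTa-wwm-ext-base
```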
Great, thanks a lot. I had the same problem on my side and got it running after following your reply.
Can anyone explain why this happens for me? It won't run. Versions:
Error after running:
Below is the config file: