erlan-11 opened this issue 3 months ago
It's most likely that the parameters are wrong:

GPUS_PER_NODE=1 # Number of GPUs per machine
WORKER_CNT=1 # Number of training machines (GPU workers); for single-worker training, please set to 1
export RANK=0 # The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0

Check the RANK and WORKER_CNT parameters and make sure that RANK is smaller than WORKER_CNT.
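If it helps, a quick sanity check in Python, assuming both variables are exported to the environment (in the stock run script only RANK is exported, so this is purely illustrative):

import os

# Illustrative check for the settings above; RANK and WORKER_CNT must be
# exported to the environment for this script to see them.
rank = int(os.environ.get("RANK", "0"))
worker_cnt = int(os.environ.get("WORKER_CNT", "1"))
assert 0 <= rank < worker_cnt, (
    f"RANK={rank} must lie in {{0, ..., {worker_cnt - 1}}}"
)
print("RANK/WORKER_CNT are consistent")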
@ChesonHuang I get the same error, but I have already set those parameters.
It turned out to be mainly a dataset-splitting problem; solved.
@erlan-11 Could you explain where you fixed it? I'm still running into this problem.
@erlan-11 I thought it was a problem with the tab splitting, but when I inspected the data, the keys and data correspond one-to-one, and there are no blank entries that would produce a NoneType, so I don't know where the problem is. Could you tell me how you handled it? Thanks!
@erlan-11
In my case the problem was that, when the dataset was split, some image_id had no corresponding caption in the text file. You can write a small script to check whether the same thing happens with your data; see the sketch below.
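A minimal sketch of such a check, assuming the standard Chinese-CLIP data layout (an image tsv with one image_id<TAB>base64 record per line, and a texts jsonl whose records carry an image_ids list); the file paths are placeholders:

import json

# Placeholder paths; point these at your own split.
TSV_PATH = "train_imgs.tsv"
JSONL_PATH = "train_texts.jsonl"

# Every image_id referenced by a caption.
referenced = set()
with open(JSONL_PATH, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        referenced.update(str(i) for i in record.get("image_ids", []))

# Every image_id actually present in the image tsv.
present = set()
with open(TSV_PATH, encoding="utf-8") as f:
    for line in f:
        image_id, _, _ = line.partition("\t")
        present.add(image_id.strip())

missing = referenced - present
print(f"{len(missing)} image_ids are referenced in the texts but missing from the tsv")
for image_id in sorted(missing)[:20]:
    print(image_id)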
Traceback (most recent call last):
  File "/root/Chinese-CLIP/cn_clip/training/main.py", line 350, in <module>
    main()
  File "/root/Chinese-CLIP/cn_clip/training/main.py", line 298, in main
    num_steps_this_epoch = train(model, data, epoch, optimizer, scaler, scheduler, args, steps)
  File "/root/Chinese-CLIP/cn_clip/training/train.py", line 165, in train
    batch = next(data_iter)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/Taidi/Chinese-CLIP/cn_clip/training/data.py", line 109, in __getitem__
    image_b64 = self.txn_imgs.get("{}".format(image_id).encode('utf-8')).tobytes()
AttributeError: 'NoneType' object has no attribute 'tobytes'
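The failing line shows that self.txn_imgs.get(...) returned None: the image_id requested by the sampler has no entry in the imgs LMDB, so .tobytes() is called on None. A minimal sketch to probe the LMDB directly (the LMDB path and image_id below are placeholders):

import lmdb

# Placeholders; point at your imgs LMDB and a suspect image_id.
LMDB_IMGS_PATH = "datapath/datasets/MyDataset/lmdb/train/imgs"
IMAGE_ID = "123456"

env = lmdb.open(LMDB_IMGS_PATH, readonly=True, create=False, lock=False)
with env.begin(buffers=True) as txn:
    value = txn.get(IMAGE_ID.encode("utf-8"))
    if value is None:
        print(f"image_id {IMAGE_ID} is missing; __getitem__ would crash on it")
    else:
        print(f"image_id {IMAGE_ID} found, {len(bytes(value))} bytes of base64")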
Exception in thread
[2024-04-08 00:26:44,250] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 114557) of binary: /root/miniconda3/envs/ML/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/ML/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/ML/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
cn_clip/training/main.py FAILED
Failures: