adda1221 opened this issue 1 year ago
How can I solve this error? The full log and traceback are quoted at the bottom of this issue.
Hi, it seems that your experiment was not launched in distributed mode. Specifically, the sampler's `epoch` attribute is expected to be set by the training loop at the start of every epoch, but `args.distributed` appears to be `False` in your run, so that step is skipped.
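To make the mechanism concrete, here is a minimal runnable sketch of that guard. The names (`Args`, `SamplerStub`, `set_epoch`) are hypothetical stand-ins mirroring the MAE-style loop in `main_pretrain.py`, not the repository's exact code:

```python
class Args:
    distributed = False  # what `python main_pretrain.py ...` effectively gives you

class SamplerStub:
    def set_epoch(self, epoch):
        self.epoch = epoch  # the attribute the real sampler later reads in __iter__

args = Args()
sampler_train = SamplerStub()

for epoch in range(2):
    if args.distributed:
        # Skipped entirely when distributed is False, so sampler_train.epoch
        # is never created and iteration fails later with AttributeError.
        sampler_train.set_epoch(epoch)

print(hasattr(sampler_train, "epoch"))  # False -> the crash you see at iteration time
```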
within your run. I would guess that you are running something like python main_pretrain.py ...
. You may try using torchrun
or other distributed launching commands instead. Here is a tutorial.
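For example, on a single machine the launch would look roughly like this (the GPU count and trailing arguments are placeholders; keep whatever flags you already pass):

```
# one node; replace 8 with the number of local GPUs you want to use,
# and <your existing args> with the arguments you already pass today
torchrun --nproc_per_node=8 main_pretrain.py <your existing args>
```

`torchrun` sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` in the environment, which MAE-style `init_distributed_mode` code typically reads to turn `args.distributed` on.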
```
[08:15:21.504933] read dataset config from configs/data/pretrain/EN.yaml
[08:15:21.513275] DATASET CONFIG:
[08:15:21.513295] {'META': ['/HOME/llama-adapter/datasets/cc3m.csv']}
[08:18:21.093524] /HOME/llama-adapter/datasets/cc3m.csv: len 3318333
[08:18:22.476513] total length: 3318333
[08:18:23.899807] <data.dataset.PretrainDataset object at 0x7f16d0076790>
[08:18:23.899933] Sampler_train = <util.misc.DistributedSubEpochSampler object at 0x7f16d00760d0>
[08:18:24.745975] Start training for 400 epochs
[08:18:24.753625] log_dir: ./output
Traceback (most recent call last):
  File "main_pretrain.py", line 202, in <module>
    main(args)
  File "main_pretrain.py", line 171, in main
    train_stats = train_one_epoch(
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/engine_pretrain.py", line 31, in train_one_epoch
    for data_iter_step, (examples, labels, example_mask, imgs) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 149, in log_every
    for obj in iterable:
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1084, in __init__
    self._reset(loader, first_iter=True)
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
    self._try_put_index()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
    index = self._next_index()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 623, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 244, in __iter__
    sampler_iter = iter(self.sampler)
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 380, in __iter__
    g.manual_seed(self.seed + self.epoch // self.split_epoch)
AttributeError: 'DistributedSubEpochSampler' object has no attribute 'epoch'
```
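The last two frames show exactly why the launch mode matters: `DistributedSubEpochSampler` reads `self.epoch` inside `__iter__`, but that attribute only comes into existence once `set_epoch()` has been called. A minimal self-contained sketch of that interaction (the class body is illustrative, modeled on the pattern in `util/misc.py`, not copied from it):

```python
import torch
from torch.utils.data import Sampler

class DistributedSubEpochSampler(Sampler):
    """Illustrative stand-in for util.misc.DistributedSubEpochSampler."""

    def __init__(self, dataset_len, seed=0, split_epoch=1):
        self.dataset_len = dataset_len
        self.seed = seed
        self.split_epoch = split_epoch
        # NOTE: self.epoch is deliberately NOT initialized here; the real
        # class relies on the training loop calling set_epoch() each epoch.

    def set_epoch(self, epoch):
        self.epoch = epoch  # only ever called under `if args.distributed:`

    def __iter__(self):
        g = torch.Generator()
        # The failing line: without a prior set_epoch() call, self.epoch
        # does not exist and Python raises AttributeError.
        g.manual_seed(self.seed + self.epoch // self.split_epoch)
        yield from torch.randperm(self.dataset_len, generator=g).tolist()

    def __len__(self):
        return self.dataset_len

sampler = DistributedSubEpochSampler(dataset_len=10)
try:
    next(iter(sampler))
except AttributeError as e:
    print(e)  # 'DistributedSubEpochSampler' object has no attribute 'epoch'

sampler.set_epoch(0)        # what a torchrun-launched training loop does each epoch
print(next(iter(sampler)))  # now iteration works
```

Launching with `torchrun` makes `args.distributed` `True`, so the training loop performs the `set_epoch()` call and the sampler never hits the missing attribute.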