OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Error when pretraining llama-adapter-v2-multimodal #100

Open adda1221 opened 1 year ago

adda1221 commented 1 year ago

```
[08:15:21.504933] read dataset config from configs/data/pretrain/EN.yaml
[08:15:21.513275] DATASET CONFIG:
[08:15:21.513295] {'META': ['/HOME/llama-adapter/datasets/cc3m.csv']}
[08:18:21.093524] /HOME/llama-adapter/datasets/cc3m.csv: len 3318333
[08:18:22.476513] total length: 3318333
[08:18:23.899807] <data.dataset.PretrainDataset object at 0x7f16d0076790>
[08:18:23.899933] Sampler_train = <util.misc.DistributedSubEpochSampler object at 0x7f16d00760d0>
[08:18:24.745975] Start training for 400 epochs
[08:18:24.753625] log_dir: ./output
Traceback (most recent call last):
  File "main_pretrain.py", line 202, in <module>
    main(args)
  File "main_pretrain.py", line 171, in main
    train_stats = train_one_epoch(
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/engine_pretrain.py", line 31, in train_one_epoch
    for data_iter_step, (examples, labels, example_mask, imgs) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 149, in log_every
    for obj in iterable:
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1084, in __init__
    self._reset(loader, first_iter=True)
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
    self._try_put_index()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
    index = self._next_index()
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 623, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 244, in __iter__
    sampler_iter = iter(self.sampler)
  File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 380, in __iter__
    g.manual_seed(self.seed + self.epoch // self.split_epoch)
AttributeError: 'DistributedSubEpochSampler' object has no attribute 'epoch'
```

adda1221 commented 1 year ago

How can I solve this?

ChrisLiu6 commented 1 year ago

Hi, it seems that your experiment was not launched in distributed mode. Specifically, the sampler's `epoch` attribute is expected to be set here:

https://github.com/OpenGVLab/LLaMA-Adapter/blob/95b638997765af15036266f5acb5a4dd44b8ae96/llama_adapter_v2_multimodal7b/main_pretrain.py#L169
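
To illustrate why the missing attribute is fatal, here is a minimal, self-contained sketch of the sampler's seeding logic (the class below is a simplified stand-in reconstructed from the traceback, not the repository's exact code):

```python
import torch
from torch.utils.data import Sampler

class SubEpochSampler(Sampler):
    """Simplified stand-in for DistributedSubEpochSampler (sketch)."""

    def __init__(self, dataset_len, seed=0, split_epoch=1):
        self.dataset_len = dataset_len
        self.seed = seed
        self.split_epoch = split_epoch
        # Note: self.epoch is deliberately NOT set here; it only
        # exists after set_epoch() has been called.

    def set_epoch(self, epoch):
        # The training loop calls this once per epoch, but only
        # when args.distributed is True.
        self.epoch = epoch

    def __iter__(self):
        g = torch.Generator()
        # Raises AttributeError if set_epoch() was never called:
        g.manual_seed(self.seed + self.epoch // self.split_epoch)
        return iter(torch.randperm(self.dataset_len, generator=g).tolist())

    def __len__(self):
        return self.dataset_len

sampler = SubEpochSampler(8)
# iter(sampler)  # raises: 'SubEpochSampler' object has no attribute 'epoch'
sampler.set_epoch(0)        # what main_pretrain.py does in distributed mode
print(list(iter(sampler)))  # now yields a seeded permutation of 0..7
```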

However, `args.distributed` seems to be `False` in your run. My guess is that you launched the script with something like `python main_pretrain.py ...`, so that line is never reached. Try launching with `torchrun` or another distributed launcher instead, as sketched below. The PyTorch documentation has a tutorial on this.
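
For instance, a single-node launch might look like the following (the GPU count is a placeholder; keep your own script arguments in place of `...`):

```bash
# torchrun sets the RANK / WORLD_SIZE / LOCAL_RANK environment variables,
# which the script's distributed setup reads so that args.distributed
# becomes True and sampler.set_epoch() gets called each epoch.
torchrun --nproc_per_node=8 main_pretrain.py ...
```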