Vision-CAIR / MiniGPT-4

Open-source code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
BSD 3-Clause "New" or "Revised" License

Finetune fails with `data = next(self.iter_loader)` StopIteration #342

Open 631068264 opened 1 year ago

631068264 commented 1 year ago

Describe the bug: I followed this doc to prepare the finetune data.

From cc_sbu_align.zip I kept only images 2 to 14 (JPG) for training and translated their captions to Chinese.
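
Roughly what I did to build the tiny subset (a sketch only; it assumes the usual cc_sbu_align layout with a filter_cap.json holding an "annotations" list and an image/ folder of <image_id>.jpg files — adjust paths and keys if your copy differs):

```python
import json
import shutil
from pathlib import Path

SRC = Path("cc_sbu_align")            # unzipped original dataset
DST = Path("cc_sbu_align_small")      # tiny subset used for finetuning
KEEP = {str(i) for i in range(2, 15)}  # image ids 2..14

(DST / "image").mkdir(parents=True, exist_ok=True)

# Load the original annotations and keep only the selected image ids.
with open(SRC / "filter_cap.json", encoding="utf-8") as f:
    anns = json.load(f)["annotations"]

subset = []
for ann in anns:
    if ann["image_id"] in KEEP:
        # Captions were then translated to Chinese by hand before training.
        subset.append(ann)
        shutil.copy(SRC / "image" / f'{ann["image_id"]}.jpg', DST / "image")

# Write the reduced annotation file (ensure_ascii=False keeps Chinese text readable).
with open(DST / "filter_cap.json", "w", encoding="utf-8") as f:
    json.dump({"annotations": subset}, f, ensure_ascii=False, indent=2)
```
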

Error log:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
| distributed init (rank 0, world 1): env://
2023-09-04 15:52:12,966 [INFO] 
=====  Running Parameters    =====
2023-09-04 15:52:12,966 [INFO] {
    "amp": true,
    "batch_size_eval": 12,
    "batch_size_train": 12,
    "device": "cuda",
    "dist_backend": "nccl",
    "dist_url": "env://",
    "distributed": true,
    "evaluate": false,
    "gpu": 0,
    "init_lr": 3e-05,
    "iters_per_epoch": 200,
    "lr_sched": "linear_warmup_cosine_lr",
    "max_epoch": 5,
    "min_lr": 1e-05,
    "num_workers": 4,
    "output_dir": "output/minigpt4_stage2_finetune",
    "rank": 0,
    "resume_ckpt_path": null,
    "seed": 42,
    "task": "image_text_pretrain",
    "train_splits": [
        "train"
    ],
    "warmup_lr": 1e-06,
    "warmup_steps": 200,
    "weight_decay": 0.05,
    "world_size": 1
}
2023-09-04 15:52:12,966 [INFO] 
======  Dataset Attributes  ======
2023-09-04 15:52:12,967 [INFO] 
======== cc_sbu_align =======
2023-09-04 15:52:12,967 [INFO] {
    "build_info": {
        "storage": "/data/home/yaokj5/dl/apps/MiniGPT-4/cc_sbu_align"
    },
    "data_type": "images",
    "text_processor": {
        "train": {
            "name": "blip_caption"
        }
    },
    "vis_processor": {
        "train": {
            "image_size": 224,
            "name": "blip2_image_train"
        }
    }
}
2023-09-04 15:52:12,967 [INFO] 
======  Model Attributes  ======
2023-09-04 15:52:12,967 [INFO] {
    "arch": "mini_gpt4",
    "ckpt": "/data/home/yaokj5/dl/apps/MiniGPT-4/ckpt/pretrained_minigpt4_llama2_7b.pth",
    "drop_path_rate": 0,
    "end_sym": "</s>",
    "freeze_vit": true,
    "has_qformer": false,
    "image_size": 224,
    "llama_model": "/data/home/yaokj5/dl/models/Llama-2-7b-chat-hf",
    "max_txt_len": 160,
    "model_type": "pretrain_llama2",
    "prompt": "",
    "prompt_path": "prompts/alignment.txt",
    "prompt_template": "[INST] {} [/INST] ",
    "use_grad_checkpoint": false,
    "vit_precision": "fp16"
}
2023-09-04 15:52:12,967 [INFO] Building datasets...
Loading VIT
2023-09-04 15:52:35,566 [INFO] freeze vision encoder
Loading VIT Done
Do not use Q-Former here.
Loading LLAMA
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.62s/it]
Loading LLAMA Done
Load 4 training prompts
Prompt Example 
[INST] <Img><ImageHere></Img> Describe this image in detail. [/INST] 
Load BLIP2-LLM Checkpoint: /data/home/yaokj5/dl/apps/MiniGPT-4/ckpt/pretrained_minigpt4_llama2_7b.pth
2023-09-04 15:54:50,423 [INFO] Start training
2023-09-04 15:54:56,278 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2023-09-04 15:54:56,278 [INFO] Loaded 10 records for train split from the dataset.
module.llama_proj.weight
module.llama_proj.bias
2023-09-04 15:54:56,287 [INFO] number of trainable parameters: 23072768
2023-09-04 15:54:56,287 [INFO] Start training epoch 0, 200 iters per inner epoch.
Traceback (most recent call last):
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/datasets/datasets/dataloader_utils.py", line 147, in __next__
    data = next(self.iter_loader)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/train.py", line 103, in <module>
    main()
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/train.py", line 99, in main
    runner.train()
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/runners/runner_base.py", line 378, in train
    train_stats = self.train_epoch(cur_epoch)
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/runners/runner_base.py", line 438, in train_epoch
    return self.task.train_epoch(
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/tasks/base_task.py", line 114, in train_epoch
    return self._train_inner_loop(
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/tasks/base_task.py", line 205, in _train_inner_loop
    samples = next(data_loader)
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/datasets/datasets/dataloader_utils.py", line 43, in __next__
    return next(self.loaders[loader_idx])
  File "/data/home/yaokj5/dl/apps/MiniGPT-4/minigpt4/datasets/datasets/dataloader_utils.py", line 154, in __next__
    data = next(self.iter_loader)
StopIteration
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 25566) of binary: /data/home/yaokj5/anaconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/home/yaokj5/anaconda3/envs/minigpt4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-04_15:55:01
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 25566)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
1429904852 commented 10 months ago

Hi 631068264, how did you solve this bug?

ChouGS commented 7 months ago

Hi all,

I had the same issue and eventually pinpointed the problem to one of my custom datasets. The __getitem__ function of my custom dataset class reads a piece of shared content from the SAME file as part of the prompt, which looks like this:

with open('path/to/a/single/system/message/file', 'r') as f:
    instruction = f.read()

This read operation can cause races and deadlocks under high concurrency (e.g. multiple dataloader workers), which is one possible cause of this issue. Writing the content as a static string instead of reading it from a file solved my problem. Hope it helps :)
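
For reference, a minimal sketch of that change (the class and field names here are placeholders, not the actual MiniGPT-4 dataset classes): read the shared prompt once at module load (or in __init__), so no file is opened inside __getitem__.

```python
from torch.utils.data import Dataset

# Read the shared system message once, instead of on every __getitem__ call.
# (Or simply inline it as a string literal, as described above.)
with open("path/to/a/single/system/message/file", "r") as f:
    INSTRUCTION = f.read()

class CustomDataset(Dataset):
    """Placeholder dataset; `samples` is a list of dicts prepared elsewhere."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # No file I/O here, so concurrent dataloader workers never contend on the file.
        return {"instruction": INSTRUCTION, **sample}
```

Since the content never changes between items, loading it once has the same effect as hard-coding it, and it keeps per-item work free of filesystem access.
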

ChouGS commented 7 months ago

On further exploration I could reliably reproduce this bug when using the ok_vqa dataset configured here. Simply comment out that block to exclude the dataset from finetuning and everything should be OK.