V3Det / Detectron2-V3Det

Detectron2 Toolbox and Benchmark for V3Det
Apache License 2.0

Step 3 error #7

Open HEasoner opened 1 month ago

HEasoner commented 1 month ago

When I run this command: python -u tools/train_detic.py --config-file projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml --num-gpus 4

It works normally for a while, but at some point it reports an error:

Traceback (most recent call last):
  File "train_detic.py", line 292, in <module>
    launch(
  File "/root/autodl-tmp/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
root@autodl-container-52ea4dbaa0-6c92efdf:~/autodl-tmp# /root/miniconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 128 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

What am I supposed to do?

yhcao6 commented 1 month ago

It is hard to pinpoint the exact error from this information. Can you try running on a single GPU in debug mode with the following command:

CUDA_LAUNCH_BLOCKING=1 python -u tools/train_detic.py --config-file projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml --num-gpus 1

HEasoner commented 1 month ago

@yhcao6 I get this report:

Traceback (most recent call last):
  File "tools/train_detic.py", line 292, in <module>
    launch(
  File "/root/autodl-tmp/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_detic.py", line 271, in main
    do_train(cfg, model, resume=args.resume)
  File "tools/train_detic.py", line 172, in do_train
    for data, iteration in zip(data_loader, range(start_iter, max_iter)):
  File "/root/autodl-tmp/projects/Detic/detic/data/custom_dataset_dataloader.py", line 297, in __iter__
    for d in self.dataset:
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1313, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/autodl-tmp/detectron2/data/common.py", line 90, in __getitem__
    data = self._map_func(self._dataset[cur_idx])
  File "/root/autodl-tmp/detectron2/data/common.py", line 154, in __getitem__
    start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
IndexError: index 139894584226735 is out of bounds for axis 0 with size 537492
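
The last frame of the traceback fails on the lookup `self._addr[idx - 1]`: the index passed to it is 139894584226735, while the offset array holds only 537492 entries. The sketch below is a toy illustration of that addressing scheme, not Detectron2's code. The class `SerializedList` and its fields are hypothetical stand-ins, and the scheme (one byte buffer plus cumulative end offsets) is assumed from the `start_addr` line in the traceback. It shows where an explicit bounds check could report a bad index before the offset lookup.

```python
import pickle
from itertools import accumulate


class SerializedList:
    """Toy stand-in (hypothetical): items pickled into one buffer, indexed by end offsets."""

    def __init__(self, items):
        blobs = [pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL) for x in items]
        # _addr[i] is the end offset of item i inside the shared buffer.
        self._addr = list(accumulate(len(b) for b in blobs))
        self._buf = b"".join(blobs)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        # Fail early with a readable message if the index is already invalid.
        if not 0 <= idx < len(self):
            raise IndexError(f"index {idx} is outside [0, {len(self)})")
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        return pickle.loads(self._buf[start:end])


data = SerializedList([{"file_name": "a.jpg"}, {"file_name": "b.jpg"}])
print(data[1])

try:
    data[139894584226735]  # the value reported in the traceback
except IndexError as err:
    print("caught:", err)
```

A check like this does not fix the cause. It only makes the failure point to the index that was requested rather than to the offset array.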

yhcao6 commented 1 month ago

It looks like a dataset indexing error, but I haven't met this error before. In my experience there are two possible solutions:

  1. Decrease the batch size: change this line to DATASET_BS: [8, 32] https://github.com/V3Det/Detectron2-V3Det/blob/7005952ea3a9eafea901b482f7fc8289b43de1cb/projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml#L31
  2. Decrease the number of dataloader workers: change this line to NUM_WORKERS: 1 https://github.com/V3Det/Detectron2-V3Det/blob/7005952ea3a9eafea901b482f7fc8289b43de1cb/projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml#L39
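
Both changes can also be tried without editing the YAML file, under two assumptions that are not confirmed in this thread: that `tools/train_detic.py` accepts trailing `KEY VALUE` overrides the way scripts built on Detectron2's default argument parser do, and that both keys sit under `DATALOADER` as in upstream Detic configs. The helper `build_command` below is hypothetical and only assembles the command line. Check the script's `--help` output before relying on it.

```python
import shlex

CONFIG = (
    "projects/Detic/configs/ovd/"
    "Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml"
)


def build_command(num_gpus=1, overrides=None):
    """Hypothetical helper: training command plus trailing KEY VALUE overrides."""
    cmd = [
        "python", "-u", "tools/train_detic.py",
        "--config-file", CONFIG,
        "--num-gpus", str(num_gpus),
    ]
    # Assumption: the script forwards trailing arguments to cfg.merge_from_list.
    for key, value in (overrides or {}).items():
        cmd += [key, str(value)]
    return cmd


cmd = build_command(
    num_gpus=1,
    overrides={
        "DATALOADER.DATASET_BS": [8, 32],   # assumed key path
        "DATALOADER.NUM_WORKERS": 1,        # assumed key path
    },
)
print(shlex.join(cmd))  # shell-quoted command, ready to paste into a terminal
```

Trying one override at a time makes it easier to tell which setting, if either, makes the error go away.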