Open HEasoner opened 1 month ago
It is hard to determine the exact error from the information provided. Can you try running on a single GPU in debug mode with the following command:
CUDA_LAUNCH_BLOCKING=1 python -u tools/train_detic.py --config-file projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml --num-gpus 1
@yhcao6
I get this report:
Traceback (most recent call last):
File "tools/train_detic.py", line 292, in
This looks like a dataset indexing error, but I have not encountered it before. In my experience, there are two possible solutions:
DATASET_BS: [8, 32]
https://github.com/V3Det/Detectron2-V3Det/blob/7005952ea3a9eafea901b482f7fc8289b43de1cb/projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml#L31
NUM_WORKERS: 1
https://github.com/V3Det/Detectron2-V3Det/blob/7005952ea3a9eafea901b482f7fc8289b43de1cb/projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml#L39
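For reference, here is a rough sketch of where these two keys would sit in the YAML config. It assumes the file follows the upstream Detic layout and nests both keys under DATALOADER; that nesting is an assumption, so check it against the linked lines. The values are copied from the comment above and may need tuning for your hardware.

```yaml
# Sketch only: nesting under DATALOADER is assumed from upstream Detic configs.
DATALOADER:
  # Per-dataset batch sizes, as quoted in the comment above.
  DATASET_BS: [8, 32]
  # Number of data-loading worker processes, as quoted in the comment above.
  NUM_WORKERS: 1
```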
When I run this command: python -u tools/train_detic.py --config-file projects/Detic/configs/ovd/Detic_V3Det-OVD-Base_IN_CLIP_R5021k_640b64_4x_ft4x_max-size.yaml --num-gpus 4
It works normally for a while, but at some point it reports an error:
Traceback (most recent call last):
File "train_detic.py", line 292, in
launch(
File "/root/autodl-tmp/detectron2/engine/launch.py", line 67, in launch
mp.spawn(
File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
root@autodl-container-52ea4dbaa0-6c92efdf:~/autodl-tmp# /root/miniconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 128 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
What am I supposed to do?