My setup has 8 Titan X GPUs. When I try to set --ref 32, I get this error:
/var/spool/slurm/slurmd/job86812/slurm_script: line 50: $benchmarch_logs: ambiguous redirect
Traceback (most recent call last):
  File "/home/mu480317/ODISE/./tools/train_net.py", line 392, in <module>
    launch(
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/home/mu480317/ODISE/tools/train_net.py", line 319, in main
    cfg = auto_scale_workers(cfg, comm.get_world_size())
  File "/home/mu480317/ODISE/odise/config/utils.py", line 65, in auto_scale_workers
    assert cfg.dataloader.train.total_batch_size % old_world_size == 0, (
AssertionError: Invalid reference_world_size in config! 8 % 32 != 0
With --ref 8, the GPU memory overflows instead.
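If I read the assertion in odise/config/utils.py correctly, auto_scale_workers requires the configured total_batch_size to be divisible by the reference world size (the --ref value) before it rescales the config to my actual world size. A minimal sketch of the arithmetic as I understand it (the standalone variables below are my own simplification based on the traceback, not the actual ODISE code):

# Sketch of the check that seems to fail; names are my simplification
# of the cfg fields shown in the traceback, not copied from the source.
total_batch_size = 8        # cfg.dataloader.train.total_batch_size in my config
reference_world_size = 32   # the value I pass with --ref
world_size = 8              # my 8 Titan X GPUs

# auto_scale_workers appears to require this before rescaling to world_size:
assert total_batch_size % reference_world_size == 0, (
    f"Invalid reference_world_size in config! "
    f"{total_batch_size} % {reference_world_size} != 0"
)

So with my values the assertion fires before any scaling happens, while --ref 8 passes the check (8 % 8 == 0) but then, as far as I can tell, nothing gets scaled down and training runs out of GPU memory as described above.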
Please help me solve this. Thank you.