jawais opened this issue 10 months ago
Hi,
Could you please check whether your machine has two GPUs visible to PyTorch with this command:
import torch
print(torch.cuda.device_count()) # Number of CUDA devices
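If useful, a slightly extended check along the same lines (a minimal sketch, just building on the snippet above) also prints each device's name and memory, which makes it easier to confirm the two GPUs are really distinct devices:
import torch

# Sketch: list every CUDA device PyTorch can see, with name and memory.
print(torch.cuda.device_count())  # expected: 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024 ** 3, 1), "GiB")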
Hi,
I have two GPUs.
This is the output:
Hi,
We have created a Docker image at stevephan46/reminder:latest.
Use this command to create a Docker container with the environment pre-installed:
docker run --name reminder -it --gpus all --shm-size=4g stevephan46/reminder:latest /bin/bash
This will create a Docker container with the working environment required to run the code.
Could you please try setting up the container and running the script there?
Let me know if there's any issue.
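Once inside the container, a quick sanity check along these lines (a minimal sketch, assuming the image ships PyTorch with CUDA and the bundled NCCL) can confirm the toolchain before launching training:
import torch
import torch.distributed as dist

# Sketch: sanity-check the container's GPU and NCCL setup before training.
print(torch.__version__, torch.version.cuda)   # PyTorch build and its CUDA version
print(torch.cuda.device_count())               # should report 2 with --gpus all
print(torch.cuda.nccl.version())               # NCCL bundled with this PyTorch build
print(dist.is_nccl_available())                # True if the NCCL backend is usable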
(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$ /bin/bash /media/tao/新加卷/Osman/REMINDER-main/train_voc_19-1.sh
voc_19-1_REMINDER On GPUs 0,1
Writing in results/2023-12-24_voc_19-1_REMINDER.csv
Begin training!
Begin training!
Learning for 1 with lrs=[0.01].
Learning for 1 with lrs=[0.01].
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use PyTorch AMP
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Traceback (most recent call last):
File "run.py", line 587, in <module>
main(opts)
File "run.py", line 158, in main
Traceback (most recent call last):
File "run.py", line 587, in
val_score = run_step(opts, world_size, rank, device)
File "run.py", line 299, in run_step
model = DistributedDataParallel(model, delay_allreduce=True)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in init
main(opts)
File "run.py", line 158, in main
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
val_score = run_step(opts, world_size, rank, device)
File "run.py", line 299, in run_step
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
model = DistributedDataParallel(model, delay_allreduce=True)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in init
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 65000
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 65000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15679) of binary: /home/tao/anaconda3/envs/pytorch/bin/python3
Traceback (most recent call last):
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
Failures:
[1]:
  time       : 2023-12-24_19:50:32
  host       : tao
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 15680)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2023-12-24_19:50:32
  host       : tao
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 15679)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$
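For reference, the "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..." NCCL error usually means both launched processes ended up on the same CUDA device before the first collective. The usual pattern under torchrun / torch.distributed.launch is for each process to pin itself to its LOCAL_RANK before the process group (or any apex/DDP wrapper) is created. A minimal sketch of that pattern, not the repository's run.py but just an illustration with a placeholder model, looks like this:
import os

import torch
import torch.distributed as dist

# Sketch: per-process device pinning under torchrun (2 processes, 2 GPUs).
# Each local rank must bind to its own GPU *before* the first NCCL collective;
# if both ranks stay on the default device, NCCL reports "Duplicate GPU detected".
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)              # rank 0 -> cuda:0, rank 1 -> cuda:1
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)
model = torch.nn.Linear(8, 8).to(device)       # placeholder model, for illustration only
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
The same ordering applies when apex.parallel.DistributedDataParallel is used, as in the log above: torch.cuda.set_device(local_rank) has to run before the wrapper broadcasts the parameters.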