jawais opened this issue 10 months ago
Hi,
Could you please check whether your machine has two GPUs visible to PyTorch with this command:
import torch
print(torch.cuda.device_count()) # Number of CUDA devices
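If useful, a slightly extended check along the same lines (a minimal sketch, just building on the snippet above) also prints each device's name and memory, which makes it easier to confirm the two GPUs are really distinct devices:
import torch

# Sketch: list every CUDA device PyTorch can see, with name and memory.
print(torch.cuda.device_count())  # expected: 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024 ** 3, 1), "GiB")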
Hi,
I have two GPUs.
This is the output:
Hi,
We have created a Docker image at stevephan46/reminder:latest.
Use this command to create a Docker container with the environment pre-installed:
docker run --name reminder -it --gpus all --shm-size=4g stevephan46/reminder:latest /bin/bash
This will create a Docker container with the working environment required to run the code.
Could you please try setting up the container and running the script there?
Let me know if there's any issue.
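Once inside the container, a quick sanity check along these lines (a minimal sketch, assuming the image ships PyTorch with CUDA and the bundled NCCL) can confirm the toolchain before launching training:
import torch
import torch.distributed as dist

# Sketch: sanity-check the container's GPU and NCCL setup before training.
print(torch.__version__, torch.version.cuda)   # PyTorch build and its CUDA version
print(torch.cuda.device_count())               # should report 2 with --gpus all
print(torch.cuda.nccl.version())               # NCCL bundled with this PyTorch build
print(dist.is_nccl_available())                # True if the NCCL backend is usable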
(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$ /bin/bash /media/tao/新加卷/Osman/REMINDER-main/train_voc_19-1.sh
voc_19-1_REMINDER On GPUs 0,1
Writing in results/2023-12-24_voc_19-1_REMINDER.csv
Begin training!
Begin training!
Learning for 1 with lrs=[0.01].
Learning for 1 with lrs=[0.01].
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use PyTorch AMP
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Traceback (most recent call last):
File "run.py", line 587, in <module>
main(opts)
File "run.py", line 158, in main
Traceback (most recent call last):
File "run.py", line 587, in
val_score = run_step(opts, world_size, rank, device)
File "run.py", line 299, in run_step
model = DistributedDataParallel(model, delay_allreduce=True)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in init
main(opts)
File "run.py", line 158, in main
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
val_score = run_step(opts, world_size, rank, device)
File "run.py", line 299, in run_step
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
model = DistributedDataParallel(model, delay_allreduce=True)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in init
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 65000
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 65000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15679) of binary: /home/tao/anaconda3/envs/pytorch/bin/python3
Traceback (most recent call last):
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
Failures:
[1]:
  time       : 2023-12-24_19:50:32
  host       : tao
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 15680)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2023-12-24_19:50:32
  host       : tao
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 15679)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$
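For reference, the "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..." NCCL error usually means both launched processes ended up on the same CUDA device before the first collective. The usual pattern under torchrun / torch.distributed.launch is for each process to pin itself to its LOCAL_RANK before the process group (or any apex/DDP wrapper) is created. A minimal sketch of that pattern, not the repository's run.py but just an illustration with a placeholder model, looks like this:
import os

import torch
import torch.distributed as dist

# Sketch: per-process device pinning under torchrun (2 processes, 2 GPUs).
# Each local rank must bind to its own GPU *before* the first NCCL collective;
# if both ranks stay on the default device, NCCL reports "Duplicate GPU detected".
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)              # rank 0 -> cuda:0, rank 1 -> cuda:1
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)
model = torch.nn.Linear(8, 8).to(device)       # placeholder model, for illustration only
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
The same ordering applies when apex.parallel.DistributedDataParallel is used, as in the log above: torch.cuda.set_device(local_rank) has to run before the wrapper broadcasts the parameters.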