HieuPhan33 / REMINDER

Class Similarity Weighted Knowledge Distillation for Continual Semantic Segmentation
GNU General Public License v3.0

How to address this issue? #11

Open jawais opened 10 months ago

jawais commented 10 months ago

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$ /bin/bash /media/tao/新加卷/Osman/REMINDER-main/train_voc_19-1.sh voc_19-1_REMINDER
On GPUs 0,1
Writing in results/2023-12-24_voc_19-1_REMINDER.csv
Begin training!
Begin training!
Learning for 1 with lrs=[0.01].
Learning for 1 with lrs=[0.01].
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use PyTorch AMP
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Traceback (most recent call last):
  File "run.py", line 587, in <module>
    main(opts)
  File "run.py", line 158, in main
Traceback (most recent call last):
  File "run.py", line 587, in <module>
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 299, in run_step
    model = DistributedDataParallel(model, delay_allreduce=True)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in __init__
    main(opts)
  File "run.py", line 158, in main
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
    val_score = run_step(opts, world_size, rank, device)
  File "run.py", line 299, in run_step
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    model = DistributedDataParallel(model, delay_allreduce=True)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 257, in __init__
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 65000
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 77, in flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/parallel/distributed.py", line 43, in apply_flat_dist_call
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552411/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 65000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15679) of binary: /home/tao/anaconda3/envs/pytorch/bin/python3
Traceback (most recent call last):
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tao/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
[1]:
  time      : 2023-12-24_19:50:32
  host      : tao
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 15680)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-12-24_19:50:32
  host      : tao
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15679)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(pytorch) tao@tao:/media/tao/新加卷/Osman/REMINDER-main$
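
For context, the NCCL message "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..." means both worker processes bound to the same GPU before the first broadcast. Under torchrun, each worker is normally pinned to its own device from the LOCAL_RANK environment variable before the process group is used; the snippet below is only a generic illustration of that pattern, not code taken from REMINDER.

import os
import torch
import torch.distributed as dist

# Generic torchrun pattern: pin each worker to its own GPU before any
# NCCL collective runs, so rank 0 uses cuda:0 and rank 1 uses cuda:1.
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device("cuda", local_rank)
print(f"rank {dist.get_rank()} -> {torch.cuda.get_device_name(device)}")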

HieuPhan33 commented 10 months ago

Hi,

Could you please check whether two GPUs are visible on your machine with this snippet:

import torch

print(torch.cuda.device_count())  # Number of CUDA devices
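
If it helps, the same check can also list each visible device by name to confirm that two distinct GPUs are exposed to PyTorch (an illustrative extension, not part of the snippet above):

import torch

# Print an index and name for every CUDA device PyTorch can see
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
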
jawais commented 10 months ago

Hi,

I have two GPUs.

This is the output:

[screenshot of the torch.cuda.device_count() output]

HieuPhan33 commented 10 months ago

Hi,

We have created a Docker image at stevephan46/reminder:latest. Use this command to start a Docker container with the environment pre-installed:

docker run --name reminder -it --gpus all --shm-size=4g stevephan46/reminder:latest /bin/bash

This creates a Docker container with the working environment required to run the code.

Could you please try setting up the docker container and running the script?

Let me know if there's any issue.