NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Error while running program train.py from train2 #301

Closed Avi241 closed 1 year ago

Avi241 commented 1 year ago

I have my own custom dataset generated with nvisii and I want to train on it. I have set up a conda environment with all the packages in requirements.txt, but when I run the training script I get the following error:

```
FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects `--local-rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
Program started
start: 17:58:02.134134
usage: train.py [-h] [--data DATA [DATA ...]] [--datatest DATATEST [DATATEST ...]] [--testonly] [--testbatchsize TESTBATCHSIZE] [--freeze] [--test] [--debug] [--savetest] [--horovod] [--objects OBJECTS [OBJECTS ...]] [--optimizer OPTIMIZER] [--workers WORKERS] [--batchsize BATCHSIZE] [--imagesize IMAGESIZE] [--lr LR] [--noise NOISE] [--net NET] [--net_dope NET_DOPE] [--network NETWORK] [--namefile NAMEFILE] [--manualseed MANUALSEED] [--epochs EPOCHS] [--loginterval LOGINTERVAL] [--gpuids GPUIDS [GPUIDS ...]] [--extensions EXTENSIONS [EXTENSIONS ...]] [--outf OUTF] [--sigma SIGMA] [--keypoints KEYPOINTS] [--no_affinity NO_AFFINITY] [--datastyle DATASTYLE] [--save] [--verbose] [--dontsave] [--pretrained PRETRAINED] [--features FEATURES] [--datasize DATASIZE] [--nbupdates NBUPDATES] [--data1 DATA1] [--data2 DATA2] [--size1 SIZE1] [--size2 SIZE2] [--local_rank LOCAL_RANK] [--option OPTION]
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 396522) of binary: /home/tower/anaconda3/envs/train_dope_2/bin/python
Traceback (most recent call last):
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tower/anaconda3/envs/train_dope_2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-02_17:58:05
  host      : 01hw2317000
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 396522)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Please help if someone knows about this problem. Thank you.
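For readers hitting the same `unrecognized arguments: --local-rank=0` error: newer launchers pass `--local-rank` (hyphen), while older training scripts declare `--local_rank` (underscore). A minimal sketch of an argument parser that accepts both spellings and falls back to the `LOCAL_RANK` environment variable that `torchrun` sets (this is an illustration, not DOPE's actual `train.py` code):

```python
import argparse
import os

# Accept both the old "--local_rank" (torch.distributed.launch) and the
# new "--local-rank" (torchrun) spellings; default to the LOCAL_RANK
# environment variable when neither flag is passed.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--local_rank", "--local-rank",
    dest="local_rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)

args = parser.parse_args(["--local-rank=0"])
print(args.local_rank)  # 0
```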
TontonTremblay commented 1 year ago

This looks like it is due to PyTorch distributed. Which version of PyTorch are you using? Maybe install 1.4 and try again; I did not test this code with >= 2.0.

Avi241 commented 1 year ago

Yes, it is a PyTorch version issue. I was using 2.1.0, and with PyTorch >= 2.0 it was throwing that previous error. Then I created a new conda environment with PyTorch 1.6.0, but my GPU doesn't support it. When I run `torch.cuda.get_device_name()` it gives me a warning:

```
UserWarning: NVIDIA RTX A6000 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the NVIDIA RTX A6000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
'NVIDIA RTX A6000'
```

And when I run the training code it throws:

```
Variable._execution_engine.run_backward(
RuntimeError: no valid convolution algorithms available in CuDNN
```
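The `sm_86` warning means those older wheels were compiled without support for the Ampere architecture, which is also why cuDNN later finds no usable convolution algorithm. A simplified illustration of the idea behind the check, using the arch list printed in the warning (PyTorch's real check is more involved and also considers PTX forward compatibility):

```python
# Arch list copied from the warning above; the tuple mirrors what
# torch.cuda.get_device_capability() reports for an RTX A6000 (sm_86).
SUPPORTED_ARCHS = {"sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75"}

def build_supports(capability):
    """Return True if a wheel built for SUPPORTED_ARCHS covers this GPU."""
    major, minor = capability
    return f"sm_{major}{minor}" in SUPPORTED_ARCHS

print(build_supports((8, 6)))  # False: sm_86 is missing, hence the warning
print(build_supports((7, 5)))  # True: a Turing (sm_75) GPU would be fine
```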

Avi241 commented 1 year ago

The issue is solved. It needs a proper combination of PyTorch and CUDA. For me, it is working with PyTorch 1.7.0 and CUDA 11.0.
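For anyone pinning versions the same way: Ampere GPUs such as the RTX A6000 need a CUDA 11 wheel, and 1.7.0 was among the first releases to ship one. A tiny lookup over an assumed, partial table of official wheel builds (consult pytorch.org/get-started/previous-versions/ for the authoritative list):

```python
# Assumed, partial table of official wheel builds per PyTorch release;
# see https://pytorch.org/get-started/previous-versions/ for the real list.
WHEEL_BUILDS = {
    "1.6.0": ["cu92", "cu101", "cu102"],           # no CUDA 11 build -> no sm_86
    "1.7.0": ["cu92", "cu101", "cu102", "cu110"],  # cu110 covers Ampere
}

def has_cuda11_build(torch_version):
    """True if any official wheel for this release was built against CUDA 11."""
    return any(b.startswith("cu11") for b in WHEEL_BUILDS.get(torch_version, []))

print(has_cuda11_build("1.6.0"))  # False
print(has_cuda11_build("1.7.0"))  # True
```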

TontonTremblay commented 1 year ago

Thank you so much, I will add it to the readme, and hopefully sometime in the future I will update the code to work with the newest version of PyTorch.


Closed #301 https://github.com/NVlabs/Deep_Object_Pose/issues/301 as completed.


TontonTremblay commented 1 year ago

https://github.com/NVlabs/Deep_Object_Pose/tree/master/scripts#readme