ShiqiYu / OpenGait

A flexible and extensible framework for gait recognition. With OpenGait you can focus on designing your own models and easily compare them with state-of-the-art methods.
665 stars 154 forks

Problem with Training #150

Closed CKPPY closed 2 months ago

CKPPY commented 10 months ago

Traceback (most recent call last):
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dt/anaconda3/envs/opengait0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

opengait/main.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-16_15:24:09
  host      : dt-System-Product-Name
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 2828)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

CKPPY commented 10 months ago

Solved

HADDYIZE commented 10 months ago

Hello, I have also encountered this issue recently. May I ask how you resolved it? Thank you.

HADDYIZE commented 10 months ago

Solved

May I ask how you resolved this issue? Thank you!

CKPPY commented 10 months ago

Could you show me your complete error message?

CKPPY commented 10 months ago

Could you show me your complete error message?

HADDYIZE commented 10 months ago

Training runs fine every time, but testing raises an error and the process gets killed:

[2023-08-20 22:05:52] [INFO]: Model Initialization Finished!
Transforming:  98%|██████████▉ | 130956/133857 [25:05<00:41, 70.12it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3680843 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3680845 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3680846 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3680844) of binary: /home/ahu/anaconda3/envs/gait/bin/python
Traceback (most recent call last):
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ahu/anaconda3/envs/gait/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

opengait/main.py FAILED

Failures:
  <NO_OTHER_FAILURES>

--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-20_22:31:01
  host      : ahu-SYS-7048GR-TR
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 3680844)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3680844

CKPPY commented 10 months ago

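My error occurred during training, so we are not hitting the same error. It is just that every error is accompanied by a ChildFailedError; the screenshot I posted in the issue was incomplete.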

HADDYIZE commented 10 months ago

OK, thank you very much!

CKPPY commented 10 months ago

Glad to help. You could look up solutions for the errors and warnings in the first half of your error output.

CKPPY commented 9 months ago

Any error in the code surfaces as a child-process failure (ChildFailedError), and the screenshot I posted in the issue was incomplete. Each case has to be solved based on its own specific error.

In reply to: "I ran into a problem similar to yours while training. May I ask how you solved it?"
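
Since the launcher's ChildFailedError report is the same wrapper for every failure, the exit code in the "Root Cause" block is one of the few hints it carries. The following is a minimal sketch, not part of OpenGait or PyTorch, of how the exit codes seen in this thread (2, -9, 1) could be read. The helper name `describe_exit_code` is made up for illustration. A negative code means the worker was terminated by a signal; for -9 (SIGKILL) a common cause on Linux is the out-of-memory killer, but nothing in the logs above confirms that.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a rough reading of a worker exit code from the launcher's failure report.

    Hypothetical helper for illustration; it is not provided by PyTorch or OpenGait.
    """
    if code == 0:
        return "exited normally"
    if code < 0:
        # A negative exit code means the process was terminated by signal number -code.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    # A positive code comes from the worker itself; its own traceback is
    # printed above the launcher's ChildFailedError traceback.
    return f"exited with status {code}; check the worker's own output above the launcher traceback"


# The three exit codes reported in this thread.
for code in (2, -9, 1):
    print(code, "->", describe_exit_code(code))
```

In every case the real cause has to be found in the worker's own output that precedes the launcher traceback.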

EshaalFatima01 commented 9 months ago

Hello, please help me solve this error on Google Colab:

/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/usr/local/bin/python: can't open file 'opengait/main.py': [Errno 2] No such file or directory
/usr/local/bin/python: can't open file 'opengait/main.py': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 21458) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

opengait/main.py FAILED

Failures:
[1]:
  time      : 2023-09-18_06:53:26
  host      : a7867dc7575a
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 21459)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-09-18_06:53:26
  host      : a7867dc7575a
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 21458)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
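
The line "can't open file 'opengait/main.py': [Errno 2] No such file or directory" in this log says the workers could not find the entry script relative to the current working directory. Below is a minimal sketch of a pre-launch check, assuming the command is meant to be run from the root of the cloned repository. The helper name `check_entry_script` is made up for illustration and is not part of OpenGait.

```python
import os


def check_entry_script(script="opengait/main.py", cwd=None):
    """Return (found, full_path) for the entry script relative to a working directory.

    Hypothetical helper for illustration; the default path is the one from the log.
    """
    cwd = os.getcwd() if cwd is None else cwd
    full_path = os.path.join(cwd, script)
    return os.path.isfile(full_path), full_path


found, full_path = check_entry_script()
if found:
    print("Entry script found:", full_path)
else:
    # The launcher passes the relative path to each worker unchanged,
    # so it must resolve from the directory the command is started in.
    print("Entry script not found:", full_path)
    print("Change into the repository root before launching.")
```

In a notebook the working directory is usually not the cloned repository, so changing directory first (for example with the IPython `%cd` magic, pointed at wherever the repository was cloned) may be all that is needed.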

EshaalFatima01 commented 9 months ago

When I run this code on my laptop, I get this error:

[W C:\b\abs_abjetg6_iu\croot\pytorch_1686932924616\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [LENOVO-PC]:23456 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "main.py", line 62, in <module>
    torch.distributed.init_process_group(backend='gloo', init_method='env://')
  File "D:\Anaconda\envs\OpenGait\lib\site-packages\torch\distributed\distributed_c10d.py", line 900, in init_process_group
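
With init_method='env://', init_process_group reads the rendezvous settings from the environment variables MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE. The log shows the client trying to reach the machine by its host name ([LENOVO-PC]:23456). The following is a minimal sketch, under the assumption that the run is a single process on one machine, of pointing those variables at the loopback address instead. The helper name and the port value are illustrative, and whether this removes the Windows error above has not been verified.

```python
import os


def configure_single_process_env(env, addr="127.0.0.1", port=29500):
    """Fill in the variables that the env:// init method reads, for one local process.

    Hypothetical helper for illustration; `addr` and `port` are assumed values.
    """
    env["MASTER_ADDR"] = addr       # loopback address instead of the machine's host name
    env["MASTER_PORT"] = str(port)  # any free TCP port
    env["RANK"] = "0"               # this is the only process
    env["WORLD_SIZE"] = "1"         # total number of processes
    return env


# Demonstrated on a plain dict; pass os.environ to affect the current process.
settings = configure_single_process_env({})
print(settings)

# After configure_single_process_env(os.environ), the call from the traceback
# would then pick these values up:
# torch.distributed.init_process_group(backend='gloo', init_method='env://')
```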

Hjc7719 commented 6 months ago

In reply to EshaalFatima01's laptop error above (the c10d client socket failing to connect to [LENOVO-PC]:23456 with system error 10049, raised from init_process_group):

Following a suggestion from ChatGPT, I solved this problem as follows: create a .py file that launches the training command through subprocess.run.

    import subprocess

    def main():
        # Same command as on the command line, one list item per argument
        distributed_command = [
            "python", "-m", "torch.distributed.launch",
            "--nproc_per_node=2",
            "opengait/main.py",
            "--cfgs", "configs/DGv2/DeepGaitV2P3D_gait3d.yaml",
            "--phase", "train",
        ]

        # Execute the command; check=True raises CalledProcessError on a non-zero exit
        subprocess.run(distributed_command, check=True)

    if __name__ == '__main__':
        main()

Hjc7719 commented 6 months ago

In reply to EshaalFatima01's Google Colab error above (ChildFailedError with exitcode 2):

Try adding the following line to the argument-parsing section of main.py:

    parser.add_argument('--local-rank', type=str, default='0', help='added by me based on the error message')
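
The FutureWarning quoted in this thread recommends reading the local rank from os.environ['LOCAL_RANK'] rather than from a command-line argument. As a minimal sketch of combining both approaches, the parser below accepts the flag with either an underscore or a hyphen and falls back to the environment variable. The function names are made up for illustration, and this is not how OpenGait's main.py is actually written.

```python
import argparse
import os


def build_parser():
    """Parser that accepts the local rank under both spellings (illustrative only)."""
    parser = argparse.ArgumentParser(description="local-rank handling sketch")
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank", type=int, default=None,
        help="local rank passed on the command line by the launcher, if any",
    )
    return parser


def resolve_local_rank(argv=None, environ=None):
    """Prefer the command-line value, then LOCAL_RANK from the environment, then 0."""
    environ = os.environ if environ is None else environ
    # parse_known_args ignores other options such as --cfgs and --phase.
    args, _unknown = build_parser().parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(environ.get("LOCAL_RANK", 0))


print(resolve_local_rank(["--local-rank=1", "--phase", "train"], {}))
print(resolve_local_rank(["--phase", "train"], {"LOCAL_RANK": "3"}))
```

Using type=int keeps the value usable as a device index; the str default in the suggestion above would need converting before such use.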

Hjc7719 commented 6 months ago

Does anyone have the same issue as me? When using DGv2_P3D to train on the silhouette images of Gait3D, the model converges well on the training set, but rank-1 and mAP are very low on the test set.

Hjc7719 commented 5 months ago

You could try reinstalling PyTorch, or refer to "partially initialized module 'subprocess' has no attribute 'check_output'": https://mbd.baidu.com/ma/s/F2Cg5tla

In reply to the following message:

@Hjc7719 Hello, mine also fails during training. The full error output is as follows:

~/code/OpenGait$ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 opengait/main.py --cfgs ./configs/baseline/baseline.yaml --phase train
/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 619, in _syscmd_uname
    output = subprocess.check_output(('uname', option),
AttributeError: module 'subprocess' has no attribute 'check_output'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "opengait/main.py", line 4, in <module>
    import torch
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 198, in <module>
    _load_global_deps()
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 144, in _load_global_deps
    if platform.system() == 'Windows' or sys.executable == 'torch_deploy':
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 897, in system
    return uname().system
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 863, in uname
    processor = _syscmd_uname('-p', '')
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 622, in _syscmd_uname
    except (OSError, subprocess.CalledProcessError):
AttributeError: module 'subprocess' has no attribute 'CalledProcessError'
(the same pair of tracebacks is printed a second time)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 202107) of binary: /home/abc/anaconda3/envs/OpenGait3.8/bin/python
Traceback (most recent call last):
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
opengait/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-28_14:59:19
  host      : dell-PowerEdge-R740
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 202108)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-28_14:59:19
  host      : dell-PowerEdge-R740
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 202107)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
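
An AttributeError such as "module 'subprocess' has no attribute 'check_output'" often means that Python imported some other module named subprocess (for example a subprocess.py in the project directory) instead of the standard-library one, or that the module was only partially initialized by a circular import. This is a common cause in general, not something the log above confirms. The sketch below shows one way to check where the module was loaded from; the helper name `loaded_from_stdlib` is made up for illustration.

```python
import os
import subprocess
import sysconfig


def loaded_from_stdlib(module) -> bool:
    """Return True if the module's file lives under the interpreter's stdlib directory.

    Hypothetical helper for illustration.
    """
    stdlib_dir = os.path.normcase(os.path.realpath(sysconfig.get_paths()["stdlib"]))
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        return False
    module_path = os.path.normcase(os.path.realpath(module_file))
    return module_path.startswith(stdlib_dir + os.sep)


# Where did `import subprocess` actually come from?
print("subprocess loaded from:", subprocess.__file__)
print("inside the standard library:", loaded_from_stdlib(subprocess))
print("has check_output:", hasattr(subprocess, "check_output"))
```

If the printed path points into the project or working directory, renaming that file and removing its cached bytecode would be the first thing to try.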

miomiora commented 5 months ago

Thanks for the reply. I just got it working, although I don't know which change made it run. It is training now.

Hjc7719 commented 5 months ago

You're welcome, though I wasn't able to help much. I hope you get the results you are looking for.

---Original--- From: @.> Date: Sun, Jan 28, 2024 17:17 PM To: @.>; Cc: @.**@.>; Subject: Re: [ShiqiYu/OpenGait] Problem with Trainning (Issue #150)

可以考虑重装pytorch或是参考【partially initialized module ‘subprocess‘ has no attribute ‘check_output‘】https://mbd.baidu.com/ma/s/F2Cg5tla … ---Original--- From: @.> Date: Sun, Jan 28, 2024 15:08 PM To: @.>; Cc: @.@.>; Subject: Re: [ShiqiYu/OpenGait] Problem with Trainning (Issue #150) @Hjc7719 你好,我的也是在train的时候报错,完整报错内容如下 ~/code/OpenGait$ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 opengait/main.py --cfgs ./configs/baseline/baseline.yaml --phase train /home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions warnings.warn( WARNING:torch.distributed.run: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
Traceback (most recent call last):
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 619, in _syscmd_uname
    output = subprocess.check_output(('uname', option),
AttributeError: module 'subprocess' has no attribute 'check_output'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "opengait/main.py", line 4, in <module>
    import torch
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 198, in <module>
    _load_global_deps()
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 144, in _load_global_deps
    if platform.system() == 'Windows' or sys.executable == 'torch_deploy':
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 897, in system
    return uname().system
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 863, in uname
    processor = _syscmd_uname('-p', '')
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 622, in _syscmd_uname
    except (OSError, subprocess.CalledProcessError):
AttributeError: module 'subprocess' has no attribute 'CalledProcessError'
Traceback (most recent call last):
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 619, in _syscmd_uname
    output = subprocess.check_output(('uname', option),
AttributeError: module 'subprocess' has no attribute 'check_output'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "opengait/main.py", line 4, in <module>
    import torch
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 198, in <module>
    _load_global_deps()
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/__init__.py", line 144, in _load_global_deps
    if platform.system() == 'Windows' or sys.executable == 'torch_deploy':
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 897, in system
    return uname().system
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 863, in uname
    processor = _syscmd_uname('-p', '')
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/platform.py", line 622, in _syscmd_uname
    except (OSError, subprocess.CalledProcessError):
AttributeError: module 'subprocess' has no attribute 'CalledProcessError'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 202107) of binary: /home/abc/anaconda3/envs/OpenGait3.8/bin/python
Traceback (most recent call last):
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/abc/anaconda3/envs/OpenGait3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
opengait/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time : 2024-01-28_14:59:19
  host : dell-PowerEdge-R740
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 202108)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-01-28_14:59:19
  host : dell-PowerEdge-R740
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 202107)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thanks for the reply. I just got it working, although I don't know what I changed to make it run. It is training now.
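For anyone who lands here with the same `AttributeError: module 'subprocess' has no attribute 'check_output'`: the thread never confirms what the fix was, so the following is only a guess at one common cause. If a local file named `subprocess.py` (for example, next to the script being launched) sits earlier on `sys.path` than the standard library, `platform.py` imports that file instead of the real module. A minimal sketch for checking which file Python resolves; the helper names `find_module_origin` and `is_from_stdlib` are made up for illustration:

```python
import importlib.util
import os
import sysconfig


def find_module_origin(name):
    """Return the file path Python would load for `name`, or None if not found."""
    spec = importlib.util.find_spec(name)
    return getattr(spec, "origin", None) if spec else None


def is_from_stdlib(path):
    """Return True if `path` lies inside this interpreter's stdlib directory."""
    stdlib_dir = os.path.realpath(sysconfig.get_paths()["stdlib"])
    real_path = os.path.realpath(path)
    try:
        return os.path.commonpath([stdlib_dir, real_path]) == stdlib_dir
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


if __name__ == "__main__":
    origin = find_module_origin("subprocess")
    print("subprocess resolves to:", origin)
    # Assumption: a non-stdlib origin suggests a shadowing file; it does not
    # prove that this was the cause of the failure reported in this issue.
    if origin and not is_from_stdlib(origin):
        print("Not the standard-library copy; a local file may be shadowing it.")
```

Run it from the same directory and environment as the failing command. If the printed path points somewhere other than the environment's `lib/python3.8/`, renaming the offending file is worth trying before reinstalling PyTorch.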


miomiora commented 5 months ago

You're welcome, though I wasn't really able to help. I hope you get the results you are looking for.

Thank you for the kind wishes ❀

github-actions[bot] commented 3 months ago

Stale issue message