Sense-X / Co-DETR

[ICCV 2023] DETRs with Collaborative Hybrid Assignments Training
MIT License

The error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #89

punyawat-jar opened this issue 11 months ago (status: Open)

punyawat-jar commented 11 months ago

Hello, I tried to train the model with this script: sh tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 1

I got the error:

    tools/..: NOTE: Redirects are currently not supported in Windows or MacOs.
    C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launch.py:186: FutureWarning:
        The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
        Note that --use_env is set by default in torchrun. If your script expects --local_rank argument
        to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See
        https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
      FutureWarning,
    [W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558]
        [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500
        (system error: 10049 - The requested address is not valid in its context.).
    [W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558]
        [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500
        (system error: 10049 - The requested address is not valid in its context.).
    OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
    OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program.
        That is dangerous, since it can degrade performance or cause incorrect results. The best thing
        to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by
        avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported,
        undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to
        allow the program to continue to execute, but that may cause crashes or silently produce
        incorrect results. For more information, please see
        http://www.intel.com/software/products/support/.
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 3) local_rank: 0
        (pid: 14832) of binary: C:\anaconda\envs\co-detr\python.exe
    Traceback (most recent call last):
      File "C:\anaconda\envs\co-detr\lib\runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "C:\anaconda\envs\co-detr\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launch.py", line 193, in <module>
        main()
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launch.py", line 189, in main
        launch(args)
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launch.py", line 174, in launch
        run(args)
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\run.py", line 718, in run
        )(*cmd_args)
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "C:\anaconda\envs\co-detr\lib\site-packages\torch\distributed\launcher\api.py", line 247, in launch_agent
        failures=result.failures,
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    tools/train.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2023-11-15_17:17:52
      host      : xxx
      rank      : 0 (local_rank: 0)
      exitcode  : 3 (pid: 14832)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
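For context, the log above shows two distinct Windows-side failures before torchrun's ChildFailedError summary: the c10d client socket cannot reach the rendezvous address that the hostname resolves to (kubernetes.docker.internal:29500, system error 10049), and two copies of Intel's OpenMP runtime (libiomp5md.dll) end up loaded in one process (OMP Error #15). A rough single-GPU workaround sketch, shown as a Windows (cmd) session; the torchrun flags are standard, but the train.py arguments are an assumption based on MMDetection-style repos, so adjust as needed:

```shell
:: Sketch only -- assumes an Anaconda prompt at the repo root.

:: OMP Error #15: permit duplicate OpenMP runtimes. Intel labels this
:: unsafe and unsupported, but it usually gets past the crash.
set KMP_DUPLICATE_LIB_OK=TRUE

:: Socket error 10049: pin the rendezvous to localhost instead of the
:: resolved hostname, and pick a port that is not blocked.
torchrun --nproc_per_node=1 --master_addr=127.0.0.1 --master_port=29501 ^
    tools/train.py projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py ^
    --launcher pytorch
```

Using torchrun also sidesteps the deprecated torch.distributed.launch module that the FutureWarning complains about: it sets LOCAL_RANK in the environment instead of passing a --local_rank flag.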
qomol commented 10 months ago

I got the same error, did you fix it? :(

liuzhibin55 commented 4 months ago


I got the same error, did you fix it? Thank you.

TempleX98 commented 4 months ago

You can try to use another port instead of 29500.
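If tools/dist_train.sh follows the usual MMDetection template (an assumption; such scripts typically read PORT=${PORT:-29500}), the port can be overridden without editing anything:

```shell
# Hypothetical override, assuming dist_train.sh honours the PORT variable:
PORT=29501 sh tools/dist_train.sh \
    projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 1
```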
