apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Distributed training program cannot exit after training finishes #21005

Open happylee524 opened 2 years ago

happylee524 commented 2 years ago

Description

I ran a distributed training program using the "launch.py" method, but after the training task finished, the program did not exit or return.

Error Message

This is the log: run log

This is my code: code

Steps to reproduce


  1. Start command: python /usr/local/lib/python3.8/dist-packages/mxnet/tools/launch.py -n 2 -H /tmp/algorithm/Host --sync-dst-dir /tmp/algorithm/mnist_sync --launcher ssh "python /tmp/algorithm/image_classification.py --dataset mnist --model alexnet --epochs 3 --gpus 0,1"
  2. I ran my code in two k8s pods using the image mxnet/python:2.0.0beta1_gpu_cu110_py3
  3. MXNet version: 2.0.0
  4. My code: image_classification.py, copied from the MXNet GitHub repository
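
To make the hang easier to observe, the start command from step 1 could be wrapped in a timeout. This is only a minimal sketch using the Python standard library: the helper names `build_launch_command` and `run_with_timeout` are hypothetical and not part of MXNet, and the timeout value is an arbitrary assumption. Note that a timeout only stops the local launcher process; remote workers started over ssh may keep running.

```python
import shlex
import subprocess
import sys

# Launcher part of the start command from step 1, copied verbatim.
LAUNCHER = (
    "python /usr/local/lib/python3.8/dist-packages/mxnet/tools/launch.py "
    "-n 2 -H /tmp/algorithm/Host --sync-dst-dir /tmp/algorithm/mnist_sync "
    "--launcher ssh"
)
# Training command; launch.py receives it as one quoted argument.
TRAINING = (
    "python /tmp/algorithm/image_classification.py "
    "--dataset mnist --model alexnet --epochs 3 --gpus 0,1"
)


def build_launch_command():
    """Return the start command as an argument list (hypothetical helper)."""
    return shlex.split(LAUNCHER) + [TRAINING]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not exit in time.

    subprocess.run kills the local child process when the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example on the launcher node (timeout of one hour is an assumption):
# exit_code = run_with_timeout(build_launch_command(), 3600)
# print("still running after timeout" if exit_code is None else exit_code)
```

A return value of `None` would indicate that the launcher was still running when the timeout expired, which matches the behaviour described in this issue.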

github-actions[bot] commented 2 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.