apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

kvstore test failures #18098

Open szha opened 4 years ago

szha commented 4 years ago

Description

test_kvstore.py::test_aggregator and test_kvstore.py::test_sparse_aggregator segfault

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-cpu/detail/PR-18025/30/pipeline#step-191-log-1085
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-cpu/detail/PR-18025/44/pipeline#step-263-log-1091

dist_device_sync_kvstore.py::test_sync_push_pull

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/57/pipeline#step-760-log-1665

szha commented 4 years ago

The dist-kvstore tests also timed out: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/46/pipeline/

leezu commented 4 years ago

Related: https://github.com/apache/incubator-mxnet/issues/17829

szha commented 4 years ago

test_distributed_training-gpu.sh hangs. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/58/pipeline/426