diux-dev / cluster

train on AWS

pytorch cifar example doesn't quit gracefully #47

Open yaroslavvb opened 6 years ago

yaroslavvb commented 6 years ago

Right now the pytorch-cifar example, run on a single p3.16xlarge, ends its last epoch with the following error, which comes from all training processes:

cc @bearpelican

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40: driver shutting down
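
For context, CUDA error 29 ("driver shutting down") is raised when a CUDA call is made while the runtime is already unloading, which suggests the gloo CUDA resources are being released during interpreter exit. One possible mitigation, not confirmed to resolve this issue, is to tear down the process group explicitly before the script exits. Below is a minimal sketch under that assumption. The helper name `shutdown_distributed` and its optional `dist`/`cuda` parameters are hypothetical, and the `torch.distributed` calls used are those of current PyTorch releases; the 2018 build shown in the trace may expose a different API.

```python
def shutdown_distributed(dist=None, cuda=None):
    """Release distributed resources while the CUDA runtime is still loaded.

    `dist` and `cuda` default to torch.distributed and torch.cuda; they are
    parameters only so the helper can be exercised without a GPU (assumption
    made for this sketch, not part of the original example).
    Returns True if a process group was destroyed, False otherwise.
    """
    if dist is None or cuda is None:
        # Imported lazily so the helper can be tested with stand-in objects.
        import torch
        import torch.distributed as torch_dist
        dist = dist or torch_dist
        cuda = cuda or torch.cuda

    # Wait for outstanding GPU work so nothing is in flight during teardown.
    if cuda.is_available():
        cuda.synchronize()

    # Destroy the process group explicitly instead of leaving it to
    # destructors that run at interpreter exit.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False
```

A training script would call `shutdown_distributed()` as the last step of its main function, for example inside a `finally:` block wrapped around the epoch loop, so that it also runs when training raises.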