Right now, pytorch-cifar on a single p3.16xlarge ends the last epoch with the following error coming from all training processes.
cc @bearpelican
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40: driver shutting down