Snapple49 opened 5 years ago
So after running some more tests, I figured out that part of the problem was me not giving enough RAM to the pod. I guess trying to run AlexNet on 10k images with 2 GB of RAM was rather ambitious... However, even with about 200 images it crashes unless I give the pod about 8 GB of RAM; is this normal behaviour? This was again just my custom test, I have not yet tried the included MNIST example with more RAM; I'll report back on that tomorrow.
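For context, a back-of-the-envelope check (assuming 3-channel images at 1 byte per channel) puts the raw pixel data alone at roughly 10,000 × 256 × 256 × 3 bytes ≈ 1.97 GB, so a 2 GB pod was never going to fit the 10k-image run. The memory settings I'm giving the container now look roughly like this (a sketch only; the exact field placement is illustrative, and 8Gi is simply the value that made my runs stable):

```yaml
# Sketch of the container resources block on the DIGITS pod
# (originally 2Gi; ~8Gi was needed even for the ~200-image test):
resources:
  requests:
    memory: "8Gi"
  limits:
    memory: "8Gi"
```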
problem
I was recently at an NVIDIA course where we got an introduction to Deep Learning, and I really like DIGITS and would like to explore it a bit more. So I wanted to try running it in our local Kubernetes cluster, which has 10 GTX 1080 Ti cards, and I'm running into a few issues: mainly, I cannot run training on either TensorFlow or Caffe (I'm not an ML guy, but I prefer TF since it's the standard in our group). I sort of managed to run the MNIST example, but only using DIGITS container build 17.10, and now I wanted to try something more custom. Currently I'm hosting a pod of the
nvcr.io/nvidia/digits:19.09-tensorflow
image, but I tried Caffe as well with similar results. I'm also seeing a lot of warnings in the embedded examples; it seems a lot of things are deprecated, and looking through the repository it does not seem like much has changed in two years. I would really love to be able to utilize DIGITS though, it is an awesome tool!
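For reference, the pod hosting that image is defined roughly like this (a sketch with placeholder names; the GPU is requested through the NVIDIA device plugin's nvidia.com/gpu resource, and memory sizing is discussed in the update above):

```yaml
# Rough sketch of the DIGITS pod spec (metadata/names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: digits-test
spec:
  containers:
  - name: digits
    image: nvcr.io/nvidia/digits:19.09-tensorflow
    ports:
    - containerPort: 5000        # DIGITS web UI
    resources:
      limits:
        nvidia.com/gpu: 1        # one GTX 1080 Ti via the NVIDIA device plugin
        memory: "8Gi"            # see the RAM note in the update above
      requests:
        memory: "8Gi"
```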
details
platform: kubernetes
version: 1.14.6
host os: ubuntu 16.04.4
NVIDIA drivers (nvidia-smi):
NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1
docker image: nvcr.io/nvidia/digits:19.09-tensorflow
os in docker image: Ubuntu 18.04.2
docker version: 17.03.2-ce
nvidia-container-runtime version:
Not sure what else to dig up. Looking at this, I'm thinking it could be the nvidia container runtime, but as mentioned, TensorFlow complains about a lot of things being deprecated. Any help is appreciated!
job specifics
image size: 256x256 (dataset from Kaggle; creating the dataset works fine)
only change from defaults: 5 epochs
more details from the log:
Network.py:
logs
Logs from submitting the job up to the last message after the job failed: