ARCC-RACE / deepracer-for-dummies

A quick way to get up and running with a local DeepRacer training environment

Sagemaker is not starting so training is not happening. #9

SnShine closed this issue 5 years ago

SnShine commented 5 years ago

I followed the post here: https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687.

When I ran ./start.sh, this was the output:

Creating minio ... done
Creating rl_coach ... done
Creating robomaker ... done
waiting for containers to start up...
Attempting to pull up sagemaker logs...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Attempting to open vnc viewer...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Starting memory manager...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.

The VNC viewer and memory manager start, but the SageMaker logs do not appear.

The command that ./start.sh uses for the logs is docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }'), so I ran docker ps. This is the output:

CONTAINER ID        IMAGE                                 COMMAND                  CREATED             STATUS                             PORTS                    NAMES
7ee88b200450        crr0004/deepracer_robomaker:console   "/bin/bash -c './run…"   43 seconds ago      Up 41 seconds                      0.0.0.0:8080->5900/tcp   robomaker
e62f0bc2e845        aschu/rl_coach                        "/bin/sh -c '(cd rl_…"   44 seconds ago      Up 42 seconds                                               rl_coach
2748106431ff        minio/minio                           "/usr/bin/docker-ent…"   45 seconds ago      Up 43 seconds (health: starting)   0.0.0.0:9000->9000/tcp   minio

So it seems SageMaker is not running at all, which is why no logs are shown. How can I fix this? Is something missing from start.sh or docker-compose.yml?
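For anyone hitting the same symptom, here is a minimal sketch of a log step that reports a missing container instead of failing silently. The function names `find_container_id` and `follow_logs` are made up for this illustration and are not part of start.sh; the sketch only assumes the standard `docker ps` and `docker logs -f` commands already used in the thread.

```shell
# Print the ID (first column) of the first `docker ps` row that contains the
# given text. Reads `docker ps` output on stdin, so it can be tried offline.
find_container_id() {
  awk -v pat="$1" 'NR > 1 && index($0, pat) { print $1; exit }'
}

# Follow the logs only if a matching container is actually running.
follow_logs() {
  local id
  id="$(docker ps | find_container_id "$1")"
  if [ -z "$id" ]; then
    echo "no running container matches '$1'" >&2
    return 1
  fi
  docker logs -f "$id"
}
```

Calling `follow_logs sagemaker` against the `docker ps` output above would print the "no running container" message, which points at the real problem sooner than an empty terminal window does.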

SnShine commented 5 years ago

Also, running ./start.sh for the first time gave this error: ERROR: Network sagemaker-local declared as external, but could not be found. Please create the network manually using `docker network create sagemaker-local` and try again.

I created that network manually, as the message suggests. Other than this, I haven't made any changes.
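The network step could be made repeatable with something like the sketch below, which creates the network only when it does not already exist. The helper name `ensure_network` is hypothetical; `docker network inspect` and `docker network create` are standard Docker CLI commands, and `sagemaker-local` is the network name from the error message.

```shell
# Create the named Docker network unless `docker network inspect` already
# finds it. Safe to run more than once.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}
```

Running `ensure_network sagemaker-local` before ./start.sh should avoid the "declared as external, but could not be found" error on a fresh machine.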

Michael-Equi commented 5 years ago

Hi SnShine,

Sorry you are having these problems. Do you have CUDA/cuDNN installed? You can check by running nvcc --version (if that command is not found, run sudo apt install nvidia-cuda-toolkit). Since I already had it all set up before I worked on DeepRacer local, I may not have realized I needed to put it in the guide. Run through the following and tell me if it works. If so, I will add it to the guide.

CUDA 10.0 Install: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork

sudo apt-get install cuda-libraries-10.0
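A quick way to confirm the tools mentioned here are on the PATH before starting training might look like the sketch below. `have_tool` and `check_prereqs` are hypothetical helper names for illustration; the only real commands assumed are the shell built-in `command -v` and whatever tool names are passed in.

```shell
# Return 0 if the named command is on PATH, 1 otherwise.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report each tool passed in; return non-zero if any of them is missing.
check_prereqs() {
  local missing=0 tool
  for tool in "$@"; do
    if have_tool "$tool"; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_prereqs docker nvcc && nvcc --version` prints the CUDA compiler version only when both tools are present.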

SnShine commented 5 years ago

Hey, I have CUDA installed. You could also add a step to the guide for pulling the SageMaker Docker image, which is not covered by the init script or the README.

I'm not sure exactly how I solved it, but it is working now. It may have been an issue with different Python versions: it works fine when I use conda's Python.
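The image-pull step suggested above could be sketched as follows. The thread does not name the SageMaker image, so the image is left as an argument to fill in; `ensure_image` is a hypothetical helper name, while `docker images -q` and `docker pull` are standard Docker CLI commands.

```shell
# Pull the given image only if no local copy is found.
# The image name is supplied by the caller; the thread does not state it.
ensure_image() {
  if [ -n "$(docker images -q "$1")" ]; then
    echo "image $1 already present"
  else
    docker pull "$1"
  fi
}
```

It would be called as `ensure_image <sagemaker-image>` with the actual image name substituted.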