ARCC-RACE / deepracer-for-dummies

a quick way to get up and running with local deepracer training environment

sagemaker is not running #25

Closed mayurmadnani closed 5 years ago

mayurmadnani commented 5 years ago

I can't get the complete setup running. When I run the start.sh script, three containers start and two terminals pop up, one for vncviewer and one for the memory manager. Judging by the script, there should be a third terminal for the sagemaker logs. I checked the running docker containers and the sagemaker one is not there.

(screenshot: sagemaker docker output)

Also, even after I give the correct sudo password in the memory manager terminal, nothing comes up. After running for some time, it only prints empty lines.

(screenshot: open windows)

I have double-checked that the sagemaker-local network exists, the necessary docker images are present, and the nvidia drivers are installed.

(screenshots: docker images, nvidia-smi)

Below is the list of packages installed in my conda env

>> conda list
# packages in environment at /opt/miniconda3/envs/deepracer:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
ca-certificates           2019.5.15                     0  
certifi                   2019.6.16                py36_1  
cuda10.0                  1.0                           0    fragcolor
cudatoolkit               10.0.130                      0  
cudnn                     7.3.1                cuda10.0_0  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.1                  he6710b0_1  
openssl                   1.1.1c               h7b6447c_1  
pip                       19.1.1                   py36_0  
python                    3.6.9                h265db76_0  
readline                  7.0                  h7b6447c_5  
setuptools                41.0.1                   py36_0  
sqlite                    3.29.0               h7b6447c_0  
tk                        8.6.8                hbc83047_0  
wheel                     0.33.4                   py36_0  
xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               h7b6447c_3  
>> 
Michael-Equi commented 5 years ago

Has the local deepracer worked on your computer before?

Michael-Equi commented 5 years ago

The reason that the memory manager is printing blank lines is that the folders are not getting created properly (sagemaker is not working). I am taking a look at that issue now since it has also stopped working on my computer.

Michael-Equi commented 5 years ago

I pulled the repo and it seems to work now. Are you using all the default hyperparameters, reward function, and action space? The behavior is sometimes erratic, so you can always try recloning the repository or restarting your computer.

mayurmadnani commented 5 years ago

@Michael-Equi the first time I ran it, it worked, but since the GPU was not detected I deleted the docker images and the project files. Since then I have not had a successful run. I am using all the default parameters. sagemaker still does not work for me; I tried again with the latest changes.

mayurmadnani commented 5 years ago

scripts/training/start.sh fails at

gnome-terminal -x sh -c "docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')"

As I mentioned before, the sagemaker container is not running, so the next steps fail.
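
A guarded variant of that line might look like the sketch below: look up the container ID first and only open the log window when one is found. `find_container_id` is a hypothetical helper name and is not part of start.sh. It reads `docker ps` output on stdin, so it can be tried without Docker running.

```shell
#!/bin/sh
# Hypothetical helper (not part of start.sh): print the ID of the first
# container whose `docker ps` row matches the given pattern.
# Reads `docker ps` output on stdin; prints nothing when there is no match.
find_container_id() {
    # NR > 1 skips the header row; `found` keeps only the first match.
    awk -v pat="$1" 'NR > 1 && !found && $0 ~ pat { print $1; found = 1 }'
}

# Possible use inside start.sh (commented out so this file only defines
# the helper):
# id=$(docker ps | find_container_id sagemaker)
# if [ -n "$id" ]; then
#     gnome-terminal -x sh -c "docker logs -f $id"
# else
#     echo "sagemaker container is not running; skipping log window" >&2
# fi
```

With this guard the script would report the missing container instead of calling `docker logs -f` with an empty argument.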

Michael-Equi commented 5 years ago

Have you gone through all the steps here? https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687

Michael-Equi commented 5 years ago

Another thing I could suggest is running through the python 3 venv steps under sagemaker in the README of this repository https://github.com/crr0004/deepracer

Michael-Equi commented 5 years ago

Another good place to ask these questions is here: http://join.deepracing.io That is where the people who created the foundational software for running deepracer locally are. There is a lot of experience in that Slack channel that might help you better. Just notify me once you find a solution so it can be documented.

Lacan82 commented 5 years ago

I'm running into an issue with sagemaker myself. If I go into the docker folder and run docker-compose up, the output is as follows:

rl_coach     | WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
rl_coach     | INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
robomaker    | Starting >>> deepracer_simulation
robomaker    | [0.363s] WARNING:colcon.colcon_ros.prefix_path.catkin:The path '/opt/ros/kinetic' in the environment variable CMAKE_PREFIX_PATH seems to be a catkin workspace but it doesn't contain any 'local_setup.*' files. Maybe the catkin version is not up-to-date?
robomaker    | Starting >>> sagemaker_rl_agent
rl_coach     | Looking for config file: /root/.sagemaker/config.yaml
rl_coach     | Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
rl_coach     | Uploading to s3://bucket/rl-deepracer-sagemaker
rl_coach     | s3.ServiceResource()
rl_coach     | Using provided s3_client
rl_coach     | Starting training job
rl_coach     | Using /robo/container for container temp files
rl_coach     | Using /robo/container for container temp files
rl_coach     | Trying to launch image: crr0004/sagemaker-rl-tensorflow:nvidia
Creating tmpze87r206_algo-1-qywcu_1 ... error
rl_coach     | 
rl_coach     | ERROR: for tmpze87r206_algo-1-qywcu_1  Cannot create container for service algo-1-qywcu: Unknown runtime specified nvidia
rl_coach     | 
rl_coach     | ERROR: for algo-1-qywcu  Cannot create container for service algo-1-qywcu: Unknown runtime specified nvidia
rl_coach     | Encountered errors while bringing up the project.
rl_coach     | RuntimeError("Failed to run: ['docker-compose', '-f', '/robo/container/tmpze87r206/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1")
Michael-Equi commented 5 years ago

@Lacan82 Do you have nvidia-docker installed?

Michael-Equi commented 5 years ago

https://github.com/NVIDIA/nvidia-docker

Michael-Equi commented 5 years ago

I recommend reading through the README here https://github.com/crr0004/deepracer for a better understanding of environment variables and GPU usage, since it looks like your issue is related to that.

Lacan82 commented 5 years ago

So the issue was exactly nvidia-docker. I had followed those instructions, but it needs to be nvidia-docker2. The first one was already working, but I hadn't installed the second one. Once I installed it, it worked.
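
For anyone hitting the same "Unknown runtime specified nvidia" error, one way to check whether Docker knows about the runtime is to look at the "Runtimes:" line of `docker info`. The sketch below assumes that line lists the registered runtime names; `has_nvidia_runtime` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the `docker info` text on stdin lists an
# "nvidia" runtime on its "Runtimes:" line (assumed output layout).
has_nvidia_runtime() {
    # The first grep keeps only the "Runtimes:" line (not "Default Runtime:");
    # the second looks for "nvidia" as a whole word.
    grep -i '^ *Runtimes:' | grep -qw nvidia
}

# Possible use (commented out so this file only defines the helper):
# if docker info 2>/dev/null | has_nvidia_runtime; then
#     echo "nvidia runtime is registered with Docker"
# else
#     echo "nvidia runtime missing; nvidia-docker2 may not be installed" >&2
# fi
```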

mayurmadnani commented 5 years ago

For some reason, robomaker should have built a sagemaker container, which didn't happen. I attached to the container and ran the python file, after which it started working. I now see the crr0004/sagemaker-rl-tensorflow:console image built.

albertsundjaja commented 5 years ago

I experienced the same issue. It definitely worked before (I even submitted a working model to DeepRacer). Doing a fresh install somehow made it stop working. How did you solve the issue?

EDIT: solved it. It seems that for some reason the sagemaker docker image is not there.

If anyone is experiencing the same thing, try this:

  1. Pull the sagemaker image: docker pull crr0004/sagemaker-rl-tensorflow:console

  2. Run the script training/start.sh

  3. cd into the folder deepracer/rl_coach and run python rl_deepracer_coach_robomaker.py

  4. sagemaker should now run, but it won't connect with the other containers, so stop everything using training/stop.sh

  5. Now rerun the training using training/start.sh

For some unknown reason, it now works on mine.
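
Since the root cause seemed to be the missing image, step 1 could also be turned into a pre-check that runs before training/start.sh. This is only a sketch: `ensure_image` is a hypothetical helper name, and it assumes `docker images -q` prints nothing when the image is not present locally.

```shell
#!/bin/sh
# Hypothetical helper: pull the given image only if it is not already
# present locally. Prints nothing itself when the image already exists.
ensure_image() {
    image=$1
    # `docker images -q IMAGE` prints the image ID, or nothing if absent.
    if [ -z "$(docker images -q "$image")" ]; then
        docker pull "$image"
    fi
}

# Possible use before starting training (commented out so this file only
# defines the helper):
# ensure_image crr0004/sagemaker-rl-tensorflow:console
```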