Closed mayurmadnani closed 5 years ago
Has the local deepracer worked on your computer before?
The reason that the memory manager is printing blank lines is that the folders are not getting created properly (sagemaker is not working). I am taking a look at that issue now since it has also stopped working on my computer.
I pulled the repo and it seems to work now. Are you using all the default hyperparameters, reward function, and action space? Sometimes the behavior gets a bit erratic, so you can always try recloning the repository or restarting your computer.
@Michael-Equi The first time I ran it, it worked, but since the GPU was not detected I deleted the docker images and the project files. Since then I have not had a successful run. I am using all the default parameters. sagemaker still does not work for me; I tried again with the latest changes.
scripts/training/start.sh fails at
gnome-terminal -x sh -c "docker logs -f $(docker ps | awk ' /sagemaker/ { print $1 }')"
As I mentioned before, the sagemaker container is not running, so the next steps fail.
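When no sagemaker container is running, the command substitution in that line expands to nothing and docker logs is called without a container ID. A minimal sketch of a guard is shown here; the function names find_container_id and follow_sagemaker_logs are hypothetical and are not part of start.sh.

```shell
# Print the ID of the first container whose `docker ps` line matches a pattern.
# Reads `docker ps` output on stdin, so it can be tried without Docker running.
find_container_id() {
    awk -v pat="$1" '$0 ~ pat { print $1; exit }'
}

# Hypothetical wrapper: follow the sagemaker logs only if a container was found.
follow_sagemaker_logs() {
    id=$(docker ps | find_container_id sagemaker)
    if [ -z "$id" ]; then
        echo "sagemaker container is not running; skipping docker logs" >&2
        return 1
    fi
    docker logs -f "$id"
}
```

Calling follow_sagemaker_logs in place of the bare docker logs command would at least make the missing container visible instead of leaving an empty terminal.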
Have you gone through all the steps here? https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687
Another thing I could suggest is running through the python 3 venv steps under sagemaker in the README of this repository https://github.com/crr0004/deepracer
Another good place to ask these questions is http://join.deepracing.io, which is where the people who created the foundational software for running deepracer locally are. There is a lot of experience in that Slack channel, and people there might be able to help you better. Just notify me once you find a solution so it can be documented.
I'm running into an issue with sagemaker myself. If I go into the docker folder and run docker-compose up, the output is as follows:
rl_coach | WARNING:sagemaker:Parameter `image_name` is specified, `toolkit`, `toolkit_version`, `framework` are going to be ignored when choosing the image.
rl_coach | INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
robomaker | Starting >>> deepracer_simulation
robomaker | [0.363s] WARNING:colcon.colcon_ros.prefix_path.catkin:The path '/opt/ros/kinetic' in the environment variable CMAKE_PREFIX_PATH seems to be a catkin workspace but it doesn't contain any 'local_setup.*' files. Maybe the catkin version is not up-to-date?
robomaker | Starting >>> sagemaker_rl_agent
rl_coach | Looking for config file: /root/.sagemaker/config.yaml
rl_coach | Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
rl_coach | Uploading to s3://bucket/rl-deepracer-sagemaker
rl_coach | s3.ServiceResource()
rl_coach | Using provided s3_client
rl_coach | Starting training job
rl_coach | Using /robo/container for container temp files
rl_coach | Using /robo/container for container temp files
rl_coach | Trying to launch image: crr0004/sagemaker-rl-tensorflow:nvidia
Creating tmpze87r206_algo-1-qywcu_1 ... error
rl_coach |
rl_coach | ERROR: for tmpze87r206_algo-1-qywcu_1 Cannot create container for service algo-1-qywcu: Unknown runtime specified nvidia
rl_coach |
rl_coach | ERROR: for algo-1-qywcu Cannot create container for service algo-1-qywcu: Unknown runtime specified nvidia
rl_coach | Encountered errors while bringing up the project.
rl_coach | RuntimeError("Failed to run: ['docker-compose', '-f', '/robo/container/tmpze87r206/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1")
@Lacan82 Do you have nvidia-docker installed?
I recommend reading through the readme here https://github.com/crr0004/deepracer for better understanding of environment variables and GPU usage since it looks like your issue is related to that.
So the issue was exactly nvidia-docker. I had followed those instructions, but it needs to be nvidia-docker2. The first one was already working, but I hadn't installed the second one. Once I installed it, it worked.
For some reason, robomaker should have built a sagemaker container, which didn't happen. I attached to the container and ran the python file, after which it started working. I now see the crr0004/sagemaker-rl-tensorflow:console image built.
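The "Unknown runtime specified nvidia" error in the log above means Docker has no runtime named nvidia registered. One rough way to check for that before launching training is to look at the Runtimes line that docker info prints. This is only a sketch: has_nvidia_runtime is a hypothetical helper, and it assumes the Runtimes line lists the runtime names as plain text.

```shell
# Succeed if the `docker info` text on stdin lists an nvidia runtime.
# Assumption: docker info prints a line such as "Runtimes: nvidia runc".
has_nvidia_runtime() {
    grep -Eq '^[[:space:]]*Runtimes:.*nvidia'
}

# Example use (needs a running Docker daemon):
#   if docker info 2>/dev/null | has_nvidia_runtime; then
#       echo "nvidia runtime is registered"
#   else
#       echo "nvidia runtime is missing; nvidia-docker2 may not be installed"
#   fi
```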
I experienced the same issue. It definitely worked before (I even submitted a working model to DeepRacer). Doing a fresh install somehow made it stop working. How did you solve the issue?
EDIT: Solved it. It seems that for some reason the sagemaker docker image is not there.
If anyone is experiencing the same thing, try running this:
1. Pull the sagemaker image:
   docker pull crr0004/sagemaker-rl-tensorflow:console
2. Run the script training/start.sh
3. cd into the folder deepracer/rl_coach
4. Run python rl_deepracer_coach_robomaker.py
   sagemaker should run, but it won't connect with the other containers. Read on.
5. Stop everything using training/stop.sh
6. Now rerun the training using training/start.sh
For some unknown reason, it now works on mine.
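The steps above can be sketched as a small script. It defaults to a dry run that only prints each command. The SCRIPTS_DIR and COACH_DIR defaults are assumptions about the repository layout taken from the paths mentioned in this thread, and the python step may stay in the foreground until it is interrupted by hand.

```shell
# Assumed locations; adjust to your checkout.
SCRIPTS_DIR=${SCRIPTS_DIR:-scripts/training}
COACH_DIR=${COACH_DIR:-deepracer/rl_coach}

# Print a command, and execute it only when DRY_RUN is set to something other than 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" != "1" ]; then
        "$@"
    fi
}

sagemaker_workaround() {
    run docker pull crr0004/sagemaker-rl-tensorflow:console
    run "$SCRIPTS_DIR/start.sh"
    # May block until stopped manually; it is expected not to connect to the other containers.
    run sh -c "cd $COACH_DIR && python rl_deepracer_coach_robomaker.py"
    run "$SCRIPTS_DIR/stop.sh"
    run "$SCRIPTS_DIR/start.sh"
}

# Dry run:   sagemaker_workaround
# Real run:  DRY_RUN=0 sagemaker_workaround
```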
I am unable to get the complete setup running properly. When I run the start.sh script, I have three containers running and two terminals pop up, one for vncviewer and another for the memory manager. Looking at the script, there should be another one for the sagemaker logs. I checked the running docker containers and the sagemaker one was not there. Also, even after I give the correct sudo password to the memory manager terminal, nothing comes up. After running for some time, I found it only prints empty lines. I have double checked that the sagemaker-local network exists, that the necessary docker images are present, and that I have nvidia drivers installed.
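The manual checks described here (network present, images present) could be scripted roughly as follows. This is a sketch, not part of the repository: missing_images is a hypothetical helper, and the only image name used is the one that appears elsewhere in this thread.

```shell
# Print every required image (given as arguments) that is absent from the
# "repository:tag" list supplied on stdin.
missing_images() {
    present=$(cat)
    for image in "$@"; do
        if ! printf '%s\n' "$present" | grep -Fxq -- "$image"; then
            printf '%s\n' "$image"
        fi
    done
}

# Example use (needs a running Docker daemon):
#   docker images --format '{{.Repository}}:{{.Tag}}' \
#       | missing_images crr0004/sagemaker-rl-tensorflow:console
#   docker network ls --format '{{.Name}}' | grep -Fxq sagemaker-local \
#       || echo "sagemaker-local network is missing"
```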
Below is the list of packages installed in my conda env