Closed vsay01 closed 3 years ago
The issue is likely in the logs of another container.
From the terminal, run
docker ps
Then, for each of the containers, run
docker logs <container id>
Look through each of the logs to see if you find any errors, and try to fix those.
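If reading every log by hand gets tedious, the two commands above can be wrapped in a small script. This is only a sketch: the list of "suspicious" substrings is my own guess at what is worth flagging, not something from the repo, and the helper names are made up for illustration.

```python
import subprocess

# Substrings treated as signs of trouble. Illustrative guess, not an exhaustive list.
SUSPICIOUS = ("error", "traceback", "illegal instruction", "has died", "core dumped")


def find_suspicious_lines(log_text):
    """Return the log lines containing any SUSPICIOUS substring (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in SUSPICIOUS)
    ]


def running_container_ids():
    """Return the IDs printed by `docker ps -q`, one per running container."""
    result = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout.split()


def container_logs(container_id):
    """Return stdout and stderr of `docker logs <container id>` as one string."""
    result = subprocess.run(
        ["docker", "logs", container_id], capture_output=True, text=True
    )
    return result.stdout + result.stderr


def scan_running_containers():
    """Print every suspicious log line, prefixed with its container ID."""
    for container_id in running_container_ids():
        for line in find_suspicious_lines(container_logs(container_id)):
            print(f"{container_id}: {line}")
```

Calling scan_running_containers() on the Docker host prints the flagged lines; anything it misses still needs a manual read of the full log.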
@alexschultz There are three container IDs; please refer to the attached screenshots. It looks like there's no issue. Can you take a look?
Just for completeness, did you go through the whole setup process in the README?
@alexschultz I followed the instructions in https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687 and I keep checking your repo as well. Is there anything you want me to double-check?
@alexschultz I ran the logs again on container ID 3b7ec5214cf2 (crr0004/deepracer_robomaker:console). This time I see a potential issue:
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1038 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================
REQUIRED process [agent-9] has died!
process has died [pid 977, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9.log].
log file: /root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9*.log
Initiating shutdown!
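A side note on the exit code in that log: when a process is killed by a signal, shells conventionally report 128 plus the signal number, so 132 would correspond to signal 4, which is SIGILL on Linux and matches the "Illegal instruction" message. A minimal sketch of that arithmetic, assuming the 128 + N convention applies to how roslaunch reports it here (the function name is just for illustration):

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal name for a shell-style exit code above 128, else None."""
    if exit_code <= 128:
        return None
    try:
        # 128 + N convention: subtract 128 to recover the signal number.
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The number does not map to a signal known on this platform.
        return None


print(signal_from_exit_code(132))
```

An illegal-instruction crash usually means the binary uses CPU instructions the processor does not have, rather than a bug in the reward function.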
Yes, that is an issue. Is there any more information above that in the logs? Also, ensure your reward function parses correctly. You can use the DeepRacer console (under the create model section) to verify it parses correctly and to make sure there aren't any syntax errors.
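For a quick local check before going to the console, the reward file can also be parsed with Python's own ast module. This is a sketch and not a replacement for the console validation; it assumes the entry point is a top-level function called reward_function, which you should adjust if your file differs.

```python
import ast


def check_reward_source(source, function_name="reward_function"):
    """Return a list of problems found in the reward source; empty means it parsed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    # Only top-level function definitions are considered.
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    if function_name not in defined:
        return [f"no top-level function named {function_name!r}"]
    return []
```

Usage would be something like check_reward_source(open("reward.py").read()), where the path is whatever your reward file is called. It only catches syntax problems and a missing function, not runtime errors.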
@alexschultz I invalidated the cache and restarted my IntelliJ editor, and now it no longer complains about the math module that I use in my reward function.
I ran ./start.sh again; this time there is a different error:
Traceback (most recent call last):
File "/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/car_node.py", line 133, in
It seems like maybe something was renamed to "racecar" that shouldn't have been. Where is that name used within your setup?
Just a guess, but it looks like it might be referring to the model folder in the deepracer-for-dummies/docker/volumes/minio/bucket directory. If so, you will want to rename that directory to rl-deepracer-sagemaker. There is also a variable in the docker .env file which points to that directory; the two need to match.
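One way to check that the two match is to parse the .env file and compare the prefix values against the folder name. A rough sketch: I'm assuming the relevant variables are MODEL_S3_PREFIX and SAGEMAKER_SHARED_S3_PREFIX, and the helper names are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def prefix_mismatches(env, folder_name,
                      keys=("MODEL_S3_PREFIX", "SAGEMAKER_SHARED_S3_PREFIX")):
    """Return the keys (assumed names) whose value differs from the folder name."""
    return [key for key in keys if env.get(key) != folder_name]
```

With the file contents read into a string, prefix_mismatches(parse_env(text), "rl-deepracer-sagemaker") returning an empty list means the names agree.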
and here is the content of the .env
WORLD_NAME=Mexico_track
LOCAL_ENV_VAR_JSON_PATH=env_vars.json
MINIO_ACCESS_KEY=minio
MINIO_SECRET_KEY=miniokey
AWS_ACCESS_KEY_ID=minio
AWS_SECRET_ACCESS_KEY=miniokey
AWS_DEFAULT_REGION=us-east-1
S3_ENDPOINT_URL=http://minio:9000
ROS_AWS_REGION=us-east-1
AWS_REGION=us-east-1
MODEL_S3_PREFIX=rl-deepracer-sagemaker
MODEL_S3_BUCKET=bucket
LOCAL=True
MARKOV_PRESET_FILE=deepracer.py
XAUTHORITY=/root/.Xauthority
DISPLAY_N=:0
METRIC_NAME=reward
METRIC_NAMESPACE=deepracer
APP_REGION=us-east-1
SAGEMAKER_SHARED_S3_PREFIX=rl-deepracer-sagemaker
SAGEMAKER_SHARED_S3_BUCKET=bucket
TRAINING_JOB_ARN=aaa
METRICS_S3_BUCKET=bucket
METRICS_S3_OBJECT_KEY=custom_files/metric.json
ROBOMAKER_RUN_TYPE=distributed_training
TARGET_REWARD_SCORE=100000
NUMBER_OF_EPISODES=20000
ROBOMAKER_SIMULATION_JOB_ACCOUNT_ID=aaa
AWS_ROBOMAKER_SIMULATION_JOB_ID=aaa
MODEL_METADATA_FILE_S3_KEY=custom_files/model_metadata.json
REWARD_FILE_S3_KEY=custom_files/reward.py
BUNDLE_CURRENT_PREFIX=/app/robomaker-deepracer/simulation_ws/
GPU_AVAILABLE=True
NUMBER_OF_TRIALS=5
Did you verify that your Python is valid using the AWS console?
@alexschultz Yes, I did validate that, and it's fine.
@alexschultz I ran with the existing reward function and still have the same issue: the VNC viewer connection closes.
Can you try running the script to delete the last training run? You will need to use sudo. I'm wondering if there is a file that is sticking around from a previous attempt and maybe it's preventing everything from starting correctly.
@alexschultz I ran the delete-last-training script. When I run the start script, I don't see the connection close issue anymore, but no training is running.
I ran docker logs on sagemaker, and it seems like there's a warning related to memory:
21:M 19 Sep 2019 22:46:04.696 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
21:M 19 Sep 2019 22:46:04.696 # Server initialized
21:M 19 Sep 2019 22:46:04.696 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
21:M 19 Sep 2019 22:46:04.696 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
21:M 19 Sep 2019 22:46:04.696 * Ready to accept connections
Remove VNC.
Download it from this link and reinstall it: https://www.techspot.com/downloads/5760-vnc-viewer.html
This worked for me.
@alexschultz I ran the logs again on container ID 3b7ec5214cf2 (crr0004/deepracer_robomaker:console). This time I see a potential issue:
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1038 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================
REQUIRED process [agent-9] has died!
process has died [pid 977, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9.log].
log file: /root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9*.log
Initiating shutdown!
I am getting exactly the same error on a clean install including all pre-setup steps:
'''
The VNC desktop is: f9a0f9c19c48:0
Have you tried the x11vnc '-ncache' VNC client-side pixel caching feature yet?
The scheme stores pixel data offscreen on the VNC viewer side for faster retrieval. It should work with any VNC viewer. Try it by running:
x11vnc -ncache 10 ...
One can also add -ncache_cr for smooth 'copyrect' window motion. More info: http://www.karlrunge.com/x11vnc/faq.html#faq-client-caching
[ INFO] [1570225948.169537150]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1570225948.738130209, 0.454000000]: Physics dynamic reconfigure ready.
[racecar/controller_manager-5] escalating to SIGTERM
[WARN] [1570225972.554343, 5.831000]: Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/unload_controller]: unexpected error [Errno 4] Interrupted system call
[gazebo-2] escalating to SIGTERM
[INFO] [1570225949.717186, 0.000000]: Controller Spawner: Waiting for service controller_manager/load_controller
[INFO] [1570225950.959383, 1.245000]: Controller Spawner: Waiting for service controller_manager/switch_controller
[INFO] [1570225950.970375, 1.254000]: Controller Spawner: Waiting for service controller_manager/unload_controller
[INFO] [1570225950.978233, 1.259000]: Loading controller: left_rear_wheel_velocity_controller
[INFO] [1570225951.307121, 1.426000]: Loading controller: right_rear_wheel_velocity_controller
[INFO] [1570225952.026301, 1.823000]: Loading controller: left_front_wheel_velocity_controller
[INFO] [1570225952.799754, 2.248000]: Loading controller: right_front_wheel_velocity_controller
[INFO] [1570225953.516918, 2.893000]: Loading controller: left_steering_hinge_position_controller
[INFO] [1570225954.231451, 3.517000]: Loading controller: right_steering_hinge_position_controller
[INFO] [1570225954.577454, 3.792000]: Loading controller: joint_state_controller
[INFO] [1570225954.633965, 3.839000]: Controller Spawner: Loaded controllers: left_rear_wheel_velocity_controller, right_rear_wheel_velocity_controller, left_front_wheel_velocity_controller, right_front_wheel_velocity_controller, left_steering_hinge_position_controller, right_steering_hinge_position_controller, joint_state_controller
[INFO] [1570225954.645241, 3.847000]: Started controllers: left_rear_wheel_velocity_controller, right_rear_wheel_velocity_controller, left_front_wheel_velocity_controller, right_front_wheel_velocity_controller, left_steering_hinge_position_controller, right_steering_hinge_position_controller, joint_state_controller
[INFO] [1570225957.517491, 5.823000]: Shutting down spawner. Stopping and unloading controllers...
[INFO] [1570225957.517979, 5.823000]: Stopping all controllers...
[INFO] [1570225957.545764, 5.831000]: Unloading all loaded controllers...
[INFO] [1570225957.546082, 5.831000]: Trying to unload joint_state_controller
n_ws/install/deepracer_simulation/share/deepracer_simulation/launch/distributed_training.launch http://localhost:11311
setting /run_id to 3fee9a46-e6f1-11e9-8327-0242ac120004
process[rosout-1]: started with pid [848]
started core service [/rosout]
process[gazebo-2]: started with pid [861]
process[gazebo_gui-3]: started with pid [876]
process[racecar_spawn-4]: started with pid [942]
process[racecar/controller_manager-5]: started with pid [1049]
process[robot_state_publisher-6]: started with pid [1160]
[racecar_spawn-4] process has finished cleanly
log file: /root/.ros/log/3fee9a46-e6f1-11e9-8327-0242ac120004/racecar_spawn-4*.log
process[car_reset_node-7]: started with pid [1195]
process[better_odom-8]: started with pid [1300]
process[agent-9]: started with pid [1563]
[agent-9] killing on exit
[better_odom-8] killing on exit
[car_reset_node-7] killing on exit
[robot_state_publisher-6] killing on exit
[racecar/controller_manager-5] killing on exit
[gazebo_gui-3] killing on exit
[gazebo-2] killing on exit
[rosout-1] killing on exit
[master] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
'''
Further error analysis:
The log file mentioned in the crr0004/deepracer_robomaker:console output (~/.ros/log/0fa43462-e6f2-11e9-8250-0242ac120004/rosout.log) shows:
6.627000000 WARN [spawner:72(shutdown) [topics: /clock, /rosout] Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/unload_controller]: unexpected error [Errno 4] Interrupted system call
I found a similar post. It seems this issue occurs because my CPU does not support the AVX instructions that the most recent builds of TensorFlow require. If anybody has figured out how to build or get a version of TensorFlow working on older CPUs, please post. Thanks. This issue has more details: https://github.com/crr0004/deepracer/issues/35
Have you solved the problem? I got exactly the same error.
The error disappeared after I removed VNC and reinstalled it from Ubuntu Software. But the car does not move.
I guess @jhart98169 was right: TensorFlow needs a newer CPU to work; otherwise it throws this error. You can run cat /proc/cpuinfo to see whether your processor supports AVX.
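If the cpuinfo output is too long to scan by eye, the same check can be scripted. A minimal sketch, assuming an x86 Linux host where /proc/cpuinfo lists features on lines starting with "flags" (other architectures label them differently); the function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flag names from every 'flags' line of /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_path="/proc/cpuinfo"):
    """Return True if 'avx' appears among the CPU flags in the given file."""
    with open(cpuinfo_path) as handle:
        return "avx" in cpu_flags(handle.read())
```

If supports_avx() returns False on the Docker host, the "Illegal instruction" crash in the robomaker container is consistent with the AVX explanation above.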
Closing as the cause of this issue was explained above.
When I try to run ./start, VNC does not run and closes unexpectedly.
Here is the log:
What should I do?