aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacing training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution

VNC Viewer connection closes unexpectedly when running ./start.sh #18

Closed vsay01 closed 3 years ago

vsay01 commented 5 years ago

When I run ./start.sh, VNC Viewer does not stay running; the connection closes unexpectedly.

Here is the log:

vortana@vortana-System-Product-Name:~/Documents/awsdeepracer/deepracer-for-dummies/scripts/training$ ./start.sh 
minio is up-to-date
Recreating rl_coach ... done
Recreating robomaker ... done
waiting for containers to start up...
Attempting to pull up sagemaker logs...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.
Attempting to open vnc viewer...
# Option “-x” is deprecated and might be removed in a later version of gnome-terminal.
# Use “-- ” to terminate the options and put the command line to execute after it.

Screenshot from 2019-09-18 08-42-22

What should I do?

alexschultz commented 5 years ago

The issue is likely in the logs of another container. From the terminal, run docker ps, then for each of the containers run docker logs <container id>. Look through each of the logs to see if you find any errors and try to fix those.
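
The manual check above can be scripted. The following is a minimal sketch, assuming the docker CLI is on the PATH and the current user is allowed to run it. The helper names (find_error_lines, scan_running_containers) and the marker list are illustrative and not part of the repository.

```python
import subprocess

# Substrings that often mark a failure in container logs (an illustrative,
# non-exhaustive list chosen for this sketch).
ERROR_MARKERS = ("error", "traceback", "illegal instruction", "has died")


def find_error_lines(log_text, markers=ERROR_MARKERS):
    """Return the log lines that contain any marker, case-insensitively."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in markers)
    ]


def scan_running_containers():
    """Map each running container ID to its suspicious log lines."""
    ids = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    ).stdout.split()
    report = {}
    for container_id in ids:
        logs = subprocess.run(
            ["docker", "logs", container_id], capture_output=True, text=True
        )
        # docker logs replays the container's stdout and stderr separately,
        # so both streams are searched.
        report[container_id] = find_error_lines(logs.stdout + logs.stderr)
    return report
```

Calling scan_running_containers() returns a dictionary that can be printed per container. Note that docker ps lists only running containers; a container that has already exited would need docker ps -a instead.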

vsay01 commented 5 years ago

@alexschultz There are three container IDs; please refer to the attached screenshots. It looks like there's no issue. Can you take a look?

Screenshot from 2019-09-18 09-18-18

Screenshot from 2019-09-18 09-19-22

alexschultz commented 5 years ago

Just for completeness, did you go through the whole setup process in the readme?

vsay01 commented 5 years ago

@alexschultz I followed the instructions in https://medium.com/@autonomousracecarclub/how-to-run-deepracer-locally-to-save-your-wallet-13ccc878687 and keep checking your repo as well. Is there anything you want me to double-check?

vsay01 commented 5 years ago

@alexschultz I ran docker logs again on container ID 3b7ec5214cf2 (crr0004/deepracer_robomaker:console). This time I see a potential issue:


/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8:  1038 Illegal instruction     (core dumped) python3 -m markov.rollout_worker
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 977, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9.log].
log file: /root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9*.log
Initiating shutdown!

Screenshot from 2019-09-18 12-30-20
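
A note on reading this log: a shell reports a process killed by a signal as exit code 128 plus the signal number. Exit code 132 therefore corresponds to signal 4, which is SIGILL on Linux and matches the "Illegal instruction" message. A small sketch of that arithmetic follows; the function name is illustrative.

```python
import signal


def describe_exit_code(code):
    """Name the signal behind a shell-style exit code above 128."""
    if code > 128:
        # Shells encode "killed by signal N" as 128 + N.
        return signal.Signals(code - 128).name
    return "exited normally with status {}".format(code)
```

SIGILL means the CPU was asked to execute an instruction it does not implement, so the crash points at the binary and the hardware rather than at the VNC viewer.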

alexschultz commented 5 years ago

Yes, that is an issue. Is there any more information above that in the logs? Also, ensure your reward function parses correctly. You can use the DeepRacer console (under the create model section) to verify it parses correctly and to make sure there aren't any syntax errors.
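
The reward function can also be given a quick local syntax check before it is uploaded. This is a minimal sketch using Python's standard ast module. It assumes the entry point is named reward_function, as in the DeepRacer console examples. It checks syntax only, against the local Python version, and does not replace the console validation. The helper name check_reward_source is illustrative.

```python
import ast


def check_reward_source(source):
    """Return None if the source parses, else a short error description."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return "line {}: {}".format(exc.lineno, exc.msg)
    # Assumption: the simulator calls a top-level reward_function(params).
    names = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    if "reward_function" not in names:
        return "no top-level reward_function defined"
    return None
```

For example, check_reward_source(open("reward.py").read()) returns None when the file parses and defines the expected function.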

vsay01 commented 5 years ago

@alexschultz I invalidated the cache and restarted my IntelliJ editor, and it no longer complains about the math module that I use in my reward function.

I ran ./start.sh again, and this time it gives a different error:

[ERROR] [1568831750.198815335]: GetModelState: model [racecar] does not exist
[ERROR] [1568831752.297439573]: GetModelState: model [racecar] does not exist
[ERROR] [1568831754.301693689]: GetModelState: model [racecar] does not exist
[ERROR] [1568831756.305461056]: GetModelState: model [racecar] does not exist
[ERROR] [1568831760.048601362]: GetModelState: model [racecar] does not exist
[ERROR] [1568831762.052780039]: GetModelState: model [racecar] does not exist
[ERROR] [1568831764.057131392]: GetModelState: model [racecar] does not exist
[ERROR] [1568831766.061474188]: GetModelState: model [racecar] does not exist
[ERROR] [1568831768.065429162]: GetModelState: model [racecar] does not exist
[ERROR] [1568831770.069432011]: GetModelState: model [racecar] does not exist
[ERROR] [1568831772.073287162, 0.477000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831774.077116797, 2.021000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831776.081484112, 4.019000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831778.085346237, 6.019000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831780.089496673, 8.018000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831782.093484653, 10.016000000]: GetModelState: model [racecar] does not exist
[ERROR] [1568831784.097553430, 12.016000000]: GetModelState: model [racecar] does not exist
/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8: 1035 Illegal instruction (core dumped) python3 -m markov.rollout_worker
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 983, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/db2e0ae8-da42-11e9-b32c-0242ac120004/agent-9.log].
log file: /root/.ros/log/db2e0ae8-da42-11e9-b32c-0242ac120004/agent-9*.log
Initiating shutdown!

Traceback (most recent call last):
  File "/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/car_node.py", line 133, in
[ERROR] [1568831785.483034708, 12.868000000]: GetModelState: model [racecar] does not exist

alexschultz commented 5 years ago

It seems like something may have been renamed to "racecar" that shouldn't have been. Where is that name used within your setup? Just a guess, but it looks like it might refer to the model folder in the deepracer-for-dummies/docker/volumes/minio/bucket directory. If so, you will want to rename that directory to rl-deepracer-sagemaker. There is also a variable in the docker .env file that points to that directory; the two need to match.

vsay01 commented 5 years ago

Screenshot from 2019-09-18 15-16-46

Here is the content of the .env file:

WORLD_NAME=Mexico_track
LOCAL_ENV_VAR_JSON_PATH=env_vars.json
MINIO_ACCESS_KEY=minio
MINIO_SECRET_KEY=miniokey
AWS_ACCESS_KEY_ID=minio
AWS_SECRET_ACCESS_KEY=miniokey
AWS_DEFAULT_REGION=us-east-1
S3_ENDPOINT_URL=http://minio:9000
ROS_AWS_REGION=us-east-1
AWS_REGION=us-east-1
MODEL_S3_PREFIX=rl-deepracer-sagemaker
MODEL_S3_BUCKET=bucket
LOCAL=True
MARKOV_PRESET_FILE=deepracer.py
XAUTHORITY=/root/.Xauthority
DISPLAY_N=:0
METRIC_NAME=reward
METRIC_NAMESPACE=deepracer
APP_REGION=us-east-1
SAGEMAKER_SHARED_S3_PREFIX=rl-deepracer-sagemaker
SAGEMAKER_SHARED_S3_BUCKET=bucket
TRAINING_JOB_ARN=aaa
METRICS_S3_BUCKET=bucket
METRICS_S3_OBJECT_KEY=custom_files/metric.json
ROBOMAKER_RUN_TYPE=distributed_training
TARGET_REWARD_SCORE=100000
NUMBER_OF_EPISODES=20000
ROBOMAKER_SIMULATION_JOB_ACCOUNT_ID=aaa
AWS_ROBOMAKER_SIMULATION_JOB_ID=aaa
MODEL_METADATA_FILE_S3_KEY=custom_files/model_metadata.json
REWARD_FILE_S3_KEY=custom_files/reward.py
BUNDLE_CURRENT_PREFIX=/app/robomaker-deepracer/simulation_ws/
GPU_AVAILABLE=True
NUMBER_OF_TRIALS=5
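
The match between the .env file and the bucket folder can be checked mechanically. This is a minimal sketch; it assumes that MODEL_S3_PREFIX is the variable that has to match the folder name, which the thread does not state explicitly. The helper names parse_env and prefix_problems are illustrative.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def prefix_problems(env, bucket_dirs):
    """List problems where the model prefix has no matching bucket folder."""
    problems = []
    # Assumption: MODEL_S3_PREFIX names the model folder inside the bucket.
    model = env.get("MODEL_S3_PREFIX")
    if model not in bucket_dirs:
        problems.append("no bucket folder named {!r}".format(model))
    return problems
```

A typical call would pass parse_env(open(".env").read()) together with os.listdir("deepracer-for-dummies/docker/volumes/minio/bucket"); an empty result means the names agree.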

alexschultz commented 5 years ago

Did you verify that your Python is valid using the AWS console?

vsay01 commented 5 years ago

@alexschultz Yes, I validated it, and it's fine.

vsay01 commented 5 years ago

@alexschultz I ran with the existing reward function and still get the same issue: the VNC Viewer connection closes.

alexschultz commented 5 years ago

Can you try running the script to delete the last training run? You will need to use sudo. I'm wondering if there is a file sticking around from a previous attempt that is preventing everything from starting correctly.

vsay01 commented 5 years ago

@alexschultz I ran the delete-last-training script. When I run the start script, I no longer see the connection close issue, but no training is running. Screenshot from 2019-09-19 17-47-45

vsay01 commented 5 years ago

I ran docker logs on the sagemaker container, and there seem to be warnings related to memory:

21:M 19 Sep 2019 22:46:04.696 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
21:M 19 Sep 2019 22:46:04.696 # Server initialized
21:M 19 Sep 2019 22:46:04.696 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
21:M 19 Sep 2019 22:46:04.696 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
21:M 19 Sep 2019 22:46:04.696 * Ready to accept connections

abogaziah commented 5 years ago

Remove VNC Viewer, then download it from this link and reinstall it: https://www.techspot.com/downloads/5760-vnc-viewer.html
This worked for me.

JustinGuese commented 5 years ago

@alexschultz I ran docker logs again on container ID 3b7ec5214cf2 (crr0004/deepracer_robomaker:console). This time I see a potential issue:


/app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh: line 8:  1038 Illegal instruction     (core dumped) python3 -m markov.rollout_worker
================================================================================REQUIRED process [agent-9] has died!
process has died [pid 977, exit code 132, cmd /app/robomaker-deepracer/simulation_ws/install/deepracer_simulation/lib/deepracer_simulation/run_rollout_rl_agent.sh __name:=agent __log:=/root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9.log].
log file: /root/.ros/log/f28e17d2-da38-11e9-8619-0242ac120004/agent-9*.log
Initiating shutdown!

Screenshot from 2019-09-18 12-30-20

I am getting exactly the same error on a clean install including all pre-setup steps:

''' The VNC desktop is: f9a0f9c19c48:0


Have you tried the x11vnc '-ncache' VNC client-side pixel caching feature yet?

The scheme stores pixel data offscreen on the VNC viewer side for faster retrieval. It should work with any VNC viewer. Try it by running:

x11vnc -ncache 10 ...

One can also add -ncache_cr for smooth 'copyrect' window motion. More info: http://www.karlrunge.com/x11vnc/faq.html#faq-client-caching

'''

JustinGuese commented 5 years ago

Further error analysis, from the log file mentioned in the crr0004/deepracer_robomaker:console container (~/.ros/log/0fa43462-e6f2-11e9-8250-0242ac120004/rosout.log):

6.627000000 WARN [spawner:72(shutdown) [topics: /clock, /rosout] Controller Spawner error while taking down controllers: transport error completing service call: receive_once[/racecar/controller_manager/unload_controller]: unexpected error [Errno 4] Interrupted system call

jhart98169 commented 4 years ago

I found a similar post. It seems this issue occurs because my CPU does not support the AVX instructions that the most recent builds of TensorFlow require. If anybody has figured out how to build or get a version of TensorFlow working on older CPUs, please post. Thanks. This issue has more details: https://github.com/crr0004/deepracer/issues/35

shimin-happy commented 4 years ago

Have you solved the problem? I get exactly the same error.

shimin-happy commented 4 years ago

The error disappeared after I removed VNC and reinstalled it from Ubuntu Software. But the car does not move.

JustinGuese commented 4 years ago

I guess @jhart98169 was right: TensorFlow needs a newer CPU to work, otherwise it throws this error. You can run cat /proc/cpuinfo to see whether your processor supports AVX.
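
The same check can be done without reading the flag list by eye. This is a minimal sketch that assumes an x86 Linux host, where /proc/cpuinfo lists CPU features on lines starting with "flags". The function names are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo content."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text):
    """True if the 'avx' flag appears among the CPU flags."""
    return "avx" in cpu_flags(cpuinfo_text)
```

On the training machine, supports_avx(open("/proc/cpuinfo").read()) returning False would be consistent with the "Illegal instruction" crash described earlier in the thread.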

larsll commented 3 years ago

Closing as the cause of this issue was explained above.