aws-deepracer-community / deepracer-core

A repository binding together everything needed for DeepRacer local.

S3UploadFailedError #91

Closed. viX-shaw closed this issue 3 years ago.

viX-shaw commented 4 years ago

```
Looking for config file: /home/vivek/.sagemaker/config.yaml
Model checkpoints and other metadata will be stored at: s3://bucket/rl-deepracer-sagemaker
Uploading to s3://bucket/rl-deepracer-sagemaker
WARNING:sagemaker:Parameter image_name is specified, toolkit, toolkit_version, framework are going to be ignored when choosing the image.
s3.ServiceResource()
Using provided s3_client
INFO:sagemaker:Creating training-job with name: rl-deepracer-sagemaker
Starting training job
Using /media/vivek/42B836B5B836A6F7/H/DeepRacer/robo/container for container temp files
Using /media/vivek/42B836B5B836A6F7/H/DeepRacer/robo/container for container temp files
Trying to launch image: crr0004/sagemaker-rl-tensorflow:console_v1.1
Creating tmp6hao6fy1_algo-1-cf199_1 ... done
Attaching to tmp6hao6fy1_algo-1-cf199_1
algo-1-cf199_1 | $1 is train
algo-1-cf199_1 | In train start.sh
algo-1-cf199_1 | Current host is "algo-1-cf199"
algo-1-cf199_1 | Compiling changehostname.c
algo-1-cf199_1 | Done Compiling changehostname.c
algo-1-cf199_1 | 15:C 05 Feb 2020 11:05:17.031 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
algo-1-cf199_1 | 15:C 05 Feb 2020 11:05:17.203 # Redis version=5.0.5, bits=64, commit=00000000, modified=0, pid=15, just started
algo-1-cf199_1 | 15:C 05 Feb 2020 11:05:17.203 # Configuration loaded
algo-1-cf199_1 | [Redis 5.0.5 ASCII-art startup banner: standalone mode, port 6379, PID 15]
algo-1-cf199_1 | 15:M 05 Feb 2020 11:05:17.206 # WARNING: The TCP backlog setting of 512 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
algo-1-cf199_1 | 15:M 05 Feb 2020 11:05:17.206 # Server initialized
algo-1-cf199_1 | 15:M 05 Feb 2020 11:05:17.206 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
algo-1-cf199_1 | 15:M 05 Feb 2020 11:05:17.206 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
algo-1-cf199_1 | 15:M 05 Feb 2020 11:05:17.209 * Ready to accept connections
algo-1-cf199_1 | 05/02/2020 11:05:17 passing arg to libvncserver: -rfbport
algo-1-cf199_1 | 05/02/2020 11:05:17 passing arg to libvncserver: 5800
algo-1-cf199_1 | 05/02/2020 11:05:17 x11vnc version: 0.9.13 lastmod: 2011-08-10 pid: 16
algo-1-cf199_1 | 05/02/2020 11:05:17 wait_for_client: WAIT:0
algo-1-cf199_1 | 05/02/2020 11:05:17 initialize_screen: fb_depth/fb_bpp/fb_Bpl 24/32/2560
algo-1-cf199_1 | 05/02/2020 11:05:17 Listening for VNC connections on TCP port 5800
algo-1-cf199_1 | 05/02/2020 11:05:17 Listening for VNC connections on TCP6 port 5900
algo-1-cf199_1 | 05/02/2020 11:05:17 Listening also on IPv6 port 5800 (socket 6)
algo-1-cf199_1 |
algo-1-cf199_1 | The VNC desktop is: 18f715a8d423:5800
algo-1-cf199_1 | 05/02/2020 11:05:17 possible alias: 18f715a8d423::5800
algo-1-cf199_1 | PORT=5800
algo-1-cf199_1 | JWM: warning: /etc/jwm/system.jwmrc[6]: invalid include: /etc/jwm/debian-menu
algo-1-cf199_1 | 2020-02-05 11:05:34,133 sagemaker-containers INFO Imported framework sagemaker_tensorflow_container.training
algo-1-cf199_1 | 2020-02-05 11:05:34,154 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
algo-1-cf199_1 | 2020-02-05 11:05:34,412 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
algo-1-cf199_1 | 2020-02-05 11:05:34,450 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
algo-1-cf199_1 | 2020-02-05 11:05:34,477 sagemaker-containers INFO Invoking user script
algo-1-cf199_1 |
algo-1-cf199_1 | Training Env:
algo-1-cf199_1 |
algo-1-cf199_1 | {
algo-1-cf199_1 |     "additional_framework_parameters": {
algo-1-cf199_1 |         "sagemaker_estimator": "RLEstimator"
algo-1-cf199_1 |     },
algo-1-cf199_1 |     "channel_input_dirs": {},
algo-1-cf199_1 |     "current_host": "algo-1-cf199",
algo-1-cf199_1 |     "framework_module": "sagemaker_tensorflow_container.training:main",
algo-1-cf199_1 |     "hosts": [
algo-1-cf199_1 |         "algo-1-cf199"
algo-1-cf199_1 |     ],
algo-1-cf199_1 |     "hyperparameters": {
algo-1-cf199_1 |         "s3_bucket": "bucket",
algo-1-cf199_1 |         "s3_prefix": "rl-deepracer-sagemaker",
algo-1-cf199_1 |         "aws_region": "us-east-1",
algo-1-cf199_1 |         "model_metadata_s3_key": "s3://bucket/custom_files/model_metadata.json",
algo-1-cf199_1 |         "RLCOACH_PRESET": "deepracer",
algo-1-cf199_1 |         "batch_size": 2
algo-1-cf199_1 |     },
algo-1-cf199_1 |     "input_config_dir": "/opt/ml/input/config",
algo-1-cf199_1 |     "input_data_config": {},
algo-1-cf199_1 |     "input_dir": "/opt/ml/input",
algo-1-cf199_1 |     "is_master": true,
algo-1-cf199_1 |     "job_name": "rl-deepracer-sagemaker",
algo-1-cf199_1 |     "log_level": 20,
algo-1-cf199_1 |     "master_hostname": "algo-1-cf199",
algo-1-cf199_1 |     "model_dir": "/opt/ml/model",
algo-1-cf199_1 |     "module_dir": "s3://bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz",
algo-1-cf199_1 |     "module_name": "training_worker",
algo-1-cf199_1 |     "network_interface_name": "eth0",
algo-1-cf199_1 |     "num_cpus": 4,
algo-1-cf199_1 |     "num_gpus": 0,
algo-1-cf199_1 |     "output_data_dir": "/opt/ml/output/data",
algo-1-cf199_1 |     "output_dir": "/opt/ml/output",
algo-1-cf199_1 |     "output_intermediate_dir": "/opt/ml/output/intermediate",
algo-1-cf199_1 |     "resource_config": {
algo-1-cf199_1 |         "current_host": "algo-1-cf199",
algo-1-cf199_1 |         "hosts": [
algo-1-cf199_1 |             "algo-1-cf199"
algo-1-cf199_1 |         ]
algo-1-cf199_1 |     },
algo-1-cf199_1 |     "user_entry_point": "training_worker.py"
algo-1-cf199_1 | }
algo-1-cf199_1 |
algo-1-cf199_1 | Environment variables:
algo-1-cf199_1 |
algo-1-cf199_1 | SM_HOSTS=["algo-1-cf199"]
algo-1-cf199_1 | SM_NETWORK_INTERFACE_NAME=eth0
algo-1-cf199_1 | SM_HPS={"RLCOACH_PRESET":"deepracer","aws_region":"us-east-1","batch_size":2,"model_metadata_s3_key":"s3://bucket/custom_files/model_metadata.json","s3_bucket":"bucket","s3_prefix":"rl-deepracer-sagemaker"}
algo-1-cf199_1 | SM_USER_ENTRY_POINT=training_worker.py
algo-1-cf199_1 | SM_FRAMEWORK_PARAMS={"sagemaker_estimator":"RLEstimator"}
algo-1-cf199_1 | SM_RESOURCE_CONFIG={"current_host":"algo-1-cf199","hosts":["algo-1-cf199"]}
algo-1-cf199_1 | SM_INPUT_DATA_CONFIG={}
algo-1-cf199_1 | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
algo-1-cf199_1 | SM_CHANNELS=[]
algo-1-cf199_1 | SM_CURRENT_HOST=algo-1-cf199
algo-1-cf199_1 | SM_MODULE_NAME=training_worker
algo-1-cf199_1 | SM_LOG_LEVEL=20
algo-1-cf199_1 | SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
algo-1-cf199_1 | SM_INPUT_DIR=/opt/ml/input
algo-1-cf199_1 | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
algo-1-cf199_1 | SM_OUTPUT_DIR=/opt/ml/output
algo-1-cf199_1 | SM_NUM_CPUS=4
algo-1-cf199_1 | SM_NUM_GPUS=0
algo-1-cf199_1 | SM_MODEL_DIR=/opt/ml/model
algo-1-cf199_1 | SM_MODULE_DIR=s3://bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz
algo-1-cf199_1 | SM_TRAINING_ENV={"additional_framework_parameters":{"sagemaker_estimator":"RLEstimator"},"channel_input_dirs":{},"current_host":"algo-1-cf199","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1-cf199"],"hyperparameters":{"RLCOACH_PRESET":"deepracer","aws_region":"us-east-1","batch_size":2,"model_metadata_s3_key":"s3://bucket/custom_files/model_metadata.json","s3_bucket":"bucket","s3_prefix":"rl-deepracer-sagemaker"},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"rl-deepracer-sagemaker","log_level":20,"master_hostname":"algo-1-cf199","model_dir":"/opt/ml/model","module_dir":"s3://bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz","module_name":"training_worker","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-cf199","hosts":["algo-1-cf199"]},"user_entry_point":"training_worker.py"}
algo-1-cf199_1 | SM_USER_ARGS=["--RLCOACH_PRESET","deepracer","--aws_region","us-east-1","--batch_size","2","--model_metadata_s3_key","s3://bucket/custom_files/model_metadata.json","--s3_bucket","bucket","--s3_prefix","rl-deepracer-sagemaker"]
algo-1-cf199_1 | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
algo-1-cf199_1 | SM_HP_S3_BUCKET=bucket
algo-1-cf199_1 | SM_HP_S3_PREFIX=rl-deepracer-sagemaker
algo-1-cf199_1 | SM_HP_AWS_REGION=us-east-1
algo-1-cf199_1 | SM_HP_MODEL_METADATA_S3_KEY=s3://bucket/custom_files/model_metadata.json
algo-1-cf199_1 | SM_HP_RLCOACH_PRESET=deepracer
algo-1-cf199_1 | SM_HP_BATCH_SIZE=2
algo-1-cf199_1 |
algo-1-cf199_1 | Invoking script with the following command:
algo-1-cf199_1 |
algo-1-cf199_1 | /usr/bin/python3.6 training_worker.py --RLCOACH_PRESET deepracer --aws_region us-east-1 --batch_size 2 --model_metadata_s3_key s3://bucket/custom_files/model_metadata.json --s3_bucket bucket --s3_prefix rl-deepracer-sagemaker
algo-1-cf199_1 |
algo-1-cf199_1 | Initializing SageS3Client...
algo-1-cf199_1 | Successfully downloaded model metadata from custom_files/model_metadata.json.
algo-1-cf199_1 | Using the following hyper-parameters
algo-1-cf199_1 | {
algo-1-cf199_1 |     "batch_size": 2,
algo-1-cf199_1 |     "beta_entropy": 0.01,
algo-1-cf199_1 |     "discount_factor": 0.999,
algo-1-cf199_1 |     "e_greedy_value": 0.05,
algo-1-cf199_1 |     "epsilon_steps": 10000,
algo-1-cf199_1 |     "exploration_type": "categorical",
algo-1-cf199_1 |     "loss_type": "mean squared error",
algo-1-cf199_1 |     "lr": 0.0003,
algo-1-cf199_1 |     "num_episodes_between_training": 20,
algo-1-cf199_1 |     "num_epochs": 10,
algo-1-cf199_1 |     "stack_size": 1,
algo-1-cf199_1 |     "term_cond_avg_score": 100000.0,
algo-1-cf199_1 |     "term_cond_max_episodes": 100000
algo-1-cf199_1 | }
algo-1-cf199_1 | Uploaded hyperparameters.json to S3
algo-1-cf199_1 | Uploaded IP address information to S3: 172.18.0.2
algo-1-cf199_1 | ## Creating graph - name: BasicRLGraphManager
algo-1-cf199_1 | Loaded action space from file: [{'steering_angle': -30, 'speed': 0.4, 'index': 0}, {'steering_angle': -30, 'speed': 0.8, 'index': 1}, {'steering_angle': -15, 'speed': 0.4, 'index': 2}, {'steering_angle': -15, 'speed': 0.8, 'index': 3}, {'steering_angle': 0, 'speed': 0.4, 'index': 4}, {'steering_angle': 0, 'speed': 0.8, 'index': 5}, {'steering_angle': 15, 'speed': 0.4, 'index': 6}, {'steering_angle': 15, 'speed': 0.8, 'index': 7}, {'steering_angle': 30, 'speed': 0.4, 'index': 8}, {'steering_angle': 30, 'speed': 0.8, 'index': 9}]
algo-1-cf199_1 | ## Creating agent - name: agent
algo-1-cf199_1 | Checkpoint> Saving in path=['./checkpoint/0_Step-0.ckpt']
algo-1-cf199_1 | {"simapp_exception": {"version": "1.0", "date": "2020-02-05 11:06:02.764148", "function": "save_to_store", "message": "Exception [Failed to upload /opt/ml/code/checkpoint/0_Step-0.ckpt.data-00000-of-00001 to bucket/rl-deepracer-sagemaker/model/0_Step-0.ckpt.data-00000-of-00001: An error occurred (NoSuchUpload) when calling the CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.] occured while uploading files on S3 for checkpoint", "exceptionType": "s3_datastore.exceptions", "eventType": "system_error", "errorCode": "500"}}
algo-1-cf199_1 | Traceback (most recent call last):
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/boto3/s3/transfer.py", line 279, in upload_file
algo-1-cf199_1 |     future.result()
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 106, in result
algo-1-cf199_1 |     return self._coordinator.result()
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 265, in result
algo-1-cf199_1 |     raise self._exception
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 126, in __call__
algo-1-cf199_1 |     return self._execute_main(kwargs)
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 150, in _execute_main
algo-1-cf199_1 |     return_value = self._main(**kwargs)
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 364, in _main
algo-1-cf199_1 |     extra_args)
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
algo-1-cf199_1 |     return self._make_api_call(operation_name, kwargs)
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
algo-1-cf199_1 |     raise error_class(parsed_response, operation_name)
algo-1-cf199_1 | botocore.errorfactory.NoSuchUpload: An error occurred (NoSuchUpload) when calling the CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
algo-1-cf199_1 |
algo-1-cf199_1 | During handling of the above exception, another exception occurred:
algo-1-cf199_1 |
algo-1-cf199_1 | Traceback (most recent call last):
algo-1-cf199_1 |   File "training_worker.py", line 252, in <module>
algo-1-cf199_1 |     main()
algo-1-cf199_1 |   File "training_worker.py", line 247, in main
algo-1-cf199_1 |     memory_backend_params=memory_backend_params
algo-1-cf199_1 |   File "training_worker.py", line 71, in training_worker
algo-1-cf199_1 |     graph_manager.save_checkpoint()
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 631, in save_checkpoint
algo-1-cf199_1 |     data_store.save_to_store()
algo-1-cf199_1 |   File "/opt/ml/code/markov/s3_boto_data_store.py", line 146, in save_to_store
algo-1-cf199_1 |     raise e
algo-1-cf199_1 |   File "/opt/ml/code/markov/s3_boto_data_store.py", line 102, in save_to_store
algo-1-cf199_1 |     Key=self._get_s3_key(rel_name))
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/boto3/s3/inject.py", line 131, in upload_file
algo-1-cf199_1 |     extra_args=ExtraArgs, callback=Callback)
algo-1-cf199_1 |   File "/usr/local/lib/python3.6/dist-packages/boto3/s3/transfer.py", line 287, in upload_file
algo-1-cf199_1 |     filename, '/'.join([bucket, key]), e))
algo-1-cf199_1 | boto3.exceptions.S3UploadFailedError: Failed to upload /opt/ml/code/checkpoint/0_Step-0.ckpt.data-00000-of-00001 to bucket/rl-deepracer-sagemaker/model/0_Step-0.ckpt.data-00000-of-00001: An error occurred (NoSuchUpload) when calling the CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
algo-1-cf199_1 | 2020-02-05 11:06:03,975 sagemaker-containers ERROR ExecuteUserScriptError:
algo-1-cf199_1 | Command "/usr/bin/python3.6 training_worker.py --RLCOACH_PRESET deepracer --aws_region us-east-1 --batch_size 2 --model_metadata_s3_key s3://bucket/custom_files/model_metadata.json --s3_bucket bucket --s3_prefix rl-deepracer-sagemaker"
```

I can see the whole checkpoint file in the local MinIO server, yet the upload still raises this exception.
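For anyone debugging the same thing: the root failure is `NoSuchUpload` from `CompleteMultipartUpload`, i.e. boto3 split the checkpoint into a multipart upload and the server no longer recognized the upload ID when it tried to complete it (the error text itself says the upload may have been aborted or already completed). One way to rule multipart handling in or out against a local MinIO endpoint is to force a single-part upload by raising boto3's multipart threshold. This is a minimal sketch, not code from this repo; the endpoint URL, credentials, and bucket/key names are placeholders taken from the log above:

```python
# Hypothetical repro/workaround sketch: force a single-part upload so that
# CompleteMultipartUpload is never issued against the local MinIO endpoint.
# Endpoint URL and credentials below are placeholders, not from this repo.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",      # local MinIO (placeholder)
    aws_access_key_id="minio_access_key",      # placeholder
    aws_secret_access_key="minio_secret_key",  # placeholder
    region_name="us-east-1",
)

# Raise the multipart threshold well above the checkpoint size (5 GiB here),
# so boto3 uses a single PutObject instead of a multipart upload.
single_part = TransferConfig(multipart_threshold=5 * 1024**3)

s3.upload_file(
    "/opt/ml/code/checkpoint/0_Step-0.ckpt.data-00000-of-00001",
    "bucket",
    "rl-deepracer-sagemaker/model/0_Step-0.ckpt.data-00000-of-00001",
    Config=single_part,
)
```

If the single-part upload of the same file succeeds, the problem is likely in the multipart bookkeeping between boto3 and the MinIO server rather than in the checkpoint file itself.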

breadcentric commented 3 years ago

We've moved on to using https://github.com/aws-deepracer-community/deepracer-for-cloud for running DeepRacer in a local environment.

Closing.