aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] rl_cartpole_batch_coach example gives an error when training #2208

Open kflagg opened 3 years ago

kflagg commented 3 years ago

Link to the notebook rl_cartpole_batch_coach.ipynb

Describe the bug: I am running the RL cartpole batch learning example notebook and I get an error when running estimator.fit(). It appears to complete all 30 training epochs, but then encounters an error. I am doing this in local mode using the conda_mxnet_p36 environment, but I get a similar error in the same place when not using local mode. I tried a couple of other environments and got the same errors.

Here are the last few lines of output:

02hifyb2ec-algo-1-eit01 | Training Batch RL Models> Epoch=27, Reward Model Loss=9.223778806334698e-07
02hifyb2ec-algo-1-eit01 | Training Batch RL Models> Epoch=28, Reward Model Loss=8.255876659968953e-07
02hifyb2ec-algo-1-eit01 | Training Batch RL Models> Epoch=29, Reward Model Loss=7.376795045304849e-07
02hifyb2ec-algo-1-eit01 | 2021-05-07 15:27:07,541 sagemaker-containers ERROR    ExecuteUserScriptError:
02hifyb2ec-algo-1-eit01 | Command "/usr/bin/python train-coach.py --RLCOACH_PRESET preset-cartpole-ddqnbcq --save_model 1"
02hifyb2ec-algo-1-eit01 exited with code 1
1
Aborting on container exit...

And here is the traceback:

Failed to delete: /tmp/tmpk8k59j5a/algo-1-eit01 Please remove it manually.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    237         try:
--> 238             _stream_output(process)
    239         except RuntimeError as e:

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _stream_output(process)
    893     if exit_code != 0:
--> 894         raise RuntimeError("Process exited with code: %s" % exit_code)
    895 

RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    667         self._prepare_for_training(job_name=job_name)
    668 
--> 669         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    670         self.jobs.append(self.latest_training_job)
    671         if wait:

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
   1430         """
   1431         train_args = cls._get_train_args(estimator, inputs, experiment_config)
-> 1432         estimator.sagemaker_session.train(**train_args)
   1433 
   1434         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image_uri, algorithm_arn, encrypt_inter_container_traffic, use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics, profiler_rule_configs, profiler_config, environment)
    565         LOGGER.info("Creating training-job with name: %s", job_name)
    566         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 567         self.sagemaker_client.create_training_job(**train_request)
    568 
    569     def _get_train_request(  # noqa: C901

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
    184         hyperparameters = kwargs["HyperParameters"] if "HyperParameters" in kwargs else {}
    185         logger.info("Starting training job")
--> 186         training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
    187 
    188         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in start(self, input_data_config, output_data_config, hyperparameters, job_name)
    219 
    220         self.model_artifacts = self.container.train(
--> 221             input_data_config, output_data_config, hyperparameters, job_name
    222         )
    223         self.end_time = datetime.datetime.now()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    241             # which contains the exit code and append the command line to it.
    242             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 243             raise RuntimeError(msg)
    244         finally:
    245             artifacts = self.retrieve_artifacts(compose_data, output_data_config, job_name)

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpk8k59j5a/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

To Reproduce

kflagg commented 3 years ago

Maybe I should add that my notebook instance is an ml.t2.medium.

hongshanli23 commented 3 years ago

This line

Failed to delete: /tmp/tmpk8k59j5a/algo-1-eit01 Please remove it manually.

suggests that you might have a zombie Docker process.

What is the output of

docker ps -a

kflagg commented 3 years ago

Right after attempting the training job:

docker ps -a
CONTAINER ID        IMAGE                                                                                              COMMAND                  CREATED             STATUS                    PORTS               NAMES
5c58904fab8a        462105765813.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-coach-container:coach-1.0.0-tf-cpu-py3   "bash -m start.sh tr…"   35 seconds ago      Exited (1) 1 second ago                       xz73m5hj42-algo-1-3wa4d

hongshanli23 commented 3 years ago

Okay, try removing this container and then rerun the training cell.
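For reference, the cleanup step could be scripted roughly as follows. This is only a sketch under assumptions: it relies on the standard `docker ps` options (`-a`, `--filter status=exited`, `--format`) and on `docker rm`, the helper names are hypothetical, and matching on the `algo-1` name fragment is an assumption based on the container names shown in this thread. It has not been verified to resolve the training error itself.

```python
import subprocess


def parse_exited_containers(ps_output, name_fragment="algo-1"):
    """Parse lines of the form '<container id> <container name>'.

    Returns (id, name) pairs whose name contains name_fragment.
    The 'algo-1' default is an assumption taken from the names in this thread.
    """
    containers = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # split into id and name
        if len(parts) == 2 and name_fragment in parts[1]:
            containers.append((parts[0], parts[1].strip()))
    return containers


def list_exited_containers(name_fragment="algo-1"):
    """Ask Docker for all exited containers and keep the matching ones."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--filter", "status=exited",
         "--format", "{{.ID}} {{.Names}}"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text output; also works on Python 3.6
        check=True,
    )
    return parse_exited_containers(result.stdout, name_fragment)


def remove_containers(containers):
    """Remove each (id, name) pair with 'docker rm'."""
    for container_id, name in containers:
        print("Removing %s (%s)" % (container_id, name))
        subprocess.run(["docker", "rm", container_id], check=True)
```

Running `remove_containers(list_exited_containers())` in a notebook cell and then rerunning the training cell would correspond to the manual steps above. Listing first and removing second makes it easy to inspect what would be deleted before anything is removed.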