Issues with RLEstimator when training

phossen commented 3 years ago

While following the instructions in the markdown files, I got stuck on RLLibEnv/2_PolicyTraining.ipynb. In the cell that starts the training, the RLEstimator expects three further arguments: toolkit, toolkit_version, and framework. I fixed this by adding the following lines:

 toolkit=RLToolkit.COACH,
 toolkit_version='0.11.1',
 framework=RLFramework.TENSORFLOW,

After that change, the next problem occurred: when the RLEstimator invokes train-mabs.py with its parameters, the script fails. It looks like requirements.txt is not installed in the created Docker container. Ray is not installed, but that doesn't seem to be the only problem. Output:

Invoking script with the following command:

/usr/bin/python -m train-mabs --additional_configs clip_rewards=True,gamma=0.999,kl_coeff=0.2,lambda=0.9,lr=0.0005,num_sgd_iter=3,sample_batch_size=96,sgd_minibatch_size=256,train_batch_size=9216,vf_clip_param=175.0 --algorithm PPO --iterate_map_size False --map_size 11 --num_agents 4 --num_iters 10 --use_heuristics_action_masks False

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train-mabs.py", line 5, in <module>
    import ray
ModuleNotFoundError: No module named 'ray'
2020-12-22 16:35:23,079 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python -m train-mabs --additional_configs clip_rewards=True,gamma=0.999,kl_coeff=0.2,lambda=0.9,lr=0.0005,num_sgd_iter=3,sample_batch_size=96,sgd_minibatch_size=256,train_batch_size=9216,vf_clip_param=175.0 --algorithm PPO --iterate_map_size False --map_size 11 --num_agents 4 --num_iters 10 --use_heuristics_action_masks False"

2020-12-22 16:35:50 Uploading - Uploading generated training model
2020-12-22 16:35:50 Failed - Training job failed
ProfilerReport-1608654710: Stopping
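
For anyone hitting a similar ModuleNotFoundError, one way to see at a glance which packages the container is missing is to check for them before the real imports run. The snippet below is only a minimal sketch: the helper name missing_modules is hypothetical and not part of train-mabs.py, and the list of modules to check is an assumption based on the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Hypothetical helper for illustration; only top-level names are supported.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# "ray" is the module reported missing in the traceback above.
missing = missing_modules(["ray"])
if missing:
    print("Not installed in this environment:", ", ".join(missing))
```

If this reports ray as missing, the training job is most likely running in an image that was not built for Ray, so the image selection is a better place to look than the script.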

jonomon commented 3 years ago

Hi @phossen,

Thank you for reporting this. The toolkit we are using is Ray (i.e., RLToolkit.RAY), not Coach. The error is caused by an update to the SageMaker SDK. As a quick fix, please rename the image_name argument to image_uri, as shown below.

estimator = RLEstimator(entry_point="train-mabs.py",
                        source_dir='training/training_src',
                        dependencies=["training/common/sagemaker_rl", "inference/inference_src/", "../BattlesnakeGym/"],
                        image_uri=image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
                        hyperparameters={
                            # See train-mabs.py to add additional hyperparameters
                            # Also see ray_launcher.py for the rl.training.* hyperparameters

                            "num_iters": 10,
                            # number of snakes in the gym
                            "num_agents": num_agents,

                            "iterate_map_size": False,
                            "map_size": map_size,
                            "algorithm": algorithm,
                            "additional_configs": additional_config,
                            "use_heuristics_action_masks": False
                        }
                    )
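
If the notebook has to run against both older and newer SageMaker SDK releases, the renamed arguments can be chosen from the installed version. The sketch below is an illustration under assumptions, not part of the repository: the helper name estimator_resource_kwargs is hypothetical, and the name mapping follows the SDK v2 renames as I understand them (image_name to image_uri, train_instance_type to instance_type, train_instance_count to instance_count). Note that the snippet above still uses train_instance_type and train_instance_count, so please check the names against the SDK version you have installed.

```python
def estimator_resource_kwargs(sdk_version, image, instance_type, instance_count=1):
    """Build the image/instance keyword arguments for an estimator.

    Hypothetical helper: picks the argument names based on the major
    version of the SageMaker Python SDK (assumed rename in v2).
    """
    major = int(sdk_version.split(".")[0])
    if major >= 2:
        # Assumed SDK v2+ argument names.
        return {
            "image_uri": image,
            "instance_type": instance_type,
            "instance_count": instance_count,
        }
    # Assumed SDK v1 argument names.
    return {
        "image_name": image,
        "train_instance_type": instance_type,
        "train_instance_count": instance_count,
    }


# Possible usage in the notebook (not executed here):
#   import sagemaker
#   estimator = RLEstimator(entry_point="train-mabs.py", ...,
#                           **estimator_resource_kwargs(sagemaker.__version__,
#                                                       image_name, instance_type))
```

Keeping the mapping in one place means that only this helper needs to change if the argument names move again.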

phossen commented 3 years ago

Thank you, fixed it!

awslabs / sagemaker-battlesnake-ai

Issues with RLEstimator when training #30