Open kflagg opened 3 years ago
Maybe I should add, my notebook instance is an ml.t2.medium
This line
Failed to delete: /tmp/tmpk8k59j5a/algo-1-eit01 Please remove it manually.
suggests that you might have some zombie docker process.
What is the output of
docker ps -a
Right after attempting the training job:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5c58904fab8a 462105765813.dkr.ecr.us-east-1.amazonaws.com/sagemaker-rl-coach-container:coach-1.0.0-tf-cpu-py3 "bash -m start.sh tr…" 35 seconds ago Exited (1) 1 second ago xz73m5hj42-algo-1-3wa4d
Okay, try removing this container and then run the training cell again.
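If it helps, here is a rough Python sketch of that cleanup which could be run from a notebook cell. It assumes the `docker` CLI is on the PATH and uses only the standard `docker ps` (`--filter`, `--format`) and `docker rm` commands. The helper names `parse_exited` and `remove_exited_containers` are made up for illustration. Note that it targets every exited container on the instance, not only the SageMaker one, so it defaults to a dry run.

```python
import subprocess


def parse_exited(ps_output):
    """Turn tab-separated `docker ps` output into (container_id, name) pairs."""
    containers = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        container_id, _, name = line.partition("\t")
        containers.append((container_id, name))
    return containers


def remove_exited_containers(dry_run=True):
    """List exited containers and, unless dry_run is set, remove them.

    Hypothetical helper: assumes the docker CLI is installed and that the
    current user is allowed to talk to the Docker daemon.
    """
    listing = subprocess.run(
        ["docker", "ps", "-a", "--filter", "status=exited",
         "--format", "{{.ID}}\t{{.Names}}"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    containers = parse_exited(listing)
    for container_id, name in containers:
        if dry_run:
            print(f"would remove {container_id} ({name})")
        else:
            subprocess.run(["docker", "rm", container_id], check=True)
    return containers
```

Calling `remove_exited_containers()` first shows what would be removed; `remove_exited_containers(dry_run=False)` actually removes the containers. The temporary directory named in the "Failed to delete" message may still need to be deleted by hand.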
Link to the notebook
rl_cartpole_batch_coach.ipynb
Describe the bug
I am running the RL cartpole batch learning example notebook and I get an error when running estimator.fit(). It appears to complete all 30 training epochs, but then encounters an error. I am doing this in local mode using the conda_mxnet_p36 environment, but I get a similar error in the same place when not using local mode. I tried a couple of other environments and got the same errors.
Here are the last few lines of output:
And here is the traceback:
To Reproduce
Open rl_cartpole_batch_coach.ipynb in JupyterLab.
Change local_mode = False to local_mode = True (but I get a similar error when not running in local mode).