cathywu / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

Can't run demo mujoco cluster experiments #4

Closed cathywu closed 7 years ago

cathywu commented 7 years ago

Not working:

python3 examples/cluster_gym_mujoco_demo.py

Working:

python3 examples/cluster_demo.py

Both scripts work when run locally.

Logs from cluster_demo.py:

sync initiated
log sync initiated
Running in docker
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: c7e0d89735c4
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1080] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1081] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
using seed 1
2017-04-18 00:58:11.498140 UTC | Setting seed to 1
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_17_47_48_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_17_47_48_0001/progress.csv seek() takes 2 positional arguments but 3 were given
Completed 1 file(s) with ~0 file(s) remaining (calculating...)
Completed 88 Bytes/88 Bytes with 1 file(s) remaining
upload: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_17_47_48_0001/variant.json to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_17_47_48_0001/variant.json
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_17_47_48_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_17_47_48_0001/progress.csv seek() takes 2 positional arguments but 3 were given
using seed 1

Logs from cluster_gym_mujoco_demo.py:

sync initiated
log sync initiated
Running in docker
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 9901f66f8504
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1080] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1081] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
using seed 1
2017-04-18 05:18:20.868390 UTC | Setting seed to 1
using seed 1
/opt/conda/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 136, in <module>
    run_experiment(sys.argv)
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 119, in run_experiment
    method_call = cloudpickle.loads(base64.b64decode(args.args_data))
  File "/root/code/rllab/sandbox/rocky/tf/policies/gaussian_mlp_policy.py", line 3, in <module>
    from sandbox.rocky.tf.core.layers_powered import LayersPowered
  File "/root/code/rllab/sandbox/rocky/tf/core/layers_powered.py", line 2, in <module>
    import sandbox.rocky.tf.core.layers as L
  File "/root/code/rllab/sandbox/rocky/tf/core/layers.py", line 336, in <module>
    class ParamLayer(Layer):
  File "/root/code/rllab/sandbox/rocky/tf/core/layers.py", line 337, in ParamLayer
    def __init__(self, incoming, num_units, param=tf.zeros_initializer(),
TypeError: zeros_initializer() missing 1 required positional argument: 'shape'
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
Completed 1 file(s) with ~0 file(s) remaining (calculating...)
Completed 110 Bytes/110 Bytes with 1 file(s) remaining
upload: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/variant.json to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/variant.json
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
Completed 1 file(s) with ~0 file(s) remaining (calculating...)
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/debug.log to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/debug.log seek() takes 2 positional arguments but 3 were given
Completed 1 file(s) with ~0 file(s) remaining (calculating...)
upload failed: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/progress.csv seek() takes 2 positional arguments but 3 were given
Completed 0 Bytes/~110 Bytes with ~1 file(s) remaining (calculating...)
Completed 110 Bytes/110 Bytes with 1 file(s) remaining
upload: ../../../Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_17_22_12_09_0001/variant.json to s3://cathywu/rllab/experiments/first-exp/first_exp_2017_04_17_22_12_09_0001/variant.json
cathywu commented 7 years ago

Related issues:

This suggests that the TensorFlow version on the cluster differs from the one in my local environment.

Local:

tensorflow.__version__ == '0.12.1'
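
Note that the traceback above fails while layers.py is being imported, not when a layer is constructed. That is because Python evaluates default argument values when the `def` statement runs, so `param=tf.zeros_initializer()` is called as soon as the class body executes. A minimal sketch of the same failure mode follows; `old_style_zeros_initializer` is a hypothetical stand-in written for illustration, not TensorFlow's function.

```python
def old_style_zeros_initializer(shape):
    """Hypothetical stand-in for an initializer that requires `shape`."""
    return [0.0] * shape


def define_layer():
    """Execute a class body whose default argument calls the initializer with no arguments."""
    class ParamLayer:
        # The default value is evaluated here, at definition time.
        def __init__(self, num_units, param=old_style_zeros_initializer()):
            self.param = param
    return ParamLayer


try:
    define_layer()
    error_message = None
except TypeError as exc:
    # Raised while the class body runs, before any instance is created.
    error_message = str(exc)

print(error_message)
```

If the initializer's call signature differs between library versions, code like this can import cleanly in one environment and fail at import time in another, which matches what the logs show.
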
cathywu commented 7 years ago

From @dementrock:

Yes. The cluster actually uses TF 0.11. However, I have just built new images that use TF 1.0. To use the new image, edit your config_personal.py file and set DOCKER_IMAGE = "dementrock/rllab3:20170417".

You can test locally that the Docker image works by setting the mode to "local_docker" in run_experiment_lite. You will need to install Docker first. See https://docs.docker.com/docker-for-mac/ if you are using a Mac.
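
Since this thread involves several TensorFlow versions (0.12.1 locally, 0.11 on the cluster, 1.0 in the new image), it can help to print the version from inside each environment and compare. The following is a minimal sketch; `installed_version` and `version_tuple` are hypothetical helper names, not part of rllab or TensorFlow, and the parser only handles plain dotted numeric versions.

```python
import importlib
import re


def installed_version(module_name):
    """Return module.__version__, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_tuple(version):
    """Turn a dotted version such as '0.12.1' into (0, 12, 1) for comparison."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


# Versions reported in this thread.
local_version = "0.12.1"
cluster_version = "0.11"

if version_tuple(local_version) != version_tuple(cluster_version):
    print("version mismatch:", local_version, "vs", cluster_version)

# Inside each environment (laptop, Docker image), one would call:
#     installed_version("tensorflow")
```

Running the last call both on the host and inside the container (for example via the "local_docker" mode) should show whether the two environments agree.
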

cathywu commented 7 years ago

Now testing Docker images locally with DOCKER_IMAGE = "dementrock/rllab3:20170417".

New issue (using examples/cluster_demo.py for fewer confounding factors):

docker run -e "AWS_SECRET_ACCESS_KEY=H9y23vX9G6TQnrU1l3SXBIgknwpk6Jk2HSJouP3N" -e "RLLAB_USE_GPU=False" -e "AWS_ACCESS_KEY_ID=AKIAJSJF3B3IYPZSONBQ" -v /Users/cathywu/.mujoco:/root/.mujoco -v /Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu/data/local/first-exp/first_exp_2017_04_18_00_14_52_0001:/tmp/expt -v /Users/cathywu/Dropbox/PhD/DeepRL-Traffic/rllabcathywu:/root/code/rllab -ti dementrock/rllab3:20170417 /bin/bash -c 'echo "Running in docker"; python /root/code/rllab/scripts/run_experiment_lite.py  --args_data 'gAJjY2xvdWRwaWNrbGUuY2xvdWRwaWNrbGUKX2ZpbGxfZnVuY3Rpb24KcQAoY2Nsb3VkcGlja2xlLmNsb3VkcGlja2xlCl9tYWtlX3NrZWxfZnVuYwpxAWNjbG91ZHBpY2tsZS5jbG91ZHBpY2tsZQpfYnVpbHRpbl90eXBlCnECWAgAAABDb2RlVHlwZXEDhXEEUnEFKEsBSwBLBUsSS0NjX2NvZGVjcwplbmNvZGUKcQZYigAAAHQAAHQBAMKDAADCgwEAfQEAdAIAZAEAfAEAagMAZAIAZBAAwoMAAn0CAHQEAGQBAHwBAGoDAMKDAAF9AwB0BQBkBAB8AQBkBQB8AgBkBgB8AwBkBwBkCABkCQBkCgBkCwBkDABkDQBkDgBkDwB8AABkDwAZwoMACH0EAHwEAGoGAMKDAAABZAAAU3EHWAYAAABsYXRpbjFxCIZxCVJxCihOWAgAAABlbnZfc3BlY3ELWAwAAABoaWRkZW5fc2l6ZXNxDEsgWAMAAABlbnZxDVgGAAAAcG9saWN5cQ5YCAAAAGJhc2VsaW5lcQ9YCgAAAGJhdGNoX3NpemVxEE2gD1gPAAAAbWF4X3BhdGhfbGVuZ3RocRFLZFgFAAAAbl9pdHJxEksoWAgAAABkaXNjb3VudHETRz/vrhR64UeuWAkAAABzdGVwX3NpemVxFEsgSyCGcRV0cRYoWAkAAABub3JtYWxpemVxF1gLAAAAQ2FydHBvbGVFbnZxGFgRAAAAR2F1c3NpYW5NTFBQb2xpY3lxGVgEAAAAc3BlY3EaWBUAAABMaW5lYXJGZWF0dXJlQmFzZWxpbmVxG1gEAAAAVFJQT3EcWAUAAAB0cmFpbnEddHEeKFgBAAAAdnEfaA1oDmgPWAQAAABhbGdvcSB0cSFYGAAAAGV4YW1wbGVzL2NsdXN0ZXJfZGVtby5weXEiWAgAAABydW5fdGFza3EjSwpoBlgeAAAAAAEPAgYBCQIJAxICBgEGAQYBBgEGAQYBBgEGAQ0EcSRoCIZxJVJxJikpdHEnUnEoXXEpfXEqh3ErUnEsfXEtKGgcY3JsbGFiLmFsZ29zLnRycG8KVFJQTwpxLmgZY3JsbGFiLnBvbGljaWVzLmdhdXNzaWFuX21scF9wb2xpY3kKR2F1c3NpYW5NTFBQb2xpY3kKcS9oGGNybGxhYi5lbnZzLmJveDJkLmNhcnRwb2xlX2VudgpDYXJ0cG9sZUVudgpxMGgbY3JsbGFiLmJhc2VsaW5lcy5saW5lYXJfZmVhdHVyZV9iYXNlbGluZQpMaW5lYXJGZWF0dXJlQmFzZWxpbmUKcTFoF2NybGxhYi5lbnZzLm5vcm1hbGl6ZWRfZW52Ck5vcm1hbGl6ZWRFbnYKcTJ1Tn1xM3RSLg=='  --log_dir '/tmp/expt'  --variant_data 
'gAN9cQAoWAQAAABzZWVkcQFLAVgJAAAAc3RlcF9zaXplcQJHP4R64UeuFHtYCAAAAGV4cF9uYW1lcQNYIgAAAGZpcnN0X2V4cF8yMDE3XzA0XzE4XzAwXzE0XzUyXzAwMDFxBHUu'  --seed '1'  --exp_name 'first_exp_2017_04_18_00_14_52_0001'  --snapshot_mode 'last'  --n_parallel '1'  --use_cloudpickle 'True'; sleep 120'
Running in docker
> /root/code/rllab/scripts/run_experiment_lite.py(8)<module>()
      7 ipdb.set_trace()
----> 8 from rllab.misc.ext import is_iterable, set_seed
      9 from rllab.misc.instrument import concretize

ipdb> from rllab.misc.ext import is_iterable, set_seed
*** ImportError: No module named 'rllab.misc'
ipdb> import rllab.misc
*** ImportError: No module named 'rllab.misc'
ipdb> import rllab.rllab.misc
ipdb> import rllab.rllab.misc.ext
*** ImportError: No module named 'rllab.misc'
ipdb> sys.path

sys.path: ['', '/root/code/rllab/scripts', '/root/code/rllab3', '/root/code', '/opt/conda/envs/rllab3/lib/python35.zip', '/opt/conda/envs/rllab3/lib/python3.5', '/opt/conda/envs/rllab3/lib/python3.5/plat-linux', '/opt/conda/envs/rllab3/lib/python3.5/lib-dynload', '/opt/conda/envs/rllab3/lib/python3.5/site-packages', '/opt/conda/envs/rllab3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg', '/opt/conda/envs/rllab3/lib/python3.5/site-packages/torchvision-0.1.8-py3.5.egg', '.', '/opt/conda/envs/rllab3/lib/python3.5/site-packages/IPython/extensions', '/root/.ipython']

cathywu commented 7 years ago

Temporary hack in run_experiment_lite.py:

# FIXME(cathywu): HACK for the repo root missing from sys.path in the 20170417 docker build
sys.path.append("/root/code/rllab")
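
One plausible reading of the ipdb session: sys.path contains /root/code but not /root/code/rllab, so `import rllab` picks up the repository directory itself as a namespace package, and the real package is only reachable as `rllab.rllab`. The sketch below reproduces that behavior in a throwaway directory. The name `demo_repo_pkg` and both helper functions are made up for illustration.

```python
import importlib
import os
import sys
import tempfile


def build_layout(root):
    """Create <root>/demo_repo_pkg/demo_repo_pkg/misc/__init__.py: a repo
    directory that contains a package of the same name."""
    pkg_dir = os.path.join(root, "demo_repo_pkg", "demo_repo_pkg")
    misc_dir = os.path.join(pkg_dir, "misc")
    os.makedirs(misc_dir)
    for directory in (pkg_dir, misc_dir):
        with open(os.path.join(directory, "__init__.py"), "w"):
            pass
    return os.path.join(root, "demo_repo_pkg")


def can_import(name):
    """Attempt a fresh import of `name`, dropping cached demo modules first."""
    for cached in [m for m in sys.modules if m.split(".")[0] == "demo_repo_pkg"]:
        del sys.modules[cached]
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


root = tempfile.mkdtemp()
repo_dir = build_layout(root)

# Only the parent of the repo is on sys.path (like '/root/code' in the session).
sys.path.append(root)
before = can_import("demo_repo_pkg.misc")                # outer dir acts as a namespace package
nested = can_import("demo_repo_pkg.demo_repo_pkg.misc")  # real package is one level down

# The workaround: also put the repo directory itself on sys.path.
sys.path.append(repo_dir)
after = can_import("demo_repo_pkg.misc")

print(before, nested, after)
```

This should print `False True True`: a regular package found on any sys.path entry takes precedence over a namespace directory, which is why appending the repo root is enough even though it comes last.
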

Status: examples/cluster_demo.py working again, examples/cluster_gym_mujoco_demo.py not working.

New issue: the MuJoCo binaries are not found inside the container.

Traceback (most recent call last):
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 138, in <module>
    run_experiment(sys.argv)
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 122, in run_experiment
    method_call(variant_data)
  File "examples/cluster_gym_mujoco_demo.py", line 26, in run_task
  File "/root/code/rllab/rllab/envs/gym_env.py", line 68, in __init__
    env = gym.envs.make(env_name)
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/registration.py", line 161, in make
    return registry.make(id)
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/registration.py", line 119, in make
    env = spec.make()
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/registration.py", line 85, in make
    cls = load(self._entry_point)
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/registration.py", line 17, in load
    result = entry_point.load(False)
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/pkg_resources/__init__.py", line 2258, in load
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/pkg_resources/__init__.py", line 2264, in resolve
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/mujoco/__init__.py", line 1, in <module>
    from gym.envs.mujoco.mujoco_env import MujocoEnv
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/gym/envs/mujoco/mujoco_env.py", line 11, in <module>
    import mujoco_py
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/mujoco_py/__init__.py", line 2, in <module>
    init_config()
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/mujoco_py/config.py", line 37, in init_config
    raise error.MujocoDependencyError('Found your MuJoCo license key but not binaries. Please put your binaries into ~/.mujoco/mjpro131 or set MUJOCO_PY_MJPRO_PATH. Follow the instructions on https://github.com/openai/mujoco-py for setup.')
mujoco_py.error.MujocoDependencyError: Found your MuJoCo license key but not binaries. Please put your binaries into ~/.mujoco/mjpro131 or set MUJOCO_PY_MJPRO_PATH. Follow the instructions on https://github.com/openai/mujoco-py for setup.
cathywu commented 7 years ago

Resolution: the Docker container needs the Linux version of the MuJoCo binaries.

Temporary hack in rllab/config.py:

MUJOCO_KEY_PATH = "/Users/cathywu/Dropbox/PhD/DeepRL-Traffic/mujoco_linux"  # for docker / ec2

Status: examples/cluster_demo.py, examples/cluster_gym_mujoco_demo.py both working.
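
Before mounting a MuJoCo directory into a Linux container, it can be worth checking which platform's shared libraries it holds. The sketch below is a hypothetical helper, not part of mujoco-py or rllab. It assumes only the usual naming conventions: `.so` for Linux shared libraries and `.dylib` for macOS ones. The demo file name is made up.

```python
import os
import tempfile


def shared_lib_kinds(directory):
    """Return the set of platforms ('linux', 'macos') whose shared libraries
    appear anywhere under `directory`, judged by file extension only."""
    kinds = set()
    for _dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            if name.endswith(".dylib"):
                kinds.add("macos")
            elif name.endswith(".so") or ".so." in name:
                kinds.add("linux")
    return kinds


# Demonstration on a throwaway directory holding one macOS-style library.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "bin"))
open(os.path.join(demo_dir, "bin", "libexample.dylib"), "w").close()

kinds = shared_lib_kinds(demo_dir)
print(kinds)
```

If a directory mounted into the container reports only `macos`, the Linux binaries still need to be supplied, as in the resolution above.
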

cathywu commented 7 years ago

Bonus: examples/cluster_walker_tf_comparison.py also works with mode="local_docker".

cathywu commented 7 years ago

Resolved by #5.