araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Docker repos missing files #7

Closed patterntrade closed 5 years ago

patterntrade commented 5 years ago

I have pulled (not built) the GPU edition of the rl-baselines-zoo image.

Trying to run:

docker run -it --runtime=nvidia --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind araffin/stable-baselines bash -c 'cd /root/code/stable-baselines/ && pytest tests/'

Am running:

sudo docker run --runtime=nvidia -it araffin/stable-baselines bash

After changing into /root/code/, I find that the directory is empty. It seems there is something wrong with the repository. I see similar issues with the rl-zoo image.

I have little experience with docker, so I might well have missed something.

Kind regards

araffin commented 5 years ago

Hello, Did you try running experiments with the shell script?

./run_docker_gpu.sh python train.py --algo ppo2 --env CartPole-v1

I am not 100% sure that the GPU image works (I have to fix a bug where TF is installed without GPU support); however, the CPU image works: it is used for continuous integration.

Edit: as for the missing files, that is normal (cf. the Stable Baselines doc, where the command is explained).

patterntrade commented 5 years ago

The GPU image doesn't work; the error message looks like this:

... line 35, in <module>
    from tensorflow.python.keras import backend
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name 'abs'

Resolved by running the following in the container:

source venv/bin/activate
pip install keras
pip install --upgrade tensorflow-gpu

Now it works!

Thanks for setting up this repository and the docker images, very helpful.

Merry Christmas!

:-)
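A quick way to confirm that a fix like the one above took effect is to check, inside the container, whether TensorFlow imports cleanly and reports a GPU. The following is a minimal Python sketch and not part of the repository: the function name `tensorflow_gpu_status` is hypothetical, and it assumes a TensorFlow release that still provides `tf.test.is_gpu_available()` (present in 1.x, deprecated in 2.x).

```python
import importlib


def tensorflow_gpu_status():
    """Return a small report on whether TensorFlow imports and sees a GPU.

    Hypothetical helper for illustration; it never raises, so it can be
    run safely in a container whose TensorFlow install may be broken.
    """
    try:
        tf = importlib.import_module("tensorflow")
    except Exception as exc:  # ImportError, or errors from a broken install
        return {"imported": False, "error": repr(exc)}

    report = {"imported": True, "version": getattr(tf, "__version__", "unknown")}
    try:
        # Assumption: this call exists in the installed TensorFlow release.
        report["gpu_available"] = bool(tf.test.is_gpu_available())
    except Exception as exc:
        report["gpu_available"] = None
        report["error"] = repr(exc)
    return report
```

Calling `print(tensorflow_gpu_status())` from the activated virtualenv shows at a glance whether the import error is gone and whether the GPU build is the one being used.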

araffin commented 5 years ago

Ok, I'll try to update the image then.

araffin commented 5 years ago

Hello again, I updated the docker image, it should be fixed now, can you confirm this?

patterntrade commented 5 years ago

Hi!

Thanks for writing.

Looking at GitHub, neither the Dockerfile nor the Docker build file has been changed. I still tried:

docker@sddub:~/Downloads$ docker run -it --runtime=nvidia --rm --network host --ipc=host --name test --mount src="$(pwd)",target=/root/code/stable-baselines,type=bind araffin/stable-baselines bash -c 'cd /root/code/stable-baselines/ && pytest tests/'

================================================ test session starts =================================================
platform linux -- Python 3.5.2, pytest-3.5.1, py-1.7.0, pluggy-0.6.0
rootdir: /root/code/stable-baselines, inifile:
plugins: cov-2.6.0

============================================ no tests ran in 0.00 seconds ============================================
ERROR: file not found: tests/

When I open a terminal in the image and cd into the directories:

root@cedb2fb4ba37:/# ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
root@cedb2fb4ba37:/# cd root
root@cedb2fb4ba37:~# ls
code  venv
root@cedb2fb4ba37:~# cd code
root@cedb2fb4ba37:~/code# ls
=0.10.9

I’m not sure if I’ve understood this right. Forgive me as I’m a novice to this. Was stable baselines supposed to be on board the docker container? It isn’t there. Was it supposed to be mapped/mounted to a stable baselines implementation on the host machine?

I looked through the build file; there’s no mention of the stable-baselines git repository or similar there, only other dependencies.

Looking forward to hearing from you.

Kind regards

araffin commented 5 years ago

Are you using this dockerfile: https://github.com/araffin/rl-baselines-zoo/blob/master/docker/Dockerfile.gpu ?

Stable-Baselines is installed here

The built image: https://hub.docker.com/r/araffin/rl-baselines-zoo

EDIT: Oh, I see: since the beginning you seem to have been using the stable-baselines docker image instead of the rl zoo docker image.

patterntrade commented 5 years ago

Hi! Thanks for replying so quickly.

Yes, erroneously, I was using stable-baselines. I’ll get the RL-zoo image and try it out.

Still, it means that the documentation of stable-baselines needs to be updated, or the Dockerfiles/images need to be changed.

Kind regards

araffin commented 5 years ago

The doc is already updated; cf. https://stable-baselines.readthedocs.io/en/master/guide/install.html#using-docker-images:

"If you are looking for docker images with stable-baselines already installed in it, we recommend using images from RL Baselines Zoo.

Otherwise, the following images contained all the dependencies for stable-baselines but not the stable-baselines package itself. They are made for development."
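The development workflow described in that quote amounts to bind-mounting a local checkout of stable-baselines into the container, which is what the `--mount` option in the command from the first post does. The sketch below rebuilds that command as an argument list. The helper name `dev_container_command` is hypothetical, and it assumes the command is run from a machine where the checkout exists at `checkout_dir`.

```python
import os
import shlex


def dev_container_command(checkout_dir,
                          image="araffin/stable-baselines",
                          inner_cmd="pytest tests/"):
    """Build the `docker run` argv that mounts a local checkout.

    Mirrors the command quoted in the thread; the function itself is a
    hypothetical illustration and does not start any container.
    """
    target = "/root/code/stable-baselines"
    # The host directory appears inside the container at `target`.
    mount = "src={},target={},type=bind".format(
        os.path.abspath(checkout_dir), target)
    return [
        "docker", "run", "-it", "--runtime=nvidia", "--rm",
        "--network", "host", "--ipc=host", "--name", "test",
        "--mount", mount,
        image,
        "bash", "-c", "cd {} && {}".format(target, inner_cmd),
    ]


def as_shell_line(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

If `checkout_dir` does not contain a stable-baselines checkout, `/root/code/stable-baselines` holds only whatever the host directory contains, which matches the "file not found: tests/" result above.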

patterntrade commented 5 years ago

Hi!

I have now tried twice to run this on Ubuntu 18 desktop, on two different installations: once natively, once with Docker (run_docker_gpu.sh). The image I’m using is araffin/rl-baselines-zoo. With both installations I get this error:

Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE)
++ seq 1 10

and so forth (see below).

On the first installation, I thought I’d removed some lock files before. I’ve scoured the web for solutions to this issue, haven’t found anything. Would appreciate any ideas on how to address this.

Kind regards

REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
araffin/rl-baselines-zoo   latest              c799b5127cf3        9 days ago          3.85GB
nvidia/cuda                9.0-base            74f5aea45cf6        2 months ago        134MB

sudo bash run_docker_gpu.sh python train.py --algo ppo2 --env CartPole-v1
Executing in the docker (gpu image): python train.py --algo ppo2 --env CartPole-v1

lshw
WARNING: you should run this program as super-user.
ub-desk
    description: Computer
    width: 64 bits
    capabilities: smp vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 47GiB
     *-cpu
          product: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 1199MHz
          capacity: 3800MHz
          width: 64 bits

araffin commented 5 years ago

Ok, did you try the CPU image? If it does not work with the CPU image, I'm afraid the problem may come from your machine, because the CPU image is tested at each push on Travis CI. What you are seeing is entrypoint.sh trying to create a fake X server so that any env that requires one can be launched. Btw, why do you have to use sudo? Did you follow the post-installation steps?
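One possible cause of the "Cannot establish any listening sockets" message is that the display number the fake X server asks for is already taken. This is an assumption; the thread does not confirm it. X servers conventionally leave a lock file `/tmp/.X<n>-lock` and a socket `/tmp/.X11-unix/X<n>` for display `<n>`, so a short check can list which numbers look occupied. The helper name `displays_in_use` is hypothetical.

```python
import glob
import os
import re


def displays_in_use(tmp_dir="/tmp"):
    """Return the sorted X display numbers that look taken under tmp_dir.

    Hypothetical helper: it only inspects the conventional lock files and
    socket names, it does not talk to any X server.
    """
    taken = set()
    # Lock files are named like /tmp/.X1-lock
    for path in glob.glob(os.path.join(tmp_dir, ".X*-lock")):
        match = re.fullmatch(r"\.X(\d+)-lock", os.path.basename(path))
        if match:
            taken.add(int(match.group(1)))
    # Unix sockets are named like /tmp/.X11-unix/X0
    for path in glob.glob(os.path.join(tmp_dir, ".X11-unix", "X*")):
        match = re.fullmatch(r"X(\d+)", os.path.basename(path))
        if match:
            taken.add(int(match.group(1)))
    return taken and sorted(taken) or []
```

Running `displays_in_use()` on the host and again inside the container shows whether the display number requested by the entrypoint overlaps with one that is already listed.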

patterntrade commented 5 years ago

Hi!

Thanks for your speedy answer!

Tried the cpu image, same error.

Thanks for the hint about the post installation, did that.

So, it must be something with my system. Will have to figure that out.

Kind regards.

patterntrade commented 5 years ago

Hi again.

I modified entrypoint.sh, rebuilt the GPU image, ran the container: 

ee@ub-desk:~/Desktop/docker$ bash run_docker_gpu.sh python train.py --algo ppo2 --env CartPole-v1
Executing in the docker (gpu image): python train.py --algo ppo2 --env CartPole-v1
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    from stable_baselines.common import set_global_seeds
  File "/root/venv/lib/python3.5/site-packages/stable_baselines/__init__.py", line 4, in <module>
    from stable_baselines.a2c import A2C
  File "/root/venv/lib/python3.5/site-packages/stable_baselines/a2c/__init__.py", line 1, in <module>
    from stable_baselines.a2c.a2c import A2C
  File "/root/venv/lib/python3.5/site-packages/stable_baselines/a2c/a2c.py", line 5, in <module>
    import tensorflow as tf
  File "/root/venv/lib/python3.5/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 63, in <module>
    from tensorflow.python.framework.framework_lib import *  # pylint: disable=redefined-builtin
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/framework/framework_lib.py", line 104, in <module>
    from tensorflow.python.framework.importer import import_graph_def
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/framework/importer.py", line 32, in <module>
    from tensorflow.python.framework import function
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/framework/function.py", line 36, in <module>
    from tensorflow.python.ops import resource_variable_ops
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 35, in <module>
    from tensorflow.python.ops import variables
  File "/root/venv/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 40, in <module>
    class Variable(checkpointable.CheckpointableBase):
AttributeError: module 'tensorflow.python.training.checkpointable' has no attribute 'CheckpointableBase'

Pretty sure this is an error in the code, unrelated to the fake X server issue.

Do you have any suggestions?

Kind regards

patterntrade commented 5 years ago

ADDITIONAL INFO: EXTRACTS FROM THE GPU IMAGE BUILD LOG

I edited entrypoint.sh so that it does not try to create a fake X server. Then I can build and run. I don't think the errors in the previous post are due to CartPole trying to display something; they come from the code. Might it be an issue with the version of TensorFlow used?
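To look into the version question, it helps to see which TensorFlow-related packages are actually installed side by side in the virtualenv. The sketch below is a hypothetical helper (`installed_versions`) that assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library. On the image's Python 3.5, `pip list` gives the same information.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_versions(prefixes=("tensorflow", "tensorboard")):
    """Map installed distribution names starting with `prefixes` to versions.

    Hypothetical helper for illustration; it reads package metadata only
    and does not import TensorFlow itself.
    """
    wanted = tuple(prefix.lower() for prefix in prefixes)
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(wanted):
            found[name] = dist.version
    return found
```

Seeing both a `tensorflow` and a `tensorflow-gpu` entry, or a `tensorboard` release from a different series, would be a hint that packages from different versions are mixed.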

Get:332 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libopenmpi-dev amd64 1.10.2-8ubuntu1 [537 kB]
**debconf: delaying package configuration, since apt-utils is not installed**
Fetched 225 MB in 7min 7s (527 kB/s)

Successfully installed virtualenv-16.3.0
**You are using pip version 8.1.1, however version 19.0.1 is available.**
You should consider upgrading via the 'pip install --upgrade pip' command.
Using base prefix '/usr'
New python executable in /root

**Collecting joblib (from stable-baselines==2.4.0)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /simple/joblib/**
  Downloading https://files.pythonhosted.org/packages/49/d9/4ea194a4c1d0148f9446054b9135f47218c23ccc6f649aeb09fab4c0925c/joblib-0.13.1-py2.py3-none-any.whl (278kB)

Successfully built html5lib
**tensorflow 1.12.0 has requirement tensorboard<1.13.0,>=1.12.0, but you'll have tensorboard 1.8.0 which is incompatible.**
Installing collected packages: html5lib, bleach, tensorboard, tensorflow-gpu

So docker build gave some warnings, but for some reason built the image anyway. I'm not sure whether that explains the issues in the previous entry.
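The highlighted pip warning can be checked by hand: tensorflow 1.12.0 asks for tensorboard `<1.13.0,>=1.12.0`, and 1.8.0 falls outside that range. Here is a minimal sketch of that comparison. It is deliberately simplified and handles only plain dotted numeric versions, not pip's full version rules; the names `parse_version` and `satisfies` are hypothetical.

```python
import operator

# Two-character operators come first so "<=" is not read as "<".
_OPERATORS = (
    ("<=", operator.le),
    (">=", operator.ge),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
)


def parse_version(text):
    """Turn '1.12.0' into (1, 12, 0); plain numeric versions only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '<1.13.0,>=1.12.0'."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                if not compare(current, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError("unsupported clause: {!r}".format(clause))
    return True
```

With the values from the log, `satisfies("1.8.0", "<1.13.0,>=1.12.0")` is false, so the warning describes a real mismatch, although pip at that time only reported it and continued.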

Now, every time I try to build a new Docker image, it just uses local files. Not sure how I can force it to redo from download, or if that has any merit at all.
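Docker reuses cached layers and a locally stored base image when it rebuilds, which is probably why the build seems to "just use local files". The `docker build` options `--no-cache` and `--pull` force every layer to be rebuilt and the base image to be fetched again. The sketch below only assembles such a command. The helper name and the tag `rl-baselines-zoo-gpu` are hypothetical, and the Dockerfile path is the one linked earlier in the thread, assumed to be used from the repository root.

```python
def fresh_build_command(dockerfile="docker/Dockerfile.gpu",
                        tag="rl-baselines-zoo-gpu",
                        context="."):
    """Build the argv for a `docker build` that ignores cached layers.

    Hypothetical helper: it returns the command and does not run it.
    """
    return [
        "docker", "build",
        "--no-cache",  # rebuild every layer instead of reusing the cache
        "--pull",      # fetch the base image again instead of the local copy
        "-f", dockerfile,
        "-t", tag,
        context,
    ]
```

Passing the result to `subprocess.run(fresh_build_command(), check=True)` would start a full rebuild, which takes as long as the first build did.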