StanfordVL / GibsonEnv

Gibson Environments: Real-World Perception for Embodied Agents
http://gibsonenv.stanford.edu/
MIT License

Running on Google Cloud #52

Open jhpenger opened 5 years ago

jhpenger commented 5 years ago

I am new to this. Does anyone have instructions on how to set up Gibson on a GCloud instance, and how to view the web UI for simulations on a GCloud instance?

Much appreciated.

fxia22 commented 5 years ago

Here is a guide to run on GCP: https://github.com/StanfordVL/GibsonEnv/wiki/Setup-guide-on-GCP

Let me know if you have more questions.

jhpenger commented 5 years ago

Thanks for the link. There is a typo under the section "Running": docker run --runtime=nvidia -ti -v <dataset root>:/root/mount/gibson/gibson/assets/data set -p 5001:5001 xf1280/gibson:0.3.1 (extra space in "data set"; it should be "dataset"). Also, I ran into an error running python examples/demo/benchmark_fps.py; here is what I got:

root@a98b1ef4a6f2:~/mount/gibson# python examples/demo/benchmark_fps.py
pybullet build time: Sep 27 2018 00:17:23
pygame 1.9.4
Hello from the pygame community. https://www.pygame.org/contribute.html
/root/mount/gibson/examples/demo/../configs/benchmark.yaml
Traceback (most recent call last):
  File "examples/demo/benchmark_fps.py", line 19, in <module>
    env = HuskyNavigateEnv(config=args.config, gpu_idx=args.gpu)
  File "/root/mount/gibson/gibson/envs/husky_env.py", line 38, in __init__
    tracking_camera=tracking_camera)
  File "/root/mount/gibson/gibson/envs/env_modalities.py", line 333, in __init__
    self.model_path = get_model_path(self.model_id)
  File "/root/mount/gibson/gibson/data/datasets.py", line 52, in get_model_path
    assert (model_id in os.listdir(data_path)) or model_id == 'stadium', "Model {} does not exist".format(model_id)
AssertionError: Model space7 does not exist
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/miniconda/envs/py35/lib/python3.5/site-packages/gym/utils/closer.py", line 67, in close
    closeable.close()
  File "/root/mount/gibson/gibson/envs/env_modalities.py", line 487, in _close
    self.r_camera_mul.terminate()
AttributeError: 'HuskyNavigateEnv' object has no attribute 'r_camera_mul'
Exception ignored in: <bound method Env.__del__ of <gibson.envs.husky_env.HuskyNavigateEnv object at 0x7fda12beb828>>
Traceback (most recent call last):
  File "/miniconda/envs/py35/lib/python3.5/site-packages/gym/core.py", line 203, in __del__
    self.close()
  File "/root/mount/gibson/gibson/envs/env_modalities.py", line 487, in _close
    self.r_camera_mul.terminate()
AttributeError: 'HuskyNavigateEnv' object has no attribute 'r_camera_mul'

Any idea what went wrong?

fxia22 commented 5 years ago

Thanks for pointing this out. It looks like there is no space7 in your dataset. Can you check if the dataset folder is correctly mounted? Can you find space7 in /root/mount/gibson/gibson/assets/dataset inside docker?

jhpenger commented 5 years ago

Thanks, my dataset folder in Docker is indeed empty. I mounted the volume correctly; when I attempt to delete the dataset folder it says it is being written to. Is it normal for the files to be copied in at such a slow pace? (I'll just download and unzip from inside Docker if it still doesn't work.) On another note, do you have any experience/insights with running Gibson on a cluster? Thanks.

fxia22 commented 5 years ago

The mount should happen pretty fast; I am not sure why it is mounting so slowly. Maybe using an SSD disk when creating the VM will help? Also, I used the absolute path of the dataset on the host when mounting the volume.

Yes downloading and then unzipping from docker container also works.

Yes, you can run Gibson on a cluster. Can you be a bit more specific about running Gibson on a cluster? What exactly are you looking for: do you want to run multiple instances of Gibson on the same machine, or do you want to run Gibson on multiple nodes?

jhpenger commented 5 years ago

Do you want to run multiple instances of gibson on the same machine?

Yes, we would like to do this, specifically when running small experiments to sanity-check our code. On the main Gibson project page, under the Framerate section, I see 1-8 processes listed with Gibson running at different resolutions. Does that mean up to 8 instances of Gibson are running concurrently on a single machine with a single GPU? If so, have you used multiple Gibson instances for distributed PPO training? Related to this, can you please explain the difference between frame sync and episode sync?

Do you want to run gibson on multiple nodes?

Yes, we would eventually also like to use GCP to do multiple-node multiple-instance distributed RL training.

Thanks!

fxia22 commented 5 years ago

For running multiple instances on one machine, you can refer to an example here. This script uses MPI and distributes n instances over n GPUs (each GPU is paired with 3 CPU cores for good scaling performance). It is an example of frame sync, as it synchronizes at each frame. You can change the script a bit to perform episode sync, where you collect the entire episode in each process and then gather it to the master process; episode sync will be slightly faster.
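
This is not the repo's script, just a minimal sketch of the two synchronization modes, assuming mpi4py and placeholder env/policy objects (the function names rollout_frame_sync and rollout_episode_sync are illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD

def rollout_frame_sync(env, policy, horizon):
    # Frame sync: every worker steps once, then all transitions are
    # gathered before anyone takes the next step.
    obs = env.reset()
    batches = []
    for _ in range(horizon):
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        batches.append(comm.allgather((obs, reward, done)))  # sync every frame
        if done:
            obs = env.reset()
    return batches

def rollout_episode_sync(env, policy, max_steps):
    # Episode sync: each worker rolls out a whole episode locally and the
    # master gathers full trajectories, so communication happens once per episode.
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return comm.gather(trajectory, root=0)  # sync once per episode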

We haven't tried distributed PPO training, but I believe it should be possible with the MPI script. Also, I haven't tried multiple nodes, but you are welcome to extend the above script to do it.

jhpenger commented 5 years ago

Is gym.make() supported for creating the environments? I don't really get the comment in __init__.py about gym.make().

fxia22 commented 5 years ago

@jhpenger That's deprecated; using env = SomeEnvClass(config=<config file path>) is the recommended way of creating an environment. We didn't end up using gym.make because we would have had to register a combinatorial number of environments, varying scene, agent, modality, etc.
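
For example, a minimal sketch of the direct-construction pattern (the config path here is illustrative; HuskyNavigateEnv and the config/gpu_idx arguments follow the call shown in the traceback above):

from gibson.envs.husky_env import HuskyNavigateEnv

# Construct the environment class directly with a config file instead of gym.make().
env = HuskyNavigateEnv(config='examples/configs/husky_navigate_rgb_train.yaml', gpu_idx=0)
obs = env.reset()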

jhpenger commented 5 years ago

@fxia22 I want to understand what the following output is for when creating an environment in Gibson. Is it for rendering the environment?

Processing the data:
Total 1 scenes 0 train 1 test
Indexing
  0%|          | 0/1 [00:00<?, ?it/s]number of devices found 1
Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=GeForce GTX TITAN X/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 410.73
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
finish loading shaders
100%|##########| 1/1 [00:00<00:00,  1.99it/s]
  9%|#         | 18/190 [00:01<02:14,  1.28it/s]terminate called after throwing an instance of 'zmq::error_t'
  what():  Address already in use
100%|##########| 190/190 [00:12<00:00, 16.75it/s]
/root/mount/gibson/gibson/core/render/pcrender.py:204: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  self.imgv = Variable(torch.zeros(1, 3 , self.showsz, self.showsz), volatile = True).cuda()
/root/mount/gibson/gibson/core/render/pcrender.py:205: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  self.maskv = Variable(torch.zeros(1,2, self.showsz, self.showsz), volatile = True).cuda()
fxia22 commented 5 years ago

@jhpenger It is because a previous depth_render process was not killed properly (probably because the main process crashed) and is still holding the socket port. You can see it by running ps -ef | grep depth, and you can kill it with pkill depth.

jhpenger commented 5 years ago

For husky_env, can we use CPUs for policy evaluation, or does it need a GPU allocated to run?

fxia22 commented 5 years ago

@jhpenger If it requires rendering then a GPU is needed. We use OpenGL rendering anyway, so a GPU is needed.

jhpenger commented 5 years ago

@fxia22 Which conv filters were used for train_husky_navigate_ppo1.py?

fxia22 commented 5 years ago

@jhpenger Depending on your resolution and quality settings, it will use one of the following from gibson/assets: model_64.pth, model_128.pth, model_256.pth, model_512.pth, model_small_64.pth, model_small_128.pth, model_small_256.pth.

jhpenger commented 5 years ago

@fxia22 For husky_env, using the configuration from husky_navigate_rgb_train.yaml, the observation_space is Box(128, 128, 4), but the observation returned by env.step(action) or env.reset() is a dictionary of size 23, 128, 128.

With 64 resolution using husky_navigate_nonviz_train.yaml, env.observation_space is Box(64, 64, 0); however, the obs returned from step() or reset() is either a dict or a tuple of length 23. Thus, env.observation_space.contains(np.array(observation)) always returns False.

Does Gibson pad the observation and turn it into an array in some other step?

I'm trying to use Gibson with ray-project, and I need step() and reset() to return values allowed in the observation space, so I am trying to figure these things out. Thanks.

jhpenger commented 5 years ago

Sorry, I didn't read the code thoroughly before I asked the last question. Fixed it by adding

# Stack rgb_filled and depth along the channel axis when depth is available; otherwise return just rgb_filled.
obs = np.concatenate([observations.get('rgb_filled'), observations.get('depth')], axis=2) if 'depth' in observations else observations.get('rgb_filled')
return obs

to env_modalities.BaseRobotEnv.step and env_modalities.CameraRobotEnv.reset.
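
As a less invasive alternative to editing env_modalities in place, here is a sketch of a gym.ObservationWrapper doing the same concatenation (key names taken from the snippet above; this assumes a gym version where observation() is the override hook):

import gym
import numpy as np

class DictToArrayObs(gym.ObservationWrapper):
    # Flatten Gibson's observation dict into one HxWxC array,
    # mirroring the concatenation in the patch above.
    def observation(self, observations):
        rgb = observations.get('rgb_filled')
        if 'depth' in observations:
            return np.concatenate([rgb, observations.get('depth')], axis=2)
        return rgb

Wrapping the environment (env = DictToArrayObs(env)) would then make step() and reset() return arrays without touching the Gibson source.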

By the way @fxia22, small typo on this line: it should be get_observations, not get_observation. https://github.com/StanfordVL/GibsonEnv/blob/8e9e23c2dcc44bb5442a25d12928103f05b51d8f/README.md#L454

jhpenger commented 5 years ago

@jhpenger If it requires rendering then a GPU is needed. We use OpenGL rendering anyway, so a GPU is needed.

@fxia22 I was testing husky_rgb training in headless mode, but a GPU was still used. How do I turn rendering off?

fxia22 commented 5 years ago

@jhpenger You can set mode: headless and display_ui: false in the yaml config file.

jhpenger commented 5 years ago

@fxia22 I had that, but it didn't work. I was able to run 2 Gibson envs without using the other package for parallel training, but I got insufficient CUDA resources when I ran with the other package. In both cases, the logs still showed the GPU being used when creating the environments, and that is what's causing my error.

number of devices found 1
Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=GeForce GTX TITAN X/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 410.73
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
finish loading shaders
100%|##########| 1/1 [00:00<00:00,  1.65it/s]
100%|##########| 190/190 [00:13<00:00, 14.50it/s]
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
fxia22 commented 5 years ago

GibsonEnv will use the GPU even in headless mode. Headless means there is no display/X server needed, but GibsonEnv still uses OpenGL to render the frames and PyTorch to fill holes in the rendered frames. So it is expected that GibsonEnv will consume some GPU memory.

fxia22 commented 5 years ago

When running at 256x256 resolution it uses about 1076 MB + 163 MB + 33 MB of GPU memory.