Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License

[Proposal] Add RGB-D Render Mode to MuJoCo Environments #1226

Open DavidPL1 opened 3 days ago

DavidPL1 commented 3 days ago

Proposal

I propose adding a new render_mode to the MuJoCo environments for cases where both RGB and depth images are required as observations, e.g. to create point clouds.

(related issue: #727)
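To illustrate the point-cloud use case, here is a minimal sketch of back-projecting an RGB-D pair with a pinhole camera model. It is not part of the proposal or the fork: the function name `depth_to_point_cloud` and the `fovy_deg` parameter are hypothetical, and the sketch assumes the depth image already holds metric distances along the optical axis, square pixels, a centered principal point, and a camera frame with x to the right, y down and z forward. Synthetic arrays stand in for the rendered images.

```python
import numpy as np


def depth_to_point_cloud(rgb, depth, fovy_deg):
    """Back-project an RGB-D pair into an (N, 6) array of XYZRGB points.

    Assumes `depth` holds metric distances along the camera's optical axis
    and a pinhole camera with square pixels and a centered principal point.
    """
    height, width = depth.shape
    # Focal length in pixels, derived from the vertical field of view.
    focal = 0.5 * height / np.tan(np.deg2rad(fovy_deg) / 2.0)
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0

    # Pixel grid: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    z = depth.astype(np.float32)
    x = (u - cx) * z / focal
    y = (v - cy) * z / focal

    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3).astype(np.float32) / 255.0
    return np.concatenate([xyz, colors], axis=1)


# Synthetic stand-ins for the (rgb, depth) pair a render call would return.
rgb = np.full((4, 4, 3), 128, dtype=np.uint8)
depth = np.full((4, 4), 2.0, dtype=np.float32)
cloud = depth_to_point_cloud(rgb, depth, fovy_deg=45.0)
print(cloud.shape)  # (16, 6)
```

If the renderer returns raw, non-linear depth-buffer values instead of metric depth, they would have to be linearized with the camera's near and far planes before this step.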

Motivation

Currently, this can be achieved by calling MujocoEnv.mujoco_renderer.render twice, once with render_mode='rgb_array' and once with render_mode='depth_array'. Internally, however, both the RGB and the depth image are rendered on every call, and only one of them is returned depending on the render_mode. Returning both from a single call would therefore avoid rendering each image twice.

Pitch

'rgbd_array' should be added as a valid render_mode that returns both the RGB and the depth image, either separately or as a single 4-channel image (which entails converting the RGB data from uint8 to float32, and back again on the user side).
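The trade-off between the two return formats can be sketched with plain NumPy. The helpers `pack_rgbd` and `unpack_rgbd` below are hypothetical names used only for illustration, and random arrays stand in for rendered images. The round trip is lossless because every uint8 value is exactly representable in float32, but the packed array is more than twice as large as the two separate arrays.

```python
import numpy as np


def pack_rgbd(rgb, depth):
    """Stack uint8 RGB (H, W, 3) and float32 depth (H, W) into float32 (H, W, 4)."""
    return np.concatenate(
        [rgb.astype(np.float32), depth[..., np.newaxis].astype(np.float32)],
        axis=-1,
    )


def unpack_rgbd(rgbd):
    """Split a packed (H, W, 4) array back into uint8 RGB and float32 depth."""
    rgb = rgbd[..., :3].astype(np.uint8)
    depth = rgbd[..., 3]
    return rgb, depth


# Random stand-ins for a rendered 480x480 RGB image and depth map.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(480, 480, 3), dtype=np.uint8)
depth = rng.random((480, 480), dtype=np.float32)

rgbd = pack_rgbd(rgb, depth)
rgb_back, depth_back = unpack_rgbd(rgbd)

print(rgbd.shape, rgbd.dtype)
print("packed bytes:  ", rgbd.nbytes)                # 16 bytes per pixel
print("separate bytes:", rgb.nbytes + depth.nbytes)  # 7 bytes per pixel
```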

Alternatives

No response

Additional context

I've already implemented a solution in my fork that returns both images separately.

Here's a comparison of execution times produced by the following script:

import timeit
import numpy as np

setup_render_twice = """
import gymnasium as gym
env = gym.make('Pusher-v5', render_mode='rgb_array')
env.reset()
"""

render_twice = """
rgb = env.get_wrapper_attr('mujoco_renderer').render('rgb_array')
depth = env.get_wrapper_attr('mujoco_renderer').render('depth_array')
"""

setup_render_once = """
import gymnasium as gym
env = gym.make('Pusher-v5', render_mode='rgbd_array')
env.reset()
"""

render_once = """
rgb, depth = env.render()
"""

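# Each sample is the total time in seconds for `number` executions of stmt;
# the setup code runs once per sample and is not timed.
times_twice = timeit.repeat(setup=setup_render_twice, stmt=render_twice, number=10, repeat=10000)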
print(f"Render twice:\n\tmean: {np.mean(times_twice):.4f}+-{np.std(times_twice):.4f}\n\tmin: {np.min(times_twice):.4f}\n\tmax: {np.max(times_twice):.4f}\n")

times_once = timeit.repeat(setup=setup_render_once, stmt=render_once, number=10, repeat=10000)
print(f"RGB-D:\n\tmean: {np.mean(times_once):.4f}+-{np.std(times_once):.4f}\n\tmin: {np.min(times_once):.4f}\n\tmax: {np.max(times_once):.4f}\n")

Results (seconds per 10 executions of the timed statement):

Render twice:
        mean: 0.0301+-0.0018
        min: 0.0283
        max: 0.0633

RGB-D:
        mean: 0.0136+-0.0010
        min: 0.0127
        max: 0.0246

pseudo-rnd-thoughts commented 2 days ago

@DavidPL1 Thanks for the proposal. @Kallinteris-Andreas will know best, but I think this sounds like a reasonable idea. Having a quick look at your implementation, currently you return a tuple of the rgb and the depth. However as they have the same shape, it would be easier to concatenate the results as rgbd of (480, 480, 4)

DavidPL1 commented 2 days ago

> However as they have the same shape, it would be easier to concatenate the results as rgbd of (480, 480, 4)

Correct, however, the rgb image is of type np.uint8 while the depth image is of type np.float32. So this is more of a design choice: either keep both images usable exactly as they are returned by the individual render modes, or fit them neatly into a single array and require users to convert rgb back from float32 to uint8.

Kallinteris-Andreas commented 2 days ago

I am not familiar with how RGB-D inputs are handled in CNNs. We should choose whatever output format is most convenient as CNN input.
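For reference, one common convention (not something decided in this thread) is to feed a CNN a channels-first float32 array with all channels scaled to [0, 1]. The sketch below shows how a separately returned (rgb, depth) pair could be converted to that layout. The function name `rgbd_to_cnn_input` and the `max_depth` clipping range are hypothetical, and synthetic arrays stand in for rendered images.

```python
import numpy as np


def rgbd_to_cnn_input(rgb, depth, max_depth=10.0):
    """Build a channels-first float32 array of shape (4, H, W) in [0, 1].

    `max_depth` is a hypothetical clipping range chosen for illustration,
    not a value from the proposal.
    """
    rgb_scaled = rgb.astype(np.float32) / 255.0
    depth_scaled = np.clip(depth, 0.0, max_depth).astype(np.float32) / max_depth
    stacked = np.concatenate([rgb_scaled, depth_scaled[..., np.newaxis]], axis=-1)
    # Move channels to the front: (H, W, 4) -> (4, H, W).
    return np.transpose(stacked, (2, 0, 1))


# Synthetic stand-ins: a pure red image and a constant depth of 5.0.
rgb = np.zeros((480, 480, 3), dtype=np.uint8)
rgb[..., 0] = 255
depth = np.full((480, 480), 5.0, dtype=np.float32)

cnn_input = rgbd_to_cnn_input(rgb, depth)
print(cnn_input.shape, cnn_input.dtype)
```

Either return format supports this conversion; with a tuple, the uint8 RGB image stays directly usable for display or saving without a cast.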

pseudo-rnd-thoughts commented 2 days ago

> the rgb image is of type np.uint8 while the depth image is of type np.float32.

Ahh, that makes sense. Let's return a tuple, as in your implementation.

@DavidPL1 could you make a PR with your branch?