Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

Issue with vector based observation setup for DreamerV3 #245

Closed: LucaVendruscolo closed this issue 6 months ago

LucaVendruscolo commented 6 months ago

Hello, I'm trying to implement a simple program that learns from a Pygame game I created. The plan is to then move on to beating an IRL labyrinth marble maze game, which I currently have working with PPO, but PPO just learns too slowly. I've tried to follow #162, but it's too complicated for me. Any advice on what to do would be greatly appreciated!

Logs of the error:

```shell
(RL13) C:\Users\lucav\Downloads\RunningSheepRL\sheeprl>python sheeprl.py exp=dreamer_v3 env=BallGame
CONFIG
├── algo
│ └── name: dreamer_v3
│ total_steps: 5000000
│ per_rank_batch_size: 16
│ run_test: true
│ cnn_keys:
│ encoder:
│ - rgb
│ decoder:
│ - rgb
│ mlp_keys:
│ encoder: []
│ decoder: []
│ world_model:
│ optimizer:
│ _target_: torch.optim.Adam
│ lr: 0.0001
│ eps: 1.0e-08
│ weight_decay: 0
│ betas:
│ - 0.9
│ - 0.999
│ discrete_size: 32
│ stochastic_size: 32
│ kl_dynamic: 0.5
│ kl_representation: 0.1
│ kl_free_nats: 1.0
│ kl_regularizer: 1.0
│ continue_scale_factor: 1.0
│ clip_gradients: 1000.0
│ encoder:
│ cnn_channels_multiplier: 32
│ cnn_act: torch.nn.SiLU
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ recurrent_model:
│ recurrent_state_size: 512
│ layer_norm: true
│ dense_units: 512
│ transition_model:
│ hidden_size: 512
│ dense_act: torch.nn.SiLU
│ layer_norm: true
│ representation_model:
│ hidden_size: 512
│ dense_act: torch.nn.SiLU
│ layer_norm: true
│ observation_model:
│ cnn_channels_multiplier: 32
│ cnn_act: torch.nn.SiLU
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ reward_model:
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ bins: 255
│ discount_model:
│ learnable: true
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ actor:
│ optimizer:
│ _target_: torch.optim.Adam
│ lr: 8.0e-05
│ eps: 1.0e-05
│ weight_decay: 0
│ betas:
│ - 0.9
│ - 0.999
│ cls: sheeprl.algos.dreamer_v3.agent.Actor
│ ent_coef: 0.0003
│ min_std: 0.1
│ init_std: 0.0
│ objective_mix: 1.0
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ clip_gradients: 100.0
│ expl_amount: 0.0
│ expl_min: 0.0
│ expl_decay: false
│ max_step_expl_decay: 0
│ moments:
│ decay: 0.99
│ max: 1.0
│ percentile:
│ low: 0.05
│ high: 0.95
│ critic:
│ optimizer:
│ _target_: torch.optim.Adam
│ lr: 8.0e-05
│ eps: 1.0e-05
│ weight_decay: 0
│ betas:
│ - 0.9
│ - 0.999
│ dense_act: torch.nn.SiLU
│ mlp_layers: 2
│ layer_norm: true
│ dense_units: 512
│ target_network_update_freq: 1
│ tau: 0.02
│ bins: 255
│ clip_gradients: 100.0
│ gamma: 0.996996996996997
│ lmbda: 0.95
│ horizon: 15
│ train_every: 16
│ learning_starts: 65536
│ per_rank_pretrain_steps: 1
│ per_rank_gradient_steps: 1
│ per_rank_sequence_length: 64
│ layer_norm: true
│ dense_units: 512
│ mlp_layers: 2
│ dense_act: torch.nn.SiLU
│ cnn_act: torch.nn.SiLU
│ unimix: 0.01
│ hafner_initialization: true
│ decoupled_rssm: false
│ player:
│ discrete_size: 32
│
├── buffer
│ └── size: 1000000
│ memmap: true
│ validate_args: false
│ from_numpy: false
│ checkpoint: false
│
├── checkpoint
│ └── every: 100000
│ resume_from: null
│ save_last: true
│ keep_last: 5
│
├── env
│ └── id: pygame_ball_env
│ num_envs: 4
│ frame_stack: 1
│ sync_env: false
│ screen_size: 64
│ action_repeat: 1
│ grayscale: false
│ clip_rewards: false
│ capture_video: false
│ frame_stack_dilation: 1
│ max_episode_steps: null
│ reward_as_observation: false
│ wrapper:
│ _target_: sheeprl.envs.pygame_ball_env.PygameBallEnv
│
├── fabric
│ └── _target_: lightning.fabric.Fabric
│ devices: 1
│ num_nodes: 1
│ strategy: auto
│ accelerator: cpu
│ precision: 32-true
│ callbacks:
│ - _target_: sheeprl.utils.callback.CheckpointCallback
│ keep_last: 5
│
└── metric
    └── log_every: 5000
    disable_timer: false
    log_level: 1
    sync_on_compute: false
    aggregator:
    _target_: sheeprl.utils.metric.MetricAggregator
    raise_on_missing: false
    metrics:
    Rewards/rew_avg:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Game/ep_len_avg:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/world_model_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/value_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/policy_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/observation_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/reward_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/state_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Loss/continue_loss:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    State/kl:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    State/post_entropy:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    State/prior_entropy:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Params/exploration_amount:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Grads/world_model:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Grads/actor:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    Grads/critic:
    _target_: torchmetrics.MeanMetric
    sync_on_compute: false
    logger:
    _target_: lightning.fabric.loggers.TensorBoardLogger
    name: 2024-03-28_02-25-07_dreamer_v3_pygame_ball_env_42
    root_dir: logs/runs/dreamer_v3/pygame_ball_env
    version: null
    default_hp_metric: true
    prefix: ''
    sub_dir: null
Seed set to 42
C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\utils\logger.py:22: UserWarning: The specified root directory for the TensorBoardLogger is different from the experiment one, so the logger one will be ignored and replaced with the experiment root directory
  warnings.warn(
Log dir: logs\runs\dreamer_v3/pygame_ball_env\2024-03-28_02-25-07_dreamer_v3_pygame_ball_env_42\version_0
C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\gym\spaces\box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Environment reset for episode 2
Error executing job with overrides: ['exp=dreamer_v3', 'env=BallGame']
Traceback (most recent call last):
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 347, in run
    run_algorithm(cfg)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 186, in run_algorithm
    fabric.launch(reproducible(command), cfg, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 839, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 925, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 930, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 182, in wrapper
    return func(fabric, cfg, *args, **kwargs)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\algos\dreamer_v3\dreamer_v3.py", line 407, in main
    envs = vectorized_env(
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\gymnasium\vector\async_vector_env.py", line 105, in __init__
    dummy_env = env_fns[0]()
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\envs\wrappers.py", line 83, in __init__
    super().__init__(self._env_fn())
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\utils\env.py", line 149, in thunk
    raise ValueError(
ValueError: The user specified keys `['rgb']` are not a subset of the environment `KeysView(Dict('motor_positions': Box(-inf, inf, (2,), float32), 'position': Box(0.0, [640. 480.], (2,), float32), 'velocity': Box(-60.0, 60.0, (2,), float32)))` observation keys. Please check your config file.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(RL13) C:\Users\lucav\Downloads\RunningSheepRL\sheeprl>
```
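
The ValueError comes from sheeprl's key selection: the default `dreamer_v3` experiment asks for the image key `rgb` via `algo.cnn_keys.encoder`, while the environment only exposes the vector keys `motor_positions`, `position` and `velocity`. As a quick sanity check before launching a run, the available keys can be printed directly from the environment (a minimal sketch, assuming `sheeprl.envs.pygame_ball_env` is importable exactly as configured in the wrapper `_target_` above):

```python
# Minimal sketch: list the observation keys that algo.cnn_keys / algo.mlp_keys
# must be chosen from. Assumes the env module is on the path as configured in
# the wrapper _target_ of the config dump above.
from sheeprl.envs.pygame_ball_env import PygameBallEnv

env = PygameBallEnv()
print(list(env.observation_space.keys()))
# expected here: ['motor_positions', 'position', 'velocity']
env.close()
```

Those keys would then go into the vector (MLP) selection instead of the CNN one, e.g. with CLI overrides along the lines of `algo.cnn_keys.encoder=[] algo.mlp_keys.encoder=[position,velocity,motor_positions]` (the option paths are taken from the CONFIG dump above; the exact override values are an assumption, not verified against this setup).
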
My pygame_ball_env.py:

```python
import pygame
import numpy as np
import sys

# Initialize Pygame
pygame.init()

# Screen dimensions
SCREEN_WIDTH, SCREEN_HEIGHT = 640, 480

# Overlay Surface
overlay_surface = pygame.Surface((SCREEN_WIDTH, SCREEN_HEIGHT), pygame.SRCALPHA)

# Colors
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
RED = (255, 0, 0)

# Ball properties
start_pos = np.array([SCREEN_WIDTH // 2.0, SCREEN_HEIGHT // 2.0])
ball_pos = np.array(start_pos, copy=True)
ball_radius = 20
velocity = np.array([0.0, 0.0])  # Starting speed
acceleration = np.array([0.0, 0.0])  # Acceleration
mask_array = np.zeros((SCREEN_HEIGHT, SCREEN_WIDTH), dtype=np.uint8)
previous_total = 0
current_total = 0

# Red bars (including borders)
red_bars = [
    pygame.Rect(0, 0, SCREEN_WIDTH, 10),  # Top border
    pygame.Rect(0, SCREEN_HEIGHT - 10, SCREEN_WIDTH, 10),  # Bottom border
    pygame.Rect(0, 0, 10, SCREEN_HEIGHT),  # Left border
    pygame.Rect(SCREEN_WIDTH - 10, 0, 10, SCREEN_HEIGHT),  # Right border
    # Add more red bars as needed
    pygame.Rect(200, 300, 340, 20),  # Example red bar in the middle
    pygame.Rect(200, 150, 240, 20),  # Example red bar in the middle
]

# Frames per second


def reset_game():
    global ball_pos, velocity, overlay_surface, current_total, previous_total, mask_array
    ball_pos = np.array(start_pos, copy=True)
    velocity = np.array([0.0, 0.0])
    overlay_surface.fill((0, 0, 0, 0))  # Clear overlay_surface
    mask_array = np.zeros((SCREEN_HEIGHT, SCREEN_WIDTH), dtype=np.uint8)
    previous_total = 0
    current_total = 0


def update_ball():
    global velocity, ball_pos, mask_array
    # print("BEFORE sum", np.sum(mask_array), "ball pos", np.array2string(ball_pos, formatter={'float_kind': lambda x: f"{x:.1f}"}))
    velocity += acceleration
    ball_pos += velocity * 100.0
    # print(acceleration)
    ball_pos = np.clip(ball_pos, ball_radius, np.array([SCREEN_WIDTH, SCREEN_HEIGHT]) - ball_radius)
    velocity *= (ball_pos >= ball_radius) & (ball_pos <= np.array([SCREEN_WIDTH, SCREEN_HEIGHT]) - ball_radius)
    mask_array = update_mask(ball_pos, 20, mask_array)
    # Check collision with red bars
    # print("AFTER sum", np.sum(mask_array), "ball pos", np.array2string(ball_pos, formatter={'float_kind': lambda x: f"{x:.1f}"}))


def draw_overlay():
    pygame.draw.circle(overlay_surface, WHITE, ball_pos.astype(int), ball_radius)


def update_mask(ball_pos, radius, mask_array):
    # Create a grid of x, y coordinates
    y, x = np.ogrid[-radius: radius + 1, -radius: radius + 1]
    # Create a mask for pixels within the ball's radius
    mask = x**2 + y**2 <= radius**2
    # Ball's position rounded to the nearest integer
    center_x, center_y = np.round(ball_pos).astype(int)
    # Determine the bounds for the mask application
    mask_x_min = max(center_x - radius, 0)
    mask_x_max = min(center_x + radius + 1, mask_array.shape[1])
    mask_y_min = max(center_y - radius, 0)
    mask_y_max = min(center_y + radius + 1, mask_array.shape[0])
    # Determine the bounds for the mask itself (to handle edges)
    mask_bounds_x_min = radius - (center_x - mask_x_min)
    mask_bounds_x_max = radius + (mask_x_max - center_x)
    mask_bounds_y_min = radius - (center_y - mask_y_min)
    mask_bounds_y_max = radius + (mask_y_max - center_y)
    # Apply the mask to the mask_array
    mask_array[mask_y_min:mask_y_max, mask_x_min:mask_x_max] |= mask[mask_bounds_y_min:mask_bounds_y_max, mask_bounds_x_min:mask_bounds_x_max]
    return mask_array


def calculate_reward(mask_array, previous_total, velocity, min_new_pixels=10):
    # Calculate the current total of visited pixels
    current_total = np.sum(mask_array)
    # Calculate the difference in visited pixels
    new_pixels = current_total - previous_total
    # Define the reward logic based on new_pixels
    # Check if the velocity is above the minimum threshold
    if new_pixels > min_new_pixels:
        movement_reward = 10
    else:
        movement_reward = -1
    # Total reward for the current step
    # reward = exploration_reward + movement_reward
    reward = movement_reward
    return reward, current_total


MOTOR_LIMIT = 60.0
MOTOR_STEP_SIZE = 20.0

# Initial motor positions (representing initial board tilt)
motor_positions = np.array([0.0, 0.0])


def move_motors(target_x, target_y):
    global motor_positions
    # Define maximum movement allowed for each motor
    MAX_MOVE_PER_AXIS = 20
    # Determine the desired move direction for each axis
    desired_move = np.array([target_x, target_y]) - motor_positions[:2]  # Assuming motor_positions[:2] contains x and y positions
    # Limit the move to MAX_MOVE_PER_AXIS per axis
    limited_move = np.clip(desired_move, -MAX_MOVE_PER_AXIS, MAX_MOVE_PER_AXIS)
    # Update motor positions with the limited move, ensure within MOTOR_LIMIT
    motor_positions[:2] += limited_move
    motor_positions[:2] = np.clip(motor_positions[:2], -MOTOR_LIMIT, MOTOR_LIMIT)
    # Ensure the function accounts for acceleration update if necessary
    # Update the ball's acceleration based on the new motor positions
    global acceleration
    acceleration = motor_positions[:2] / MOTOR_LIMIT  # Adjust as necessary for your model
    return motor_positions


import gymnasium as gym
from gym import spaces
import pygame
import numpy as np


class PygameBallEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(PygameBallEnv, self).__init__()
        self.screen_width = 640
        self.screen_height = 480
        self.action_space = spaces.Box(low=np.array([-1.0, -1.0]), high=np.array([1.0, 1.0]), dtype=np.float32)
        # Adjusted observation space to include velocity
        self.observation_space = spaces.Dict({
            'position': spaces.Box(low=np.array([0, 0]), high=np.array([self.screen_width, self.screen_height]), dtype=np.float32),
            'velocity': spaces.Box(low=np.array([-60, -60]), high=np.array([60, 60]), dtype=np.float32),
            'motor_positions': spaces.Box(low=np.array([-np.inf, -np.inf]), high=np.array([np.inf, np.inf]), dtype=np.float32),
        })
        self.world_image = np.zeros((int(self.screen_width), int(self.screen_height), 3))
        self.screen = None
        self.reward = 0
        self.totalreward = 0
        self.all_rewards = []  # To track rewards for each episode
        self.avg_rewards_every_200 = []
        self.clock = pygame.time.Clock()
        self.episode_counter = 1
        # Define additional attributes like ball position, velocity, etc.
        self.reset()

    def check_collision(self):
        ball_rect = pygame.Rect(ball_pos[0] - ball_radius, ball_pos[1] - ball_radius, ball_radius * 2, ball_radius * 2)
        collision = any(bar.colliderect(ball_rect) for bar in red_bars)
        return collision

    def step(self, action):
        global motor_positions, ball_pos, velocity, acceleration, mask_array, current_total, previous_total
        # Apply the action to the environment
        motor_positions = move_motors(action[0], action[1])
        # Update the environment's state
        update_ball()
        previous_total = current_total
        # Update the call to calculate_reward to include velocity
        self.reward, current_total = calculate_reward(mask_array, previous_total, velocity)
        self.totalreward += self.reward
        # Check for episode termination (e.g., ball collision with red bars)
        done = self.check_collision()
        self.render()
        # Construct the observation (state) with ball position, velocity, and motor positions
        observation = {
            'position': ball_pos,
            'velocity': velocity,
            'motor_positions': motor_positions,
        }
        # Optionally, return extra information
        info = {}
        if done:
            self.reward = 0
            self.all_rewards.append(self.totalreward)
        return observation, self.reward, done, info

    def reset(self):
        self.reward = 0
        self.totalreward = 0
        reset_game()
        self.episode_counter += 1
        observation = {
            'position': ball_pos,  # Assuming ball_pos is updated by reset_game() to initial position
            'velocity': velocity,  # Assuming velocity is updated by reset_game() to initial velocity
            'motor_positions': motor_positions,  # Assuming motor_positions are reset by reset_game()
        }
        print(f"Environment reset for episode {self.episode_counter}")
        if self.episode_counter % 2000 == 0 and self.episode_counter != 0:
            # Calculate the average reward for the last 200 episodes
            avg_reward = np.mean(self.all_rewards[-200:])
            self.avg_rewards_every_200.append(avg_reward)
            # Plot the graph
            plt.figure(figsize=(10, 5))
            plt.plot(self.all_rewards, marker='o', linestyle='-')
            plt.title('Average Reward Every 200 Episodes')
            plt.xlabel('200 Episodes Batch')
            plt.ylabel('Average Reward')
            plt.grid(True)
            plt.show(block=False)  # Show the plot without blocking the execution
            plt.pause(3)  # Pause for 3 seconds
            plt.close()
        return observation

    def render(self, mode='human'):
        if self.episode_counter % 2000 + 500 == 0:  # Render only every 100 episodes
            if mode == 'human':
                # Ensure the display is initialized only once and when needed
                global screen
                if self.screen is None:
                    self.screen = pygame.display.set_mode((SCREEN_WIDTH, SCREEN_HEIGHT))
                # Fill the screen with black (or any other background color)
                self.screen.fill(BLACK)
                # Draw the ball and red bars on the overlay surface/screen
                draw_overlay()
                self.screen.blit(overlay_surface, (0, 0))
                for bar in red_bars:
                    pygame.draw.rect(self.screen, RED, bar)
                # Draw the ball directly on the screen (optional if already done in draw_overlay)
                pygame.draw.circle(self.screen, WHITE, ball_pos.astype(int), ball_radius)
                # Update the display
                pygame.display.flip()
                clock = pygame.time.Clock()
                clock.tick(30)
        elif self.screen is not None:
            print("Quitting display...")
            pygame.display.quit()
            self.screen = None
            clock = None

    def close(self):
        pygame.quit()


import os
from matplotlib import pyplot as plt
```
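
One thing worth flagging in the code above: it mixes `import gymnasium as gym` with `from gym import spaces`, `reset()` returns only the observation, and `step()` returns the old 4-tuple `(obs, reward, done, info)`. Since the environments are built through `gymnasium.vector.AsyncVectorEnv` (see the traceback), the gymnasium API applies: `reset(seed=None, options=None)` should return `(obs, info)` and `step()` should return `(obs, reward, terminated, truncated, info)`. Below is a minimal sketch of that shape with a `Dict` observation space; the `_get_obs` helper and the zero observations are placeholders, not the actual game logic:

```python
# Minimal sketch of a gymnasium-conformant env skeleton with a Dict observation
# space. Class name, _get_obs helper and zero observations are illustrative.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class BallEnvSkeleton(gym.Env):
    metadata = {"render_modes": ["human"]}

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Dict(
            {
                "position": spaces.Box(low=np.array([0.0, 0.0]), high=np.array([640.0, 480.0]), dtype=np.float32),
                "velocity": spaces.Box(low=-60.0, high=60.0, shape=(2,), dtype=np.float32),
                "motor_positions": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
            }
        )

    def _get_obs(self):
        # Placeholder: the real env would return the current ball state here,
        # as float32 arrays matching the spaces above.
        return {k: np.zeros(2, dtype=np.float32) for k in self.observation_space.keys()}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self._get_obs(), {}

    def step(self, action):
        obs = self._get_obs()
        reward = 0.0
        terminated = False  # e.g. collision with a red bar
        truncated = False   # e.g. a time limit was hit
        return obs, reward, terminated, truncated, {}
```

With this shape, `AsyncVectorEnv` can construct and step the environment, and sheeprl's key filtering has `Dict` keys to match `algo.mlp_keys.encoder` against.
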
My BallGame.yaml:

```yaml
defaults:
  - default
  - _self_

# Environment ID
id: pygame_ball_env

# Environment-specific parameters (if any)
action_repeat: 1
capture_video: False
reward_as_observation: False

# Environment instantiation details
wrapper:
  _target_: sheeprl.envs.pygame_ball_env.PygameBallEnv
```

Here is the link to the Google Drive with the game set up in SheepRL: https://drive.google.com/file/d/1HTO-xTNNRg689TQCSizniawCX_5vtIkH/view?usp=sharing

I made a BallGame.yaml and a pygame_ball_env.py that contains the game.

LucaVendruscolo commented 6 months ago

I'm now getting this:

```shell
C:\Users\lucav\Downloads\RunningSheepRL\sheeprl>python sheeprl.py exp=dreamer_v3 env=BallGame
Seed set to 42
C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\utils\logger.py:22: UserWarning: The specified root directory for the TensorBoardLogger is different from the experiment one, so the logger one will be ignored and replaced with the experiment root directory
  warnings.warn(
Log dir: logs\runs\dreamer_v3/pygame_ball_env\2024-03-28_03-55-35_dreamer_v3_pygame_ball_env_42\version_0
C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\gym\spaces\box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Environment reset for episode 2
Error executing job with overrides: ['exp=dreamer_v3', 'env=BallGame']
Traceback (most recent call last):
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 347, in run
    run_algorithm(cfg)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 186, in run_algorithm
    fabric.launch(reproducible(command), cfg, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 839, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 925, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\lightning\fabric\fabric.py", line 930, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\cli.py", line 182, in wrapper
    return func(fabric, cfg, *args, **kwargs)
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\algos\dreamer_v3\dreamer_v3.py", line 407, in main
    envs = vectorized_env(
  File "C:\Users\lucav\anaconda3\envs\RL13\lib\site-packages\gymnasium\vector\async_vector_env.py", line 105, in __init__
    dummy_env = env_fns[0]()
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\envs\wrappers.py", line 83, in __init__
    super().__init__(self._env_fn())
  File "C:\Users\lucav\Downloads\RunningSheepRL\sheeprl\sheeprl\utils\env.py", line 143, in thunk
    set(k for k in env.observation_space.keys()).intersection(
AttributeError: 'Box' object has no attribute 'keys'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

I switched to this observation space:

```python
low = np.array([0, 0, -60, -60, -60, -60])
high = np.array([640, 480, 60, 60, 60, 60])

self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)
```

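The `AttributeError: 'Box' object has no attribute 'keys'` follows from that change: the key-selection code in `sheeprl/utils/env.py` iterates over `env.observation_space.keys()`, so it expects a `Dict` observation space rather than a flat `Box`. If the flat 6-element vector is kept, one way around it is to expose the observation under a single `Dict` key (a hedged sketch; the wrapper class and the `state` key name are illustrative, not sheeprl API):

```python
# Hedged sketch: expose the flat Box observation under a single Dict key so the
# key-based selection (e.g. algo.mlp_keys.encoder=[state]) has something to match.
# DictObsWrapper and the "state" key are illustrative names.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class DictObsWrapper(gym.ObservationWrapper):
    def __init__(self, env, key="state"):
        super().__init__(env)
        self._key = key
        self.observation_space = spaces.Dict({key: env.observation_space})

    def observation(self, observation):
        # Pack the flat vector under the chosen key, as float32.
        return {self._key: np.asarray(observation, dtype=np.float32)}
```

The `wrapper._target_` in BallGame.yaml could then point at a small factory that builds `PygameBallEnv` and applies this wrapper, with `algo.mlp_keys.encoder=[state]` in the experiment overrides; alternatively, the original `Dict` observation space from the first traceback can simply be kept.
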
belerico commented 6 months ago

Hi @LucaVendruscolo, have you solved your issue?

LucaVendruscolo commented 6 months ago

Hi @belerico, thanks for asking. I wanted to give it an extra day to try to debug it myself, since I was managing to get different error messages, but I got stuck again today. I have opened a new issue with cleaner code for the game I made, so it should be easier for you to understand what's going on. If you could have a look at that, it would be much appreciated!