agi-brain / xuance

XuanCe: A Comprehensive and Unified Deep Reinforcement Learning Library
https://xuance.readthedocs.io/
MIT License

Support for 2D Observation Spaces in PPO with Torch #54

Closed: Abdullah2020 closed this issue 1 month ago

Abdullah2020 commented 1 month ago

I'm working on a DRL framework using the PPO agent with Torch and noticed a difference in how observation spaces are handled. The example in the documentation defines the observation space as self.observation_space = Box(-np.inf, np.inf, shape=[18, ]), which creates a 1D observation space with shape (18,).

In my custom environment, I defined a 2D observation space like this: self.observation_space = Box(low=-0, high=np.inf, shape=(num_ed, 6), dtype=np.float32). This results in a 2D observation space with a shape of (num_ed, 6).
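
For reference, the two definitions side by side (num_ed = 4 below is just an illustrative value, not the actual setting in my environment):

import numpy as np
from gymnasium.spaces import Box

num_ed = 4  # illustrative value

# 1D observation space, as in the documentation example:
obs_space_1d = Box(-np.inf, np.inf, shape=[18, ])  # shape (18,)

# 2D observation space, as in my custom environment:
obs_space_2d = Box(low=0, high=np.inf, shape=(num_ed, 6), dtype=np.float32)  # shape (num_ed, 6)

print(obs_space_1d.shape, obs_space_2d.shape)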

Basically, my questions are: does the PPO agent support a 2D observation space like this, and if not, how should I handle it in my custom environment?

Any guidance or suggestions would be greatly appreciated. Thank you!

wenzhangliu commented 1 month ago

Currently, the PPO agent in XuanCe supports 2D observation spaces, for example, when an agent takes an image as its observation. You can see the example of PPO for Atari tasks: https://github.com/agi-brain/xuance/blob/master/examples/ppo/ppo_atari.py.

If your observation is 2D but not an image, it is suggested to flatten the original input in your customized environment before returning it from the env.reset() and env.step() methods. Specifically, reshape the observation from (num_ed, 6) to (num_ed*6, ).
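
For example, a minimal sketch of that flattening inside a customized environment (the class MyEnv and the _compute_obs() helper below are hypothetical placeholders, not XuanCe APIs):

import numpy as np
from gymnasium.spaces import Box

class MyEnv:
    def __init__(self, num_ed):
        self.num_ed = num_ed
        # Declare the flattened 1-D space instead of (num_ed, 6).
        self.observation_space = Box(low=0, high=np.inf, shape=(num_ed * 6,), dtype=np.float32)

    def _compute_obs(self):
        # Placeholder for however the environment builds its (num_ed, 6) observation.
        return np.zeros((self.num_ed, 6), dtype=np.float32)

    def reset(self, **kwargs):
        obs_2d = self._compute_obs()           # shape: (num_ed, 6)
        return obs_2d.reshape(-1), {}          # flattened: (num_ed * 6,)

    def step(self, action):
        obs_2d = self._compute_obs()
        reward, terminated, truncated, info = 0.0, False, False, {}
        return obs_2d.reshape(-1), reward, terminated, truncated, info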

Hope this helps. Thank you for supporting XuanCe.

Abdullah2020 commented 1 month ago

I was able to edit the ./representations/mlp.py file and now my PPO DRL model is working perfectly on my custom env with a very cool convergence curve. Thank you for the assistance.

Abdullah2020 commented 1 month ago

Hello @wenzhangliu,

I've transformed my custom environment into a MARL environment using the RawMultiAgentEnv framework. However, I'm encountering a matrix shape mismatch error during the training phase when using the MAPPO algorithm. Specifically, the error occurs at the input to the neural network (NN): my environment produces an input of shape torch.Size([4, 12]), but the NN expects in_features to be 60. There seems to be some multiplication happening between the obs_space and the state_space.

Despite my efforts to debug the issue, I'm having trouble identifying the root cause of the mismatch between the expected and actual tensor shapes. Is there a way I can set the expected input features from the configuration files (.yaml)? I can control the size of some hidden layers, but not the actual NN input. Could you please provide some guidance on this? Also, I have tried running the example for the MARL environment here, but it kept throwing the same mismatch error.

Some details of my environment:


self.state_space = Box(-np.inf, np.inf, shape=[num_ed, 6])
self.observation_space = {agent: Box(low=0, high=np.inf, shape=[num_ed, 6]) for agent in self.agents}  # shape=[num_ed, 6]
self.action_space = {agent: Discrete(n=action_size) for agent in self.agents}
wenzhangliu commented 1 month ago

Hi, if the NN expects in_features to be 60 but gets an input size of 4*12, I guess you have set use_global_state to True in the .yaml file. However, XuanCe does allow that setting in MAPPO. Could you please provide more information about the error printed in the terminal?

Abdullah2020 commented 1 month ago

In my .yaml file, use_global_state is set to False. I tried it both ways, but it still isn't working. Below is the error printout from the terminal:

(xuance_env) [xxxed@xxe06 mappo]$ python mappo_lora.py 
Observations is a list with length: 2
Shape of observation 0: (2, 6)
Shape of observation 1: (2, 6)
Observations is a list with length: 2
Shape of observation 0: (2, 6)
Shape of observation 1: (2, 6)
Creating layer with input shape: 60 and output shape: 64
Creating layer with input shape: 32 and output shape: 64
  0%|                                                                                                                               | 0/5000000 [00:00<?, ?it/s]
Original observation shape: (4, 12)
Reshaped observation to: (4, 12)
Tensor observation shape before MLP: torch.Size([4, 12])
Input shape: torch.Size([4, 12]), Expected in_features: 60
  0%|                                                                                                                               | 0/5000000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mappo_lora.py", line 400, in <module>
    Agent.train(configs.running_steps // configs.parallels)  # Train the model for numerous steps.
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/xuance/torch/agents/core/on_policy_marl.py", line 297, in train
    policy_out = self.action(obs_dict=obs_dict, state=state, avail_actions_dict=avail_actions, test_mode=False)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/xuance/torch/agents/multi_agent_rl/mappo_agents.py", line 134, in action
    rnn_hidden=rnn_hidden_actor)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/xuance/torch/policies/categorical_marl.py", line 118, in forward
    outputs = self.actor_representation[key](observation[key])
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/xuance/torch/representations/mlp.py", line 177, in forward
    output = self.model(tensor_observation)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.conda/envs/xuance_env/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 121, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x12 and 60x64)

Also, here is my .yaml file in case I'm doing something wrong.

dl_toolbox: "torch"  # The deep learning toolbox. Choices: "torch", "mindspore", "tensorlayer"
logger: "tensorboard"  # Choices: tensorboard, wandb.
render: False # Whether to render the environment when testing.
render_mode: 'rgb_array' # Choices: 'human', 'rgb_array'.
fps: 15
test_mode: False
device: "cuda:0"

agent: "MAPPO"  # The agent name.
env_name: "LoRaMultiAgent"  # The environment name.
env_id: "multi_lora_v1"  # The environment id.
continuous_action: False # True  # If to use continuous control.
learner: "MAPPO_Clip_Learner"
policy: "Categorical_MAAC_Policy" # "Gaussian_MAAC_Policy"  # The policy name. choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
representation: "Basic_MLP" #"Basic_MLP"  # The representation name.
vectorize: "DummyVecMultiAgentEnv" # "SubprocVecMultiAgentEnv"  or "DummyVecMultiAgentEnv" # The method to vectorize your environment such that can run in parallel.

# recurrent settings for Basic_RNN representation.
use_rnn: False # False  # If to use recurrent neural network as representation. (The representation should be "Basic_RNN").
rnn: "GRU"  # The type of recurrent layer.
fc_hidden_sizes: [64, 64, 64]  # The hidden size of feed forward layer in RNN representation.
recurrent_hidden_size: 64  # The hidden size of the recurrent layer.
N_recurrent_layers: 1  # The number of recurrent layer.
dropout: 0  # dropout should be a number in range [0, 1], the probability of an element being zeroed.
normalize: "LayerNorm"  # Layer normalization.
initialize: "orthogonal"  # Network initializer.
gain: 0.01

# network settings for the Basic_MLP representation and the actor/critic networks.
representation_hidden_size: [64, ]  # A list of hidden units for each layer of Basic_MLP representation networks.
actor_hidden_size: [64, ]  # A list of hidden units for each layer of actor network.
critic_hidden_size: [64, ]  # A list of hidden units for each layer of critic network.
activation: "relu"  # The activation function of each hidden layer.
activation_action: "sigmoid"  # The activation function for the last layer of the actor.
use_parameter_sharing: True  # If to use parameter sharing for all agents' policies.
use_actions_mask: False  # If to use actions mask for unavailable actions.

seed: 1  # Random seed.
parallels: 2 #16  # The number of environments to run in parallel.
buffer_size: 400  # Number of the transitions (use_rnn is False), or the episodes (use_rnn is True) in replay buffer.
n_epochs: 5 #1  # Number of epochs to train.
n_minibatch: 1  # Number of minibatch to sample and train.  batch_size = buffer_size // n_minibatch.
learning_rate: 0.0007  # Learning rate.
weight_decay: 0  # The weight decay coefficient for the optimizer.

vf_coef: 0.5  # Coefficient factor for critic loss.
ent_coef: 0.01  # Coefficient factor for entropy loss.
target_kl: 0.25  # For MAPPO_KL learner.
clip_range: 0.2  # Ratio clip range, for MAPPO_Clip learner.
clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm().
gamma: 0.99 #0.95  # Discount factor.

# tricks
use_linear_lr_decay: False  # If to use linear learning rate decay.
end_factor_lr_decay: 0.5  # The end factor for learning rate scheduler.
use_global_state: False  # If to use global state to replace merged observations.
use_value_clip: True  # Limit the value range.
value_clip_range: 0.2  # The value clip range.
use_value_norm: True  # Use running mean and std to normalize rewards.
use_huber_loss: True  # True: use huber loss; False: use MSE loss.
huber_delta: 10.0  # The threshold at which to change between delta-scaled L1 and L2 loss. (For huber loss).
use_advnorm: True  # If to use advantage normalization.
use_gae: True  # Use GAE trick.
gae_lambda: 0.95  # The GAE lambda.
use_grad_clip: True  # Gradient normalization.
grad_clip_norm: 10.0  # The max norm of the gradient.

running_steps: 10000000  # The total running steps.
eval_interval: 100000  # The number of running steps between two evaluations.
test_episode: 5  # The episodes to test in each test period.

log_dir: "./logs/mappo/"
model_dir: "./models/mappo/"
wenzhangliu commented 1 month ago

Hi, @Abdullah2020 ,

Based on the information you've provided, it seems you've set the observation shape for each agent to [4*12]. However, I'm not sure why the expected input shape would be 60. In my view, if use_parameter_sharing is set to True, the observation shape should be a 1-D vector with a size of dim_obs + n_agents.

You could try setting a breakpoint at line 118 of '/.conda/envs/xuance_env/lib/python3.7/site-packages/xuance/torch/policies/categorical_marl.py' and printing observation[key].shape and self.actor_representation[key] in the console for more details. This might help you identify the cause of the problem.
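
A quick sketch of that debugging step (using Python's built-in pdb; the file and failing line are taken from your traceback above):

# In xuance/torch/policies/categorical_marl.py, just before the failing line
#     outputs = self.actor_representation[key](observation[key])
import pdb; pdb.set_trace()

# Then, in the pdb console:
# (Pdb) p observation[key].shape           # the actual input, e.g. torch.Size([4, 12])
# (Pdb) p self.actor_representation[key]   # shows the in_features of the first Linear layer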

Abdullah2020 commented 1 month ago

Hi, @wenzhangliu,

Thank you for the feedback. I will work on that. In the meantime, apart from Box and Discrete spaces, does the xuance framework (RawEnvironment or RawMultiAgentEnv) accept spaces like MultiBinary and MultiDiscrete, as seen in Gymnasium? If so, what would the equivalent policy be in the .yaml configuration file?

policy: "???"  # choice: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
wenzhangliu commented 1 month ago

Currently, action spaces such as MultiBinary and MultiDiscrete are not supported.
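
A common general workaround (a sketch of a standard technique, not a XuanCe feature) is to enumerate the MultiDiscrete combinations into a single Discrete space inside your environment and decode the flat index back into sub-actions in step():

from gymnasium.spaces import Discrete

# Sketch: represent a conceptual MultiDiscrete([n1, n2]) as a single Discrete(n1 * n2).
n1, n2 = 4, 3  # illustrative sizes of the two sub-action dimensions
action_space = Discrete(n1 * n2)

def decode_action(flat_index):
    # Map the flat index back into the two sub-actions.
    return flat_index // n2, flat_index % n2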