Farama-Foundation / MO-Gymnasium

Multi-objective Gymnasium environments for reinforcement learning
http://mo-gymnasium.farama.org/
MIT License

Inconsistent observation dimension for fishwood #102

Closed: timondesch closed this issue 3 weeks ago

timondesch commented 3 weeks ago

I noticed, while trying to run the MORL-Baselines implementation of EUPG on environments other than fishwood, that there seems to be an inconsistency in the observation shape of fishwood compared to the other MO-Gymnasium environments, and another in the action function of EUPG compared to the other MORL-Baselines algorithms. Sorry if I am missing something that should be obvious!

In short:

The cause seems to be that the observations generated by the fishwood environment are 0-dimensional arrays, whereas, following the convention used by the other environments, they should be 1-dimensional arrays of length 1. To compensate, EUPG increases the dimensionality of the received observations by one before passing them to the action function, which makes it work for fishwood but breaks it for every other environment.

We can see the discrepancy in the dimensions of the observation spaces with:

import mo_gymnasium as mo_gym

print(mo_gym.make("fishwood-v0").observation_space.shape)
print(mo_gym.make("deep-sea-treasure-v0").observation_space.shape)
print(mo_gym.make("resource-gathering-v0").observation_space.shape)

which returns:

()
(2,)
(4,)
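
The same mismatch shows up in the observations themselves. Here is a minimal check (a sketch on my side; it assumes, per the Gymnasium API, that reset() returns observations whose shape mirrors observation_space):

import mo_gymnasium as mo_gym
import numpy as np

for env_id in ("fishwood-v0", "deep-sea-treasure-v0"):
    env = mo_gym.make(env_id)
    obs, _info = env.reset(seed=0)
    # According to the report above, fishwood should print () and deep-sea-treasure (2,)
    print(env_id, np.asarray(obs).shape)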

Similarly, the following test, adapted from the EUPG and MOQL examples,

import mo_gymnasium as mo_gym
import numpy as np
import torch as th
from mo_gymnasium.utils import MORecordEpisodeStatistics

from morl_baselines.single_policy.esr.eupg import EUPG
from morl_baselines.single_policy.ser.mo_q_learning import MOQLearning

if __name__ == "__main__":

    def scalarization(reward: np.ndarray, w=None):
        reward = th.tensor(reward) if not isinstance(reward, th.Tensor) else reward
        # Handle the case when reward is a single tensor of shape (2, )
        if reward.dim() == 1 and reward.size(0) == 2:
            return min(reward[0], reward[1] // 2).item()

        # Handle the case when reward is a tensor of shape (200, 2)
        elif reward.dim() == 2 and reward.size(1) == 2:
            return th.min(reward[:, 0], reward[:, 1] // 2)

    envs = [mo_gym.make("fishwood-v0"), mo_gym.make("deep-sea-treasure-v0")]

    for env in envs:
        agents = [
            EUPG(env, scalarization=scalarization, log=False),
            MOQLearning(env, scalarization=scalarization, log=False),
        ]
        for agent in agents:
            try:
                agent.train(total_timesteps=1, start_time=0)
                print(f"{agent.__class__.__name__} succeeded on {env.unwrapped.spec.id}")
            except Exception as e:
                print(f"{agent.__class__.__name__} failed on {env.unwrapped.spec.id}")

returns:

EUPG succeeded on fishwood-v0
MOQLearning failed on fishwood-v0
/home/timon/miniconda3/envs/issue_morl_baslines/lib/python3.11/site-packages/morl_baselines/single_policy/esr/eupg.py:295: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  action = self.__choose_action(th.Tensor([obs]).to(self.device), accrued_reward_tensor)
EUPG failed on deep-sea-treasure-v0
MOQLearning succeeded on deep-sea-treasure-v0

Following what I said above, I think changes need to be made in the source files of both fishwood and EUPG. I have opened a duplicate of this issue in MORL-Baselines, and if both sides give me the green light I can investigate further, confirm the cause, and open a PR with a fix.

In MO-Gymnasium

The following line: https://github.com/Farama-Foundation/MO-Gymnasium/blob/7087d48280ff715dac46a531702903b9aa71f986/mo_gymnasium/envs/fishwood/fishwood.py#L58 should be changed to:

self.observation_space = spaces.Box(low=0, high=1, shape=(1,), dtype=np.int32)
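
If the space is changed this way, the observations returned by reset() and step() would presumably also need to be wrapped into length-1 arrays so that they match the new shape. A rough sketch of the idea (the helper name and the state argument are hypothetical, not taken from fishwood.py):

import numpy as np

def _make_obs(current_state: int) -> np.ndarray:
    # Hypothetical helper: wrap the scalar fishwood state into a (1,)-shaped
    # int32 array matching spaces.Box(low=0, high=1, shape=(1,), dtype=np.int32).
    return np.array([current_state], dtype=np.int32)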

In MORL-Baselines

The following line: https://github.com/LucasAlegre/morl-baselines/blob/7c2d96a66a9ba8b407558db68aeac3e66858ebc0/morl_baselines/single_policy/esr/eupg.py#L295 should be changed to:

action = self.__choose_action(th.Tensor(obs).to(self.device), accrued_reward_tensor)

in order to mimic what the other algorithms do, where the observation is passed to the action function as-is (e.g. MO-Q-learning, CAPQL, ...).
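
For intuition, here is what the two constructions do to a 1-dimensional observation such as the one returned by deep-sea-treasure-v0 (only an illustration of the shapes involved, not code from EUPG):

import numpy as np
import torch as th

obs = np.array([3.0, 1.0], dtype=np.float32)  # a (2,)-shaped observation

# Current behaviour: wrapping the observation in a list adds an extra dimension
# and triggers the "list of numpy.ndarrays" warning quoted in the output above.
print(th.Tensor([obs]).shape)  # torch.Size([1, 2])

# Proposed behaviour: pass the observation through unchanged.
print(th.Tensor(obs).shape)    # torch.Size([2])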

I also noticed that self.experiment_name is not set in MOQLearning; I can include that in the PR if need be.

LucasAlegre commented 3 weeks ago

See https://github.com/LucasAlegre/morl-baselines/issues/118#issuecomment-2334609858

LucasAlegre commented 3 weeks ago

Fixed in #103