HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Trained reward function outputs constant zeros using MCE algorithm #808

Closed: spearsheep closed this issue 7 months ago

spearsheep commented 9 months ago

Bug description

Hi, I'm currently adapting the Inverse Reinforcement Learning algorithm to analyze the behavior of mice in our lab studies. For this, I have used the Maximum Causal Entropy (MCE) algorithm provided in the Imitation library. Unfortunately, I've encountered an issue where the trained reward function consistently outputs zero, as indicated by the statistics below:

Imitation stats: {
    'n_traj': 819,
    'monitor_return_len': 819,
    'return_min': 0.0,
    'return_mean': 0.0,
    'return_std': 0.0,
    'return_max': 0.0,
    'len_min': 1,
    'len_mean': 6.1953601953601956,
    'len_std': 7.617115100395745,
    'len_max': 51,
    'monitor_return_min': 0,
    'monitor_return_mean': 0.0,
    'monitor_return_std': 0.0,
    'monitor_return_max': 0,
}

This issue may be related to my observation that the gradient norms increase rather than converge during training. For context, my environment has 50 discrete states and 8 possible discrete actions. The observation space is identical to the state space: each discrete state encodes characteristics such as reaching velocity and reaching outcome. My transition matrix has shape (50, 8, 50), and the observation matrix is a one-hot encoded matrix of shape (50, 50). My trajectories come from lab data and do not include rewards. I chose to work with discrete spaces because of the library's limitations on continuous spaces (though I do want to try the AIRL algorithm later).

I've done my best to troubleshoot the issue, but my efforts have yet to yield a solution. I would greatly appreciate your insights into why this might be happening.
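For reference, this is roughly how I wire everything into the library (a simplified sketch, not my exact script: `demonstrations` stands in for my lab trajectories, the learning rate, log interval, and iteration count are placeholder values, and the exact constructor arguments may differ slightly between imitation versions):

import numpy as np
from imitation.algorithms.mce_irl import MCEIRL
from imitation.rewards import reward_nets

rng = np.random.default_rng(0)

# My custom tabular environment (construction shown below); it exposes
# transition_matrix, observation_matrix, horizon, and initial_state_dist.
env = MouseBehaviorDiscreteEnv(transition_prob_df, initial_state_dist)

# Reward net over the one-hot state observation only
# (no action, next-state, or done inputs).
reward_net = reward_nets.BasicRewardNet(
    env.observation_space,
    env.action_space,
    use_action=False,
    use_next_state=False,
    use_done=False,
)

mce_irl = MCEIRL(
    demonstrations,  # my mouse trajectories (they carry no reward signal)
    env,
    reward_net,
    rng=rng,
    optimizer_kwargs={"lr": 0.01},
    log_interval=100,
)
mce_irl.train(max_iter=1000)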

Steps to reproduce

It would be hard to share the full code, but I can show my environment construction:

import gym
import numpy as np
from gym.spaces import Discrete


class MouseBehaviorDiscreteEnv(gym.Env):
    def __init__(self, transition_prob_df, initial_state_dist):
        super(MouseBehaviorDiscreteEnv, self).__init__()

        # Action and state/observation space sizes
        n_actions = transition_prob_df['action'].nunique()
        n_states = transition_prob_df['state'].nunique()

        self.state_dim = n_states
        self.action_dim = n_actions

        # Keep the empirical transition probabilities DataFrame
        self.transition_prob_df = transition_prob_df

        self.action_space = Discrete(n_actions)
        self.state_space = Discrete(n_states)
        self.observation_space = Discrete(n_states)  # identical to the state space

        # Build the transition matrix of shape (n_states, n_actions, n_states)
        self.transition_matrix = np.zeros((n_states, n_actions, n_states))
        self.initial_state_dist = initial_state_dist

        for _, row in transition_prob_df.iterrows():
            s = int(row['state'])
            a = int(row['action'])
            s_prime = int(row['next_state'])
            prob = row['probability']
            self.transition_matrix[s, a, s_prime] = prob

        # Observation matrix: one-hot encoding of each state, shape (n_states, n_states)
        self.observation_matrix = np.zeros((n_states, n_states))
        for i in range(n_states):
            self.observation_matrix[i, i] = 1

        # Reshape or slice the observation matrix here if needed, e.g.:
        # self.observation_matrix = self.observation_matrix[:, :, 0]

        self.horizon = 100

        # Initialize state
        self.reset()

    def reset(self):
        # Initialize to a random state (or however you prefer)
        self.state = self.transition_prob_df.state.sample(1).iloc[0]
        return self.state

    def step(self, action):
        # Filter the DataFrame for rows matching the current state and action
        applicable_rows = self.transition_prob_df[
            (self.transition_prob_df['state'] == self.state)
            & (self.transition_prob_df['action'] == action)
        ]

        print(f"Applicable rows for state {self.state} and action {action}:")
        print(applicable_rows)

        # If there are no valid transitions, terminate the episode
        if applicable_rows.empty:
            return self.state, 0, True, {}

        # Probabilities over possible next states
        probabilities = applicable_rows['probability'].values

        # Sample the next state according to those probabilities
        next_state = np.random.choice(applicable_rows['next_state'], p=probabilities)

        # Update state, reward, and done flag
        self.state = next_state
        reward = 0  # You can customize this
        done = False  # You can customize this

        return self.state, reward, done, {}
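For completeness, this is roughly how I instantiate it. The toy DataFrame below is only illustrative (my real table has 50 states, 8 actions, and outgoing probabilities that sum to 1 for every state-action pair):

import numpy as np
import pandas as pd

# Toy stand-in for my empirical transition table: one row per
# (state, action, next_state) with its observed probability.
transition_prob_df = pd.DataFrame({
    "state":       [0, 0, 1, 1],
    "action":      [0, 0, 1, 1],
    "next_state":  [0, 1, 0, 1],
    "probability": [0.3, 0.7, 0.5, 0.5],
})
initial_state_dist = np.array([0.5, 0.5])

env = MouseBehaviorDiscreteEnv(transition_prob_df, initial_state_dist)
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())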

Environment

ernestum commented 7 months ago

This will be hard to reproduce without your training data. Also, you will most probably need to do hyperparameter tuning to get any meaningful results.
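As a starting point, something along these lines, reusing the setup from your sketch above (the values are illustrative only, useful ranges depend entirely on your data, and keyword names such as `discount` may vary slightly between imitation versions):

# Hypothetical sweep over the optimizer learning rate.
for lr in (1e-3, 1e-2, 1e-1):
    reward_net = reward_nets.BasicRewardNet(
        env.observation_space,
        env.action_space,
        use_action=False,
        use_next_state=False,
        use_done=False,
    )
    mce_irl = MCEIRL(
        demonstrations,
        env,
        reward_net,
        rng=rng,
        optimizer_kwargs={"lr": lr},
        discount=0.99,
        log_interval=250,
    )
    mce_irl.train(max_iter=500)
    # Compare the learned reward and gradient-norm behaviour across runs.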