Bug description
Hi, I'm currently adapting inverse reinforcement learning to analyze the behavior of mice in our lab studies. For this, I have used the Maximum Causal Entropy (MCE) IRL algorithm provided in the imitation library. Unfortunately, I've encountered an issue where the trained reward function consistently outputs zero, as indicated by the statistics below:
Imitation stats: { 'n_traj': 819, 'monitor_return_len': 819, 'return_min': 0.0, 'return_mean': 0.0, 'return_std': 0.0, 'return_max': 0.0, 'len_min': 1, 'len_mean': 6.1953601953601956, 'len_std': 7.617115100395745, 'len_max': 51, 'monitor_return_min': 0, 'monitor_return_mean': 0.0, 'monitor_return_std': 0.0, 'monitor_return_max': 0 }
This issue may be related to my observation that the gradient norms increase rather than converge during training. For context, my environment has 50 discrete states and 8 discrete actions. The observation space is identical to the state space, which encodes unique discrete values for characteristics such as reaching velocity and reaching outcome. My transition matrix has shape (50, 8, 50), and the observation matrix is a one-hot encoded matrix of shape (50, 50). My trajectories come from lab data and contain no rewards. I've chosen to work with discrete spaces due to the library's limitations on continuous spaces (though I do want to try the AIRL algorithm later).
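To make this concrete, here is a minimal sketch of the shape and stochasticity checks I run on those arrays; env stands for an instance of the environment class shown under Steps to reproduce:

import numpy as np

n_states, n_actions = 50, 8

# Transition matrix: P[s, a, s'] should be a distribution over s' for every (s, a)
assert env.transition_matrix.shape == (n_states, n_actions, n_states)
print("rows sum to 1:", np.allclose(env.transition_matrix.sum(axis=-1), 1.0))

# Observation matrix: one-hot encoding of the state, i.e. the 50 x 50 identity
assert env.observation_matrix.shape == (n_states, n_states)
print("one-hot identity:", np.array_equal(env.observation_matrix, np.eye(n_states)))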
I've done my best to troubleshoot the issue, but my efforts have yet to yield a solution. I would greatly appreciate your insights into why this might be happening.
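For reference, this is roughly how I build the reward network and run the trainer. The call pattern is paraphrased from the library's MCE IRL tutorial, so exact argument names may differ between imitation versions; demos is a stand-in for my list of lab trajectories (no reward signal), and env is the tabular environment described under Steps to reproduce:

import numpy as np
from imitation.algorithms.mce_irl import MCEIRL
from imitation.rewards import reward_nets

rng = np.random.default_rng(0)

# Small MLP reward net over the one-hot state observations only
reward_net = reward_nets.BasicRewardNet(
    env.observation_space,
    env.action_space,
    use_action=False,
    use_next_state=False,
    use_done=False,
)

mce_irl = MCEIRL(
    demos,       # my lab trajectories, without rewards
    env,         # tabular environment exposing transition_matrix / observation_matrix
    reward_net,
    optimizer_kwargs={"lr": 0.01},
    rng=rng,
)
mce_irl.train()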
Steps to reproduce
It will be hard to show the full code, but I can show my environment construction:
import gym
import numpy as np
from gym.spaces import Discrete


class MouseBehaviorDiscreteEnv(gym.Env):
    def __init__(self, transition_prob_df, initial_state_dist):
        super().__init__()
        # Infer the action and state dimensions from the transition DataFrame
        n_actions = transition_prob_df['action'].nunique()
        n_states = transition_prob_df['state'].nunique()
        self.state_dim = n_states
        self.action_dim = n_actions
        # Keep the transition probabilities DataFrame for sampling in step()
        self.transition_prob_df = transition_prob_df
        self.action_space = Discrete(n_actions)
        self.state_space = Discrete(n_states)  # kept alongside observation_space
        self.observation_space = Discrete(n_states)
        # Build the transition matrix of shape (n_states, n_actions, n_states)
        self.transition_matrix = np.zeros((n_states, n_actions, n_states))
        self.initial_state_dist = initial_state_dist
        for _, row in transition_prob_df.iterrows():
            s = int(row['state'])
            a = int(row['action'])
            s_prime = int(row['next_state'])
            prob = row['probability']
            self.transition_matrix[s, a, s_prime] = prob  # (state, action, next_state) indexing
        # One-hot observation matrix: each state observes itself, shape (n_states, n_states)
        self.observation_matrix = np.zeros((n_states, n_states))
        for i in range(n_states):
            self.observation_matrix[i, i] = 1
        self.horizon = 100
        # Initialize state
        self.reset()

    def reset(self):
        # Initialize to a random state (or however you prefer)
        self.state = self.transition_prob_df.state.sample(1).iloc[0]
        return self.state

    def step(self, action):
        # Filter the DataFrame for rows that match the current state and action
        applicable_rows = self.transition_prob_df[
            (self.transition_prob_df['state'] == self.state)
            & (self.transition_prob_df['action'] == action)
        ]
        print(f"Applicable rows for state {self.state} and action {action}:")
        print(applicable_rows)
        # If there is no valid transition, terminate the episode
        if applicable_rows.empty:
            return self.state, 0, True, {}
        # Extract probabilities for the next states
        probabilities = applicable_rows['probability'].values
        # Sample the next state according to those probabilities
        next_state = np.random.choice(applicable_rows['next_state'], p=probabilities)
        # Update state, reward, and done flag
        self.state = next_state
        reward = 0  # the lab trajectories carry no reward
        done = False
        return self.state, reward, done, {}
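For completeness, this is roughly how I smoke-test the environment on its own. The two-state DataFrame below is made up purely for illustration (my real transition_prob_df has 50 states and 8 actions):

import numpy as np
import pandas as pd

# Toy transition table, invented for illustration only
toy_df = pd.DataFrame({
    'state':       [0, 0, 1, 1],
    'action':      [0, 0, 0, 0],
    'next_state':  [0, 1, 0, 1],
    'probability': [0.2, 0.8, 0.5, 0.5],
})
toy_initial_dist = np.array([0.5, 0.5])

env = MouseBehaviorDiscreteEnv(toy_df, toy_initial_dist)
obs = env.reset()
for _ in range(3):
    obs, reward, done, info = env.step(env.action_space.sample())
    print(obs, reward, done)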
This will be hard to reproduce without the training data, and you will most probably need to do hyperparameter tuning to get any meaningful results.
Environment
Operating system and version: macOS 13.5.2 (22G91) on an Apple M2 Pro
Output of pip freeze --all: