Open anirjoshi opened 4 months ago
In particular, I have the following custom-made gym environment and would like to use an RL-based algorithm to solve it. Is this possible? Any help in this regard would be appreciated. Also note that this environment has variable-size inputs.
import gym
from gym.spaces import Discrete, MultiDiscrete, Sequence


class ModuloComputationEnv(gym.Env):
    """Environment in which an agent must learn to output mod 2, 3, and 4 of
    the sum of all observations seen so far.

    Observations are sequences of integers, e.g. (1, 3, 4, 5).
    The action is a vector of 3 values: the running sum of the inputs % 2,
    % 3, and % 4.
    Rewards are r = -abs(self.ac1 - action[0]) - abs(self.ac2 - action[1])
    - abs(self.ac3 - action[2]), for all steps.
    """

    def __init__(self, config):
        # Each element of the input sequence is an integer in [0, 99].
        self.observation_space = Sequence(Discrete(100), seed=2)
        # The action is a vector of 3, [%2, %3, %4], of the sum of the input sequence.
        self.action_space = MultiDiscrete([2, 3, 4])
        self.cur_obs = None
        # Number of steps taken in the current episode.
        self.episode_len = 0
        # Running sum of all observations, mod 2.
        self.ac1 = 0
        # Running sum of all observations, mod 3.
        self.ac2 = 0
        # Running sum of all observations, mod 4.
        self.ac3 = 0

    def reset(self, *, seed=None, options=None):
        """Resets the episode and returns the initial observation of the new one."""
        # Reset the episode length.
        self.episode_len = 0
        # Sample a random sequence from our observation space.
        self.cur_obs = self.observation_space.sample()
        # Take the sum of the initial observation.
        sum_obs = sum(self.cur_obs)
        # Store %2, %3, and %4 of the initial observation.
        self.ac1 = sum_obs % 2
        self.ac2 = sum_obs % 3
        self.ac3 = sum_obs % 4
        # Return the initial observation and an empty info dict.
        return self.cur_obs, {}

    def step(self, action):
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated flag, truncated flag,
            info dict (empty).
        """
        # End the episode (via the `terminated` flag) after 10 steps.
        self.episode_len += 1
        truncated = False
        terminated = self.episode_len >= 10
        # The reward is the negative total distance between the action and the
        # true remainders.
        reward = -(
            abs(self.ac1 - action[0])
            + abs(self.ac2 - action[1])
            + abs(self.ac3 - action[2])
        )
        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()
        # The observation is a sequence, so sum it before folding it into the
        # running %2, %3, and %4 values.
        sum_obs = sum(self.cur_obs)
        self.ac1 = (sum_obs + self.ac1) % 2
        self.ac2 = (sum_obs + self.ac2) % 3
        self.ac3 = (sum_obs + self.ac3) % 4
        return self.cur_obs, reward, terminated, truncated, {}
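For reference, the running-remainder bookkeeping this environment is aiming for can be checked without gym installed. The following is only a minimal sketch: the helper names `initial_targets`, `update_targets`, and `reward` are made up for illustration and are not part of the environment above. It relies on the identity (a + b) % m == ((a % m) + b) % m, so only the previous remainders need to be stored, not the full sum.

```python
MODULI = (2, 3, 4)


def initial_targets(observation):
    """Remainders of the first observation's sum, one per modulus."""
    total = sum(observation)
    return tuple(total % m for m in MODULI)


def update_targets(targets, observation):
    """Fold a new variable-length observation into the running remainders."""
    step_sum = sum(observation)
    return tuple((t + step_sum) % m for t, m in zip(targets, MODULI))


def reward(targets, action):
    """Negative L1 distance between the action and the true remainders."""
    return -sum(abs(t - a) for t, a in zip(targets, action))


targets = initial_targets((1, 3, 4, 5))   # sum is 13
targets = update_targets(targets, (7, 2))  # running sum is 22
```

A perfect action earns a reward of 0, and every unit of error in any of the three components costs 1.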
Hey @anirjoshi !
RNN policies are first-class citizens in Sample Factory. In fact, with the default configuration you will train an RNN (GRU) policy.
See these parameter descriptions in cfg.py or here https://www.samplefactory.dev/02-configuration/cfg-params/:
[--use_rnn USE_RNN] [--rnn_size RNN_SIZE]
[--rnn_type {gru,lstm}]
[--rnn_num_layers RNN_NUM_LAYERS]
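As a rough illustration of how these flags might be passed on the command line, here is a small sketch that assembles them into an argument list. The helper `rnn_flags`, the script name `train_modulo_env.py`, and the environment name `modulo_env` are hypothetical, and the values shown are illustrative rather than the library's defaults.

```python
def rnn_flags(use_rnn=True, rnn_size=256, rnn_type="gru", rnn_num_layers=1):
    """Build the RNN-related command-line flags listed above.

    The values are placeholders for this sketch, not Sample Factory defaults.
    """
    if rnn_type not in ("gru", "lstm"):
        raise ValueError("rnn_type must be 'gru' or 'lstm'")
    return [
        f"--use_rnn={use_rnn}",
        f"--rnn_size={rnn_size}",
        f"--rnn_type={rnn_type}",
        f"--rnn_num_layers={rnn_num_layers}",
    ]


# Hypothetical training script and environment name, for illustration only.
command = ["python", "train_modulo_env.py", "--env=modulo_env"] + rnn_flags(rnn_type="lstm")
print(" ".join(command))
```

Since the RNN is part of the default configuration, these flags should only be needed when changing the recurrent core, for example switching from GRU to LSTM.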
@alex-petrenko Thank you for your response! Is there an example that uses this, so that I can directly adapt it to my environment?
Hi @anirjoshi
Literally any example would work since, again, this is the default configuration.
You can start by reading these tutorials: https://www.samplefactory.dev/03-customization/custom-environments/
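A minimal sketch of what the custom-environment tutorial describes might look like the following. It assumes Sample Factory 2.x, where `register_env`, `parse_sf_args`, `parse_full_cfg`, and `run_rl` are the entry points used in the tutorial. The names `ENV_NAME`, `build_env_factory`, and `train` are made up for this sketch. Whether the default encoder accepts a `Sequence` observation space is not confirmed here; the environment may need to emit a fixed-shape observation (for example one integer per step) and leave the memory to the RNN.

```python
ENV_NAME = "modulo_env"  # hypothetical name, chosen for this sketch


def build_env_factory(env_cls):
    """Wrap an environment class in the factory signature that the
    custom-environment tutorial uses for `register_env`."""

    def make_env(full_env_name, cfg=None, env_config=None, render_mode=None):
        # Assumes the class takes a single `config` argument, as in the
        # environment posted above.
        return env_cls(config={})

    return make_env


def train(env_cls, argv=None):
    """Register the environment and start training (not run on import)."""
    # Imported lazily so this file can be loaded without Sample Factory installed.
    from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
    from sample_factory.envs.env_utils import register_env
    from sample_factory.train import run_rl

    register_env(ENV_NAME, build_env_factory(env_cls))
    parser, _ = parse_sf_args(argv=argv)
    cfg = parse_full_cfg(parser, argv)
    return run_rl(cfg)
```

Training could then be started with something like `train(ModuloComputationEnv, ["--env=modulo_env", "--experiment=modulo_rnn"])`, where the experiment name is arbitrary.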
Is there any example that shows the use of RNN with RL?