alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

Using RNN #292

Open anirjoshi opened 4 months ago

anirjoshi commented 4 months ago

Is there any example that shows the use of RNN with RL?

anirjoshi commented 4 months ago

In particular, I have the following custom-made gym environment and would like to use some RL-based algorithm to solve it. Is that possible? Some help in this regard would be appreciated. Also note that this environment has variable-size inputs.

import gymnasium as gym
from gymnasium.spaces import Discrete, MultiDiscrete, Sequence


class ModuloComputationEnv(gym.Env):
    """Environment in which an agent must learn to output mod 2, 3, 4 of the sum of
       the observations seen so far.

    Observations are sequences of integers,
    e.g. (1, 3, 4, 5)

    The action is a vector of 3 values: the running sum of the inputs %2, %3,
    and %4.

    Rewards are r = -abs(self.ac1 - action[0]) - abs(self.ac2 - action[1]) - abs(self.ac3 - action[2])
    on every step.
    """

    def __init__(self, config):

        # the input sequence can contain any integer from 0 to 99
        self.observation_space = Sequence(Discrete(100), seed=2)

        # the action is a vector of 3, [%2, %3, %4], of the sum of the input sequence
        self.action_space = MultiDiscrete([2, 3, 4])

        self.cur_obs = None

        # this variable maintains the episode length
        self.episode_len = 0

        # this variable maintains the running sum %2
        self.ac1 = 0

        # this variable maintains the running sum %3
        self.ac2 = 0

        # this variable maintains the running sum %4
        self.ac3 = 0

    def reset(self, *, seed=None, options=None):
        """Resets the episode and returns the initial observation of the new one."""
        super().reset(seed=seed)

        # Reset the episode length.
        self.episode_len = 0

        # Sample a random sequence from our observation space.
        self.cur_obs = self.observation_space.sample()

        # take the sum of the initial observation
        sum_obs = sum(self.cur_obs)

        # compute %2, %3, and %4 of that sum
        self.ac1 = sum_obs % 2
        self.ac2 = sum_obs % 3
        self.ac3 = sum_obs % 4

        # Return initial observation.
        return self.cur_obs, {}

    def step(self, action):
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated flag, truncated flag, info dict (empty).
        """
        # Set the `truncated` flag after 10 steps.
        self.episode_len += 1
        terminated = False
        truncated = self.episode_len >= 10

        # the reward penalizes how far the action is from the correct mod values
        reward = abs(self.ac1 - action[0]) + abs(self.ac2 - action[1]) + abs(self.ac3 - action[2])
        reward = -reward

        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()

        # recompute the running %2, %3 and %4 values
        # (cur_obs is a sequence, so sum it before updating the running remainders)
        obs_sum = sum(self.cur_obs)
        self.ac1 = (obs_sum + self.ac1) % 2
        self.ac2 = (obs_sum + self.ac2) % 3
        self.ac3 = (obs_sum + self.ac3) % 4

        return self.cur_obs, reward, terminated, truncated, {}

alex-petrenko commented 4 months ago

Hey @anirjoshi !

RNN policies are first-class citizens in Sample Factory. In fact, with the default configuration you will train an RNN (GRU) policy.

See these parameter descriptions in cfg.py or here https://www.samplefactory.dev/02-configuration/cfg-params/:

[--use_rnn USE_RNN] [--rnn_size RNN_SIZE]
[--rnn_type {gru,lstm}]
[--rnn_num_layers RNN_NUM_LAYERS]
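
These are ordinary config parameters, so you can pass them to any training script on the command line (e.g. --use_rnn=True --rnn_type=lstm --rnn_size=128 --rnn_num_layers=1) or build a config programmatically. A minimal sketch below, assuming the standard parse_sf_args / parse_full_cfg helpers from Sample Factory 2.x (exact signatures may differ slightly in your version):

from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args

# the same flags you would pass on the command line to any training script
argv = [
    "--env=CartPole-v1",   # placeholder; --env is required by the parser
    "--use_rnn=True",
    "--rnn_type=lstm",
    "--rnn_size=128",
    "--rnn_num_layers=1",
]

parser, _ = parse_sf_args(argv, evaluation=False)
cfg = parse_full_cfg(parser, argv)
print(cfg.rnn_type, cfg.rnn_size, cfg.rnn_num_layers)  # lstm 128 1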
anirjoshi commented 4 months ago

@alex-petrenko Thank you for your response! Is there any example that uses this, so I can directly adapt it to my environment?

alex-petrenko commented 4 months ago

Hi @anirjoshi

Literally any example would work since, again, this is the default configuration.

You can start by reading these tutorials:

https://www.samplefactory.dev/03-customization/custom-environments/

https://samplefactory.dev/03-customization/custom-models/
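
For reference, the glue code from the custom-environments tutorial applied to your ModuloComputationEnv would look roughly like this. This is a minimal sketch assuming Sample Factory 2.x: the import paths and the env-factory signature may differ slightly in your version, and the env name "modulo_env" is just for illustration.

from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
from sample_factory.envs.env_utils import register_env
from sample_factory.train import run_rl


def make_modulo_env(full_env_name, cfg=None, env_config=None, render_mode=None):
    # NOTE: the Sequence observation space yields variable-length observations;
    # the default encoder expects fixed-size inputs, so you will likely also need
    # a custom encoder (see the custom-models tutorial above).
    return ModuloComputationEnv(config=env_config)


def main():
    register_env("modulo_env", make_modulo_env)
    parser, _ = parse_sf_args()
    cfg = parse_full_cfg(parser)
    return run_rl(cfg)


if __name__ == "__main__":
    main()

Run it with --env=modulo_env plus any of the RNN flags above, e.g. --rnn_type=lstm --rnn_size=128.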