DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Discrepancy between Observations Sampled from Gym Env and Replay Buffer #1909

Closed: AOAA96 closed this issue 2 months ago

AOAA96 commented 2 months ago

šŸ› Bug

The element at index 2 of the next observation is the action selected by the TD3 agent based on the current observation. For example, if the current observation is [0, 0.23, 0.45, 0.85] and the action is 0.55, the next observation would be something like [0.1, 0.25, 0.55, 0.80]. This works as expected when I step through the environment manually; see the attached image "checking_env", where the actions (color-coded for readability) correctly correspond to the element at index 2 of the next observation.

However, the replay buffer samples do not show this. In the samples, the action selected by the agent does not correspond to the element at index 2 of the next observation, and I believe this is affecting the agent's learning; see the attached image "replaybuffer". The element at index 2 of the observation is also always clipped to [0, 1], yet the fourth action in the replay buffer is not clipped to zero in the fourth next_observation.

To rule out noise as the cause of this discrepancy, the action_noise argument of the TD3 model was kept as None.

I would love to share more code, but it involves some sensitive information that I am not yet sure I can share.

[Images attached: checking_env, replaybuffer]
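For reference, a minimal sketch of the kind of comparison described above, assuming a trained TD3 model and a single instance of the custom environment (the names model and env are placeholders, not taken from the original code):

    import numpy as np

    # Step the environment manually and inspect the transition
    obs, _ = env.reset()
    action = np.array([0.55], dtype=np.float32)  # example action
    next_obs, reward, terminated, truncated, info = env.step(action)
    print("env step:", obs, action, next_obs)

    # Sample stored transitions from the TD3 replay buffer for comparison
    batch = model.replay_buffer.sample(5)
    print("buffer actions:", batch.actions.cpu().numpy().ravel())
    print("buffer next_obs[:, 2]:", batch.next_observations.cpu().numpy()[:, 2])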

Code example

Action and observation spaces from __init__.

    # Imports used by the snippets below (gymnasium assumed)
    import random

    import numpy as np
    from gymnasium import spaces

    self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    self.observation_space = spaces.Box(
        low=np.array([0., 0., 0., 0.]),
        high=np.array([1., 1., 1., 1.]),
        shape=(4,),
        dtype=np.float32)

Step and Reset methods.

    def step(self, action):
        # Clip the agent's action to the [0, 1] action space
        action = np.clip(action, 0.0, 1.0, dtype=np.float32)
        el1_new = np.float32(action[0])

        reward = np.float32(0.)
        terminated = False

        # Look up the next time fraction and power demand from the trip profile
        next_time = np.float32(self.profile_data.iloc[self.current_time_idx + 1, 1] / self.max_time)
        next_pd = np.float32(self.profile_data.iloc[self.current_time_idx + 1, 7])

        # Power balance before and after applying the new engine load
        deltap_old = self._engineload * 4770 - self._pdem * 5400.
        deltap_new = el1_new * 4770 - next_pd * 5400.

        # Penalize a power imbalance outside the allowed range
        if deltap_new < -800. or deltap_new > 2000.:
            reward -= np.float32(1.)

        # Update battery state of charge and penalize leaving the [0.30, 0.85] band
        ps_new = np.float32(self.bat_soc(deltap_old, deltap_new, self._batterysoc))
        if ps_new > 0.85 or ps_new < 0.30:
            reward -= np.float32(1.)

        # Emissions-related reward term
        ghg_rew = self.ghg_reward(el1_new, self._engineload)
        reward += ghg_rew

        # Update the internal state; the new engine load is the (clipped) action
        self._tfrac = next_time
        self._pdem = next_pd
        self._engineload = np.clip(el1_new, 0., 1., dtype=np.float32)
        self._batterysoc = np.clip(ps_new, 0., 1., dtype=np.float32)

        observation = np.array([self._tfrac, self._pdem, self._engineload, self._batterysoc], dtype=np.float32)

        self.current_time_idx += 1

        # End of the trip profile: charge the battery back to 0.85 SOC,
        # penalizing each charging step, then terminate the episode
        if next_time == 1.:
            final_soc = self._batterysoc
            while final_soc < 0.85:
                final_soc = np.float32(self.bat_soc(2000., 2000., final_soc))
                reward -= np.float32(60.7 / 64.0)
            terminated = True

        info = self._get_info()
        reward = float(reward)

        return observation, reward, terminated, False, info

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        # Pick a random trip profile for this episode
        self.current_profile = random.sample(self.trips, 1)[0]
        self.profile_data = self.data[self.data['trip_number'] == self.current_profile]

        # Initial state: random battery SOC plus the first row of the trip profile
        soc_init = np.round(random.uniform(0.75, 0.85), 2)
        self._tfrac = 0.
        self._pdem = np.float32(self.profile_data.iloc[0, 7])
        self._engineload = np.float32((self.profile_data.iloc[0, 2] + self.profile_data.iloc[0, 3]) / 4770.)
        self._batterysoc = np.float32(soc_init)
        self.current_time_idx = 0
        self.max_time = max(self.profile_data['index'])

        observation = np.array([self._tfrac, self._pdem, self._engineload, self._batterysoc], dtype=np.float32)
        info = self._get_info()

        return observation, info
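For completeness, a minimal sketch of how this environment might be trained with TD3 under the conditions described above (no action noise). CustomEnv is a hypothetical name for the class the snippets above come from:

    from stable_baselines3 import TD3

    env = CustomEnv()  # hypothetical name for the environment defined above
    model = TD3("MlpPolicy", env, action_noise=None, verbose=1)
    model.learn(total_timesteps=10_000)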

Relevant log output / Error message

No response

System Info

No response

Checklist

AOAA96 commented 2 months ago

@araffin Sorry, I am not clear on what I am missing from the checklist.

araffin commented 2 months ago

The provided code is not minimal or working (please check the link for an explanation), and you should resolve the env checker warnings first.
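For reference, the env checker mentioned above can be run as follows (a minimal sketch; CustomEnv again stands in for the custom environment from the code example):

    from stable_baselines3.common.env_checker import check_env

    env = CustomEnv()  # hypothetical name for the environment defined above
    check_env(env, warn=True)  # reports warnings about non-compliant spaces, dtypes, return values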

AOAA96 commented 2 months ago

Changing the action space to be between -1 and 1 resolved the issue.
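A likely explanation, to the best of my understanding of SB3 internals: off-policy algorithms such as TD3 rescale Box actions, and the replay buffer stores the action in the normalized [-1, 1] range while the environment receives the unscaled action. With a [0, 1] action space the stored action therefore no longer matches the engine-load element written into the next observation; with a symmetric [-1, 1] action space the two coincide. Below is a minimal sketch of the corresponding change, with a hypothetical helper that maps the action back to the [0, 1] engine-load fraction inside step:

    import numpy as np
    from gymnasium import spaces

    # Symmetric, normalized action space (the fix described above)
    action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def to_engine_load(action):
        """Map an action in [-1, 1] back to the [0, 1] engine-load fraction."""
        action = np.clip(action, -1.0, 1.0).astype(np.float32)
        return np.float32(0.5 * (action[0] + 1.0))

    print(to_engine_load(np.array([-1.0], dtype=np.float32)))  # 0.0
    print(to_engine_load(np.array([0.1], dtype=np.float32)))   # 0.55
    print(to_engine_load(np.array([1.0], dtype=np.float32)))   # 1.0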