AI4Finance-Foundation / FinRL

FinRL: Financial Reinforcement Learning. 🔥
https://ai4finance.org

Wrong reward at terminal date when training with SB3 #1228

Open iezfly opened 1 month ago

iezfly commented 1 month ago

Describe the bug For StockTradingEnv, when training with an SB3 VecEnv (created via get_sb_env), no reward is computed at the terminal date (say day X) because the training data does not contain the next day's closing price. But when the SB3 algorithm runs collect_rollouts, it adds the previous reward (day X-1's reward) to the rollout_buffer.
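For context, the terminal branch of StockTradingEnv.step is structured roughly like the sketch below (simplified, not the exact FinRL code; details vary by version). The key point is that self.reward is not recomputed before being returned:

    # Simplified sketch of the existing terminal branch in StockTradingEnv.step
    def step(self, actions):
        self.terminal = self.day >= len(self.df.index.unique()) - 1
        if self.terminal:
            # No trade is executed and self.reward is NOT recomputed here, so the
            # value handed to SB3 on day X is still day X-1's reward.
            return self.state, self.reward, self.terminal, False, {}
        # otherwise: execute buy/sell actions, advance self.day, update self.state,
        # and recompute self.reward from the change in total asset value
        ...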

To Reproduce Steps to reproduce the behavior:

  1. Pick a short training period so the terminal date is reached quickly while debugging, e.g. 5 days in total.
  2. Set a breakpoint after rollout_buffer.add() in on_policy_algorithm.py (SB3 module), line 232.
  3. Step until after day 5 (when the 4th experience is added).
  4. Inspect the rollout_buffer variable: rewards[3] is identical to the previous day's reward. (A minimal repro sketch follows this list.)
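A minimal repro sketch along these lines; train_df (a dataframe produced by FinRL's data_split, indexed by day), env_kwargs, and the PPO hyperparameters are placeholders rather than my exact setup, and the StockTradingEnv import path may differ between FinRL versions:

    from stable_baselines3 import PPO
    from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

    # Keep only 5 trading days so the terminal date is reached almost immediately.
    days = sorted(train_df["date"].unique())[:5]
    short_df = train_df[train_df["date"].isin(days)]

    env = StockTradingEnv(df=short_df, **env_kwargs)  # env_kwargs as in the FinRL tutorials
    vec_env, _ = env.get_sb_env()                     # wraps the env in an SB3 DummyVecEnv

    model = PPO("MlpPolicy", vec_env, n_steps=8, batch_size=8, verbose=1)
    # Break after rollout_buffer.add() in SB3's on_policy_algorithm.py, then inspect
    # model.rollout_buffer.rewards after stepping past day 5: the entry for the
    # terminal step repeats the previous day's reward.
    model.learn(total_timesteps=16)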

Expected behavior After examining the SB3 code, I found it already provides a way to compute the reward when the next state is unobservable within an episode: it adds the discounted value prediction for the terminal observation to the reward. As far as I can tell, this bootstraps the last transition's target as r + gamma * V(s_terminal) instead of assuming zero future return, though I'm not fully sure about the mathematical justification.

            # see GitHub issue #633
            for idx, done in enumerate(dones):
                if (
                    done
                    and infos[idx].get("terminal_observation") is not None
                    and infos[idx].get("TimeLimit.truncated", False)
                ):
                    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
                    with th.no_grad():
                        terminal_value = self.policy.predict_values(terminal_obs)[0]  # type: ignore[arg-type]
                    rewards[idx] += self.gamma * terminal_value

But that code path only applies when the Env returns terminated = False, truncated = True, whereas StockTradingEnv returns terminated = True, truncated = False. (In SB3, truncated and terminated are treated as mutually exclusive.)
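For reference, the flag checked above comes from the VecEnv wrapper's handling of the Gymnasium 5-tuple. A rough, self-contained illustration (not the exact SB3 source) of why the bootstrap never fires for StockTradingEnv:

    # SB3's DummyVecEnv derives the flag roughly as:
    #     obs, reward, terminated, truncated, info = env.step(action)
    #     info["TimeLimit.truncated"] = truncated and not terminated
    # With StockTradingEnv's current return values on the last day:
    terminated, truncated = True, False
    time_limit_truncated = truncated and not terminated
    print(time_limit_truncated)  # False -> rewards[idx] += gamma * V(terminal_obs) is skipped
    # If the env returned terminated=False, truncated=True instead, the flag would be
    # True and SB3 would add the discounted terminal value to the final reward.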

I think that in the stock trading scenario, the last day of the training data should be treated as a time-limit truncation rather than a true terminal state, because that state still carries future reward (value), unless the agent holds zero shares and the policy would never buy again in that state.

Screenshots: (image attached)

Additional context As a workaround, I simply changed the return value in the terminal case of the StockTradingEnv code, line 300, to return self.state, 0.0, False, True, {}. Setting reward = 0 there could cause some issue, though.

However, I haven't considered the impact on ElegantRL, Ray, or the portfolio management environments.
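A quick way to check that the change makes SB3's truncation path fire (a sketch reusing the env from the repro snippet above, and assuming SB3's Gymnasium-compatible VecEnv behavior): after the edit, the last step's info dict should carry TimeLimit.truncated = True together with terminal_observation.

    vec_env, obs = env.get_sb_env()
    done = [False]
    while not done[0]:
        # The VecEnv layer still returns the old 4-tuple API: (obs, rewards, dones, infos)
        obs, rewards, done, infos = vec_env.step([env.action_space.sample()])
    print(infos[0].get("TimeLimit.truncated"))               # expect True after the change
    print(infos[0].get("terminal_observation") is not None)  # expect True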