AI4Finance-Foundation / FinRL

FinRL: Financial Reinforcement Learning. 🔥
https://ai4finance.org
MIT License
9.38k stars · 2.28k forks

Unpredictable Rewards in Stable Baseline 3 DDPG and TD3 Models - Seeking Clarification #1138

Open tkay264 opened 7 months ago

tkay264 commented 7 months ago

Hello,

Thank you for creating the library, and I appreciate your excellent work.

I've been experimenting with the Stable-Baselines3 DDPG and TD3 models. When I run the training script, I'm seeing unpredictable rewards: sometimes they are calculated correctly, and other times they stay at 0. If I stop and rerun the script, the rewards may still be 0. Could you clarify whether this is an inherent issue with the models, or whether I should review my code?

I'm using one week of 1-minute data for a single stock in my training.
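For a sense of scale, here is a rough count of the transitions that setup yields (assuming ~390 regular-session minutes per US trading day; adjust for your market's hours):

```python
# Rough number of 1-minute transitions in one week of data for one stock.
# 390 is the assumed regular-session minute count per US trading day.
minutes_per_day = 390
trading_days = 5
samples = minutes_per_day * trading_days
print(samples)  # 1950 transitions
```

A few thousand transitions is a very small dataset for off-policy algorithms like DDPG and TD3, which can by itself make results vary noticeably from run to run.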

Best regards

zhumingpassional commented 6 months ago

Thanks for your interest in this project. You can log all states and actions, and then recalculate the reward yourself. If the reward is 0, my guess is that the action is "hold": no buying or selling.
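The suggestion above can be sketched as a per-step logging loop. The stub environment below is hypothetical (it is not FinRL's `StockTradingEnv`); it exists only to make the diagnostic pattern runnable, with the reward defined as the one-step P&L of the current position, so an all-hold policy produces exactly the all-zero rewards described in the issue:

```python
# Minimal Gym-style stub: action 1 = buy one share, -1 = sell one, 0 = hold.
# Reward per step = shares held * price change over the step.
class StubTradingEnv:
    def __init__(self, prices):
        self.prices = prices
        self.reset()

    def reset(self):
        self.t, self.shares, self.cash = 0, 0, 1000.0
        return self._state()

    def _state(self):
        return (self.t, self.prices[self.t], self.shares, self.cash)

    def step(self, action):
        price = self.prices[self.t]
        if action == 1:        # buy one share at the current price
            self.shares += 1
            self.cash -= price
        elif action == -1:     # sell one share at the current price
            self.shares -= 1
            self.cash += price
        self.t += 1
        reward = self.shares * (self.prices[self.t] - price)
        done = self.t == len(self.prices) - 1
        return self._state(), reward, done

def debug_rollout(env, actions):
    """Log (action, reward) at every step to spot all-hold episodes."""
    env.reset()
    log = []
    for a in actions:
        _, reward, done = env.step(a)
        log.append((a, reward))
        if done:
            break
    return log

prices = [10.0, 11.0, 11.5, 11.0, 12.0]
hold_log = debug_rollout(StubTradingEnv(prices), [0, 0, 0, 0])
buy_log = debug_rollout(StubTradingEnv(prices), [1, 0, 0, 0])
print(hold_log)  # every reward is 0.0 -- the symptom from the issue
print(buy_log)   # rewards track price moves once a position is open
```

Running the same kind of per-step log against the real FinRL environment would show immediately whether the agent is simply choosing to hold at every step.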

robzsaunders commented 6 months ago

@zhumingpassional Are you suggesting that this output:

---------------------------------------
| time/                   |           |
|    fps                  | 328       |
|    iterations           | 27        |
|    time_elapsed         | 336       |
|    total_timesteps      | 110592    |
| train/                  |           |
|    approx_kl            | 0.0       |
|    clip_fraction        | 0         |
|    clip_range           | 0.2       |
|    entropy_loss         | -2.17e-18 |
|    explained_variance   | 0         |
|    learning_rate        | 0.125     |
|    loss                 | 1.31      |
|    n_updates            | 260       |
|    policy_gradient_loss | -3.22e-09 |
|    reward               | 0.0       |
|    value_loss           | 2.84      |
---------------------------------------

the reward value reported here is strictly for that single step in time, and not a measure of the model's cumulative performance?
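A toy illustration of that distinction (this is not Stable-Baselines3's actual logger, just a sketch of the two quantities): a field that reports only the latest step's reward can read 0.0 even while the episode return is positive.

```python
# Per-step reward vs. episode return for one rollout.
step_rewards = [0.0, 1.0, -0.5, 0.0, 2.0, 0.0]

last_step_reward = step_rewards[-1]   # what a per-step "reward" field shows
episode_return = sum(step_rewards)    # cumulative performance over the episode

print(last_step_reward)  # 0.0
print(episode_return)    # 2.5
```

So a 0.0 in a per-step reward field says nothing on its own about how the model is doing overall; the episode return (or a running mean of it) is the performance signal to watch.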