IBM / rl-testbed-for-energyplus

Reinforcement Learning Testbed for Power Consumption Optimization using EnergyPlus
MIT License

TRPO and PPO Models don't train #84

Open yashviagrawal opened 2 years ago

yashviagrawal commented 2 years ago

@antoine-galataud @takaomoriyama Hello, I've been working on this project for a long time. I trained the model with the TRPO policy for over 2000 epochs, but the reward stabilized early on. I then switched to the PPO policy, which showed great progress over the first 250 epochs: the reward went from about -200,000 to -20,000. After that, despite running it for 300 more epochs, the reward didn't improve any further and just stabilized. Could you please help me figure out why?

antoine-galataud commented 2 years ago

Hi @yashviagrawal It's hard to tell what's going wrong without more details. At first sight it looks like training is converging, since the mean episode reward increases and then stabilizes. What are your expectations?
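(A side note: raw episode-reward curves are noisy, so it helps to smooth them before judging whether training has really plateaued. Below is a minimal sketch; the log file name and format are hypothetical stand-ins for wherever your training script writes per-episode rewards.)

```python
# Minimal sketch: smooth a noisy per-episode reward curve with a moving
# average to judge convergence. "episode_rewards.txt" (one reward per line)
# is a hypothetical stand-in for your actual training log.
import numpy as np
import matplotlib.pyplot as plt

rewards = np.loadtxt("episode_rewards.txt")

window = 20  # number of episodes per smoothing window
smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")

plt.plot(rewards, alpha=0.3, label="raw episode reward")
plt.plot(np.arange(window - 1, len(rewards)), smoothed, label=f"{window}-episode mean")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```

If the smoothed curve is flat over hundreds of episodes, the policy has most likely converged to a local optimum rather than still being mid-training.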

yashviagrawal commented 2 years ago

@antoine-galataud The graph shown in the repository indicates that the RL agent reached its goal of optimizing power consumption in just 300 epochs. But even after I ran the code for 2000 epochs with the TRPO policy, it still didn't optimize it, and the same goes for the PPO policy.

My expectation is for the RL agent to keep learning properly rather than stabilizing almost immediately.

I have attached a graph of the PPO model trained for 200 epochs, where the RL agent was unable to achieve the goal:

[image: reward curve of the PPO model trained for 200 epochs]
antoine-galataud commented 2 years ago

@yashviagrawal thanks for sharing some results. Did you try to run an experiment from master sources (using TRPO, and without any changes)?
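(For reference, an unmodified TRPO baseline run is launched roughly as follows; the module path follows the repository layout and the timestep count is just an example, so check the README for your checkout.)

```
$ python3 -m baselines_energyplus.trpo_mpi.run_energyplus --num-timesteps 1000000000
```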

yashviagrawal commented 2 years ago

@antoine-galataud Hello, yes, I did try to run the original code with the TRPO policy. I also wanted to ask: how do I find out what the action space is?
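(One way to answer that is to inspect the Gym environment directly. A minimal sketch, assuming the environment is registered under the "EnergyPlus-v0" id by the gym_energyplus package, as in this repository, and that the usual ENERGYPLUS, ENERGYPLUS_MODEL, and ENERGYPLUS_WEATHER environment variables are set; the printed shapes depend on the building model you selected.)

```python
# Minimal sketch: inspect the action and observation spaces of the
# EnergyPlus Gym environment. Importing gym_energyplus is assumed to
# register the "EnergyPlus-v0" environment id.
import gym
import gym_energyplus  # noqa: F401 -- importing registers the env id

env = gym.make("EnergyPlus-v0")
print(env.action_space)        # e.g. a Box of continuous setpoint actions
print(env.action_space.low)    # per-dimension lower bounds
print(env.action_space.high)   # per-dimension upper bounds
print(env.observation_space)   # the observation Box
```

The shape of the action Box tells you how many continuous actions the agent controls, and the low/high arrays give their valid range.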