Open zhiyiZeng opened 2 years ago
I tried making the learning rate smaller, but the performance got worse, so I don't think the learning rate is the issue here.
I think the model does learn something, but whatever it learns isn't reflected in the rewards. I wonder why that is.
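In case it helps, this is roughly how I'm checking whether the policy improves on the metric I actually care about, separately from the training reward. It's only a minimal sketch of my own setup: the gym-style `reset()`/`step()` interface, the `policy` callable, and the `portfolio_value` info key are my assumptions, not this repo's API.

```python
def evaluate_vs_buy_and_hold(env, policy, prices):
    """Roll out a trained policy once and compare its final return
    to a simple buy-and-hold baseline on the same price series."""
    obs = env.reset()
    done = False
    agent_value = 1.0  # normalized starting capital
    while not done:
        action = policy(obs)                    # assumed: policy maps obs -> action
        obs, reward, done, info = env.step(action)
        agent_value = info["portfolio_value"]   # assumed info key in my env
    # Buy-and-hold: buy at the first price, hold until the last
    bnh_return = prices[-1] / prices[0] - 1.0
    agent_return = agent_value - 1.0
    return agent_return, bnh_return
```

If `agent_return` keeps improving across epochs while the raw reward stays flat, that would at least confirm the policy is learning something the reward curve doesn't show.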
This repo is pretty awesome. I'm trying to run a basic demo, but training doesn't seem to be working at all (the rewards don't converge). Yet the agent still outperforms B&H by a lot (even when the reward is negative!). I'm confused by this. Is there an explanation for it?
The graph shows the rewards over 50 training epochs.
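Since the per-epoch rewards are very noisy, I also tried smoothing them before judging convergence. Just a sketch over the reward list I log each epoch (the `rewards` list and the window size are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_smoothed_rewards(rewards, window=5):
    """Plot raw per-epoch rewards next to a rolling mean so any weak
    trend is easier to see than in the raw, noisy curve."""
    rewards = np.asarray(rewards, dtype=float)
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    plt.plot(rewards, alpha=0.3, label="raw epoch reward")
    plt.plot(range(window - 1, len(rewards)), smoothed,
             label=f"rolling mean ({window})")
    plt.xlabel("epoch")
    plt.ylabel("reward")
    plt.legend()
    plt.show()
```

Even with the smoothing I don't see a clear upward trend, which is why I'm asking.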