Tried multiple approaches; barely any improvement.
I have read through all the relevant Reddit, Stack Overflow, and PyTorch/TF forum posts. Nothing really resembles the problem we are facing, and where suggestions did apply, I tried them without much success.
It seems like the Q function is learning, but the reward is somehow oscillating. I suspect something is wrong with the reward function or the code itself... maybe a silly bug, but I really can't find it. It might also be a problem with the visualisation or data logging??
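To rule out the logging/plotting side, here's roughly what I mean by checking whether the oscillation survives smoothing (just a sketch; the file name and `episode_rewards` are stand-ins for whatever the logger actually records):

```python
import numpy as np

# "rewards.csv" is a hypothetical path to the per-episode returns the logger keeps.
episode_rewards = np.loadtxt("rewards.csv")

# Moving average over the last 50 episodes: if the curve still swings wildly after
# smoothing, the oscillation is in the learning itself, not in how it is plotted.
window = 50
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")
print(smoothed[-10:])
```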
I believe that if others are also facing this issue, it is likely due to the algorithm's weakness. Either DQN just loses this one, or there are some optimizations that can help but are very obscure. It may just be that we don't have the compute for brute force...
Secondly, this could be due to the sparse nature of rewards in Pong. In the CartPole environment, we rewarded the agent for each timestep that the pole stayed upright. That kind of feedback let the agent quickly distinguish between good and bad actions, and we then scaled the reward with the pole angle for even more precise feedback. Pong is a different case, however. DQN was first proposed as a general solution for all Atari game environments given only image input, so we can't hand-craft more precise rewards: the reward in Pong is just -1 or +1 when a point is scored. This means there are far fewer reward signals for the agent to learn from in each play-through, and there are tons of frames between rewards, which makes it much harder for the agent to grasp the future utility of an action.
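For comparison, a minimal sketch of the two reward regimes (assuming the Gymnasium API; the CartPole angle scaling below is illustrative, not the exact shaping we used):

```python
import gymnasium as gym

# CartPole: dense feedback every step, and we can shape it further with the pole angle.
env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
for _ in range(200):
    obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
    pole_angle = obs[2]  # observation index 2 is the pole angle in radians
    # Scale the default +1 so a more upright pole earns more (illustrative factor only).
    shaped = reward * (1.0 - abs(pole_angle) / 0.2095)  # 0.2095 rad is the termination limit
    if terminated or truncated:
        obs, _ = env.reset()

# Pong, by contrast, only ever returns -1, 0, or +1, and only when a point is scored,
# so there is nothing per-frame to shape: thousands of zero-reward frames separate
# each learning signal.
```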
Although DQN was the first model used for solving Atari games with image inputs, the fact remains that DQN does have a long training time and slow convergence rate.
I have no idea what is wrong; see the reward and Q-function graphs. Sometimes you stumble upon a functional agent that moves well or seems to chase the ball, but it is highly unstable.
https://arxiv.org/pdf/1312.5602
I've tried reading over the research paper and checked the theory, hyperparameters, and algorithm; nothing seems very out of place though...
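For reference, these are roughly the values I've been checking against (mostly from memory of the DeepMind DQN papers, so treat the exact numbers as my assumptions rather than quotes):

```python
# Reference DQN hyperparameters for Atari, as I recall them from the papers.
dqn_hparams = {
    "replay_buffer_size": 1_000_000,    # transitions kept for experience replay
    "batch_size": 32,
    "gamma": 0.99,                      # discount factor
    "learning_rate": 2.5e-4,            # RMSProp in the original setup
    "epsilon_start": 1.0,
    "epsilon_final": 0.1,
    "epsilon_decay_frames": 1_000_000,  # linear anneal
    "target_update_frequency": 10_000,  # steps between target-network syncs (2015 version)
    "frame_stack": 4,
    "frame_skip": 4,
    "replay_start_size": 50_000,        # warm-up steps before learning begins
}
```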