chncyhn / flappybird-qlearning-bot

Flappy Bird Bot using Reinforcement Learning

Question about rewards #4

Closed: Akababa closed this issue 4 years ago

Akababa commented 6 years ago

Why is the reward/objective function = score-1000 instead of just score? Does it encourage exploration, and if so what advantages does it have vs. initializing Q(s,a) with +1000?

If you plug a -1000 bias into Q in the update rule with gamma = 1, it all cancels out: Q[s,a] ← Q[s,a] + α·(r + γ·V(s') − Q[s,a])
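To make "it all cancels out" concrete, here is a quick numerical check on a toy table (illustrative code, not from the repo): with gamma = 1, a constant offset added to every entry of Q is preserved exactly by the update, so it can never change which action looks best.

```python
import random

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    # Q[s,a] += alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])
    target = r + gamma * max(Q[(s_next, b)] for b in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

states, actions = range(3), (0, 1)
Q_zero = {(s, a): 0.0 for s in states for a in actions}     # initialized at 0
Q_high = {(s, a): 1000.0 for s in states for a in actions}  # initialized at +1000

for _ in range(1000):
    s, a, s_next = (random.choice(states), random.choice(actions),
                    random.choice(states))
    r = random.choice((1, -1000))
    td_update(Q_zero, s, a, r, s_next)
    td_update(Q_high, s, a, r, s_next)

# The +1000 offset survives every update unchanged (up to float rounding).
assert all(abs(Q_high[k] - Q_zero[k] - 1000.0) < 1e-6 for k in Q_zero)
```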

Thanks for this instructive project!

chncyhn commented 6 years ago

Hello @Akababa!

The rewards are 1 if the bird lives, and -1000 if the bird dies.

See the reward function (which is simply a dictionary with 2 key-value pairs) in bot.py:

self.r = {0: 1, 1: -1000} # Reward function
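For readers skimming the thread, here is a minimal sketch of how a reward dictionary like this typically drives the tabular update after a death (illustrative names and bookkeeping only; see bot.py for the actual implementation):

```python
from collections import defaultdict

# Illustrative sketch only; not the repo's exact code.
# q maps a state key to [value_of_action_0, value_of_action_1].
q = defaultdict(lambda: [0.0, 0.0])
r = {0: 1, 1: -1000}  # 1 per frame alive, -1000 on death

def update_q(history, alpha=0.7, gamma=1.0):
    """Replay one life's (state, action, next_state) moves, newest first;
    the final move caused the death and receives r[1], all others r[0]."""
    for i, (s, a, s_next) in enumerate(reversed(history)):
        if i == 0:
            target = r[1]  # fatal frame: terminal, no bootstrap
        else:
            target = r[0] + gamma * max(q[s_next])  # bootstrapped target
        q[s][a] += alpha * (target - q[s][a])
```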

Where did you see that the reward is score-1000?

Akababa commented 6 years ago

If you add them all up over a single run, the total comes out to final score − 1000 (at least for discount rate = 1). So according to your code, if my Flappy Bird agent scored 100, I would update all the previous values with value ← (1 − alpha)·value + alpha·(−900).

chncyhn commented 6 years ago

Actually, the game's "score" is not used at all. A reward of 1 is received after every frame in which the bird is still alive.

I do understand your point that the -1000 reward might not be necessary; the positive rewards alone might have been enough. I am not sure, though, whether convergence would still be as rapid in practice. It would be an interesting experiment to try out.
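In case anyone wants to try it, here is a self-contained toy version of that experiment (not the repo's game; just the same live/die reward shape on a tiny survival task with made-up dynamics):

```python
import random
from collections import defaultdict

# Toy stand-in for the real game: in each of 10 states one hidden
# "safe" action keeps the agent alive, the other kills it.
SAFE = {s: s % 2 for s in range(10)}

def run(rewards, episodes=2000, alpha=0.7, gamma=1.0, eps=0.1):
    q = defaultdict(lambda: [0.0, 0.0])
    lengths = []
    for _ in range(episodes):
        s, steps = 0, 0
        while steps < 100:  # cap episode length
            greedy = max((0, 1), key=lambda b: q[s][b])
            a = random.randrange(2) if random.random() < eps else greedy
            died = (a != SAFE[s])
            s_next = (s + 1) % 10
            reward = rewards[1] if died else rewards[0]
            target = reward if died else reward + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            if died:
                break
            s, steps = s_next, steps + 1
        lengths.append(steps)
    return sum(lengths[-200:]) / 200  # mean survival late in training

for name, rew in {"1 / -1000": {0: 1, 1: -1000},  # the current scheme
                  "1 / 0":     {0: 1, 1: 0},      # drop the death penalty
                  "0 / -1000": {0: 0, 1: -1000}}.items():
    print(name, run(rew))
```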

The truth is that the reward function is quite arbitrary: 1 for living and -1000 for death. Infinitely many combinations are possible, and as long as the reward for living is sufficiently higher than the reward for dying, it should work out in practice.