chncyhn / flappybird-qlearning-bot

Flappy Bird Bot using Reinforcement Learning

how to discrete the state #5

Closed ProQianXiao closed 4 years ago

ProQianXiao commented 6 years ago

Hi, Cihan Ceyhan. I have studied your flappybird-qlearning-bot code, and it's excellent. But I have one question about it; could you help me?

How do you discretize the state?

```python
def map_state(self, xdif, ydif, vel):
    '''
    Map the (xdif, ydif, vel) to the respective state, with regards to the grids
    The state is a string, "xdif_ydif_vel"

    X -> [-40,-30...120] U [140, 210 ... 420]
    Y -> [-300, -290 ... 160] U [180, 240 ... 420]
    '''
    if xdif < 140:
        xdif = int(xdif) - (int(xdif) % 10)
    else:
        xdif = int(xdif) - (int(xdif) % 70)

    if ydif < 180:
        ydif = int(ydif) - (int(ydif) % 10)
    else:
        ydif = int(ydif) - (int(ydif) % 60)

    return str(int(xdif)) + '_' + str(int(ydif)) + '_' + str(vel)
```

How do you determine the boundaries of X and Y? Thank you very much.

chncyhn commented 6 years ago

Hello @ProQianXiao , thanks for your comment!

The boundaries of the states come from the way they are calculated.

xdif is the horizontal distance to the next pipe, calculated as the next pipe's left corner's x coordinate minus the bird's x coordinate. The maximum distance observed for this is at the very start of the game, where the bird has to travel quite a bit to reach the first pipe; at that point xdif is around 430. This never happens again once we proceed between the pipes, as the maximum possible distance is then always less than 140. For this reason, I discretized xdif values higher than 140 into large bins of size 70, as you can see. So, for instance, all the xdif values between 140 and 210 are binned to 140 via the calculation xdif - (xdif % 70). My goal was to increase the speed of convergence: after the bird is able to reach the first pipe, these states are never visited again during the game, so there is no reason to discretize them into small bins.
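A quick sanity check of that binning, as a standalone sketch (just the x rule pulled out of map_state above):

```python
# Bin xdif as in map_state: 10-wide bins below 140, 70-wide bins from 140 upward.
def bin_xdif(xdif):
    xdif = int(xdif)
    return xdif - xdif % 10 if xdif < 140 else xdif - xdif % 70

for x in [135, 139, 140, 175, 209, 210, 420]:
    print(x, "->", bin_xdif(x))
# 135 -> 130, 139 -> 130, 140 -> 140, 175 -> 140, 209 -> 140, 210 -> 210, 420 -> 420
```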

Another point is the negative values that are possible for xdif (-40, -30, -20, ...). They occur when the bird is directly above a pipe, since at that moment the bird's x coordinate is greater than the pipe's left corner's x coordinate. The pipe used in the calculations switches to the next oncoming pipe once the difference drops below -30, which is when the bird is about to come out of the current pipe (a rough sketch of this pipe-selection idea is below).
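Purely illustrative, not the repo's exact code: `pipes` here is a hypothetical list of dicts with an "x" key, nearest pipe first.

```python
# Keep targeting the current pipe until the bird has nearly passed it
# (difference below -30), then fall through to the next pipe in the list.
def relevant_pipe(bird_x, pipes):
    for pipe in pipes:
        if pipe["x"] - bird_x >= -30:   # bird hasn't (nearly) cleared this pipe yet
            return pipe
    return pipes[-1]
```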

Similar ideas hold for the ydif.

Cihan

ProQianXiao commented 6 years ago

@chncyhn, thank you very much again; that helps me a lot. Now I understand your idea. It's excellent, and it converges quickly. That's great.

ProQianXiao commented 6 years ago

@chncyhn, sorry to bother you again. I encountered some problems when I modified your code. I want to use SARSA to update the qvalues, so I made some modifications to the update_scores function. Here is the code:

```python
def update_scores(self):
    history = list(reversed(self.moves))

    # Flag if the bird died in the top pipe
    high_death_flag = True if int(history[0][2].split('_')[1]) > 120 else False

    # SARSA score updates
    t = 1
    res_act = history[0][1]  # Modification: the action actually taken, instead of the max-qvalue action
    for exp in history:
        state = exp[0]
        act = exp[1]
        res_state = exp[2]
        # Modification:
        if t == 1:
            self.qvalues[state][act] = (1 - self.lr) * (self.qvalues[state][act]) + \
                                       self.lr * (self.r[1])
        elif t == 2:
            self.qvalues[state][act] = (1 - self.lr) * (self.qvalues[state][act]) + \
                                       self.lr * (self.r[1] + self.discount * self.qvalues[res_state][res_act])
        elif high_death_flag and act:
            self.qvalues[state][act] = (1 - self.lr) * (self.qvalues[state][act]) + \
                                       self.lr * (self.r[1] + self.discount * self.qvalues[res_state][res_act])
            high_death_flag = False
        else:
            self.qvalues[state][act] = (1 - self.lr) * (self.qvalues[state][act]) + \
                                       self.lr * (self.r[0] + self.discount * self.qvalues[res_state][res_act])
        t += 1
        res_act = act

    self.gameCNT += 1  # increase game count
    self.dump_qvalues()  # Dump q values (if game count % DUMPING_N == 0)
    self.moves = []  # clear history after updating strategies
```

After the modification, I ran the program, but the result is bad. I trained for 20 thousand games, and the max score is only 5. Could you tell me what is wrong? Thanks very much.

chncyhn commented 6 years ago

Hey @ProQianXiao, frankly I am not too familiar with the details of the SARSA algorithm, but you should probably tune the hyperparameters (learning rate, reward structure, discount rate) a bit for it. For instance, I just tried your modification with the learning rate changed to 0.25 and the crashing reward to -100, and now it's able to hit 100+ scores.
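Concretely, something along these lines, assuming your bot keeps the same attributes as in the snippets above (self.lr, self.discount, self.r); the class name here is just a stand-in for wherever you actually define them:

```python
class SarsaBot:
    """Minimal sketch of the hyperparameter tweak; the rest of the bot
    (qvalues, update_scores, ...) stays as in the snippets above."""
    def __init__(self):
        self.lr = 0.25              # learning rate, lowered for the SARSA variant
        self.discount = 1.0         # discount factor, left unchanged here
        self.r = {0: 1, 1: -100}    # r[1] = crash penalty set to -100; r[0] is a placeholder ordinary per-step reward
```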

[screenshot: training run with the tuned hyperparameters reaching 100+ scores]

By the way, if you are not aware, you can change the FPS variable in flappy.py (which is set to 60 by default) to a very high number during training to make it run faster.
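For example (FPS is a module-level constant in flappy.py; the training value below is arbitrary, anything large works):

```python
# flappy.py
FPS = 60       # default, comfortable for watching the bird play
# FPS = 6000   # uncomment while training so episodes finish much faster
```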

ProQianXiao commented 6 years ago

@chncyhn thank you very much, and your suggestion is great. I tuned the hyperparameters, with the learning rate changed to 0.25 and the crashing reward to -100 as you said, and it can now hit higher scores, but there is still a big gap compared with your code. I will experiment more with the hyperparameters.

In addition, there is something about your code that I am not quite sure about, and I do not know whether my understanding is correct.

```python
def update_scores(self):
    """ Update qvalues via iterating over experiences """
    history = list(reversed(self.moves))

    # Flag if the bird died in the top pipe
    high_death_flag = True if int(history[0][2].split('_')[1]) > 120 else False

    # Q-learning score updates
    t = 1
    for exp in history:
        state = exp[0]
        act = exp[1]
        res_state = exp[2]

        # Select reward
        if t == 1 or t == 2:
            cur_reward = self.r[1]
        elif high_death_flag and act:
            cur_reward = self.r[1]
            high_death_flag = False
        else:
            cur_reward = self.r[0]

        # Update
        self.qvalues[state][act] = (1 - self.lr) * (self.qvalues[state][act]) + \
                                   self.lr * (cur_reward + self.discount * max(self.qvalues[res_state]))
        t += 1

    self.gameCNT += 1  # increase game count
    self.dump_qvalues()  # Dump q values (if game count % DUMPING_N == 0)
    self.moves = []  # clear history after updating strategies
```

When t==1 or t==2, cur_reward is set to -1000. My thought is that t==1 and t==2 correspond to the two states before crashing. Since the game is a continuous process, those two states have a relatively important relationship with the crashing state compared with other states (they caused the crash), so the reward is set to -1000 no matter what action was taken.

And when 'high_death_flag and act' is satisfied, it means the bird crashed into the upper pipe, so the states before crashing in which the 'flap' action was taken should be punished, with the reward set to -1000.

That's my understanding; am I right? Looking forward to your reply.

chncyhn commented 6 years ago

@ProQianXiao yes, your understanding is mostly correct (except for the "...no matter what action it took" statement, because we punish (state, action) pairs, not just states; note that we update self.qvalues[state][act]).

Another small difference is that you said "... it means that the bird crashed on the upper pipe, so the states before crashing that took 'flap' action should be punished", but only the last state where the 'flap' action was taken is punished, specifically the last (state, action) pair whose action is 'flap'. The toy trace below illustrates this.
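Here is a toy walk-through of the reward selection in update_scores (made-up states; r[1] = -1000 as above, and r[0] is just a stand-in value), showing that only the last pre-crash 'flap' receives the crash penalty when the bird died in the top pipe:

```python
r = {0: 1, 1: -1000}                            # r[0] is a placeholder ordinary reward
history = [("s4", 0, "crash"), ("s3", 0, "s4"), ("s2", 1, "s3"),
           ("s1", 1, "s2"), ("s0", 0, "s1")]    # newest experience first, as in update_scores
high_death_flag = True                          # bird died in the top pipe
for t, (state, act, res_state) in enumerate(history, start=1):
    if t == 1 or t == 2:
        cur_reward = r[1]           # the two experiences right before the crash
    elif high_death_flag and act:
        cur_reward = r[1]           # the last 'flap' before the crash...
        high_death_flag = False     # ...and only that one, since the flag is cleared
    else:
        cur_reward = r[0]
    print(t, state, act, cur_reward)
# Only (s4, 0), (s3, 0) and (s2, 1) get -1000; the earlier flap (s1, 1) gets r[0].
```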

I haven't looked into it much, but in this thread there is a good comparison of Q-learning vs. SARSA which you might find useful.
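For reference, the core difference that comparison is about, as a toy snippet (made-up values, not repo code): Q-learning bootstraps on the best available action in the next state, while SARSA bootstraps on the action actually taken next (your res_act):

```python
lr, discount = 0.25, 1.0
qvalues = {"s": [0.0, 0.0], "s_next": [2.0, -5.0]}
state, act, res_state, res_act, reward = "s", 1, "s_next", 1, 0

q_target = reward + discount * max(qvalues[res_state])          # Q-learning target: 2.0
sarsa_target = reward + discount * qvalues[res_state][res_act]  # SARSA target: -5.0

# Both blend the target into the current estimate the same way:
qvalues[state][act] = (1 - lr) * qvalues[state][act] + lr * sarsa_target
print(q_target, sarsa_target, qvalues[state][act])
```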

ProQianXiao commented 6 years ago

@chncyhn Thanks for your reply; now I have a deeper understanding. I will read the comparison of Q-learning vs SARSA carefully. Thanks again for your help; it means a lot to me.