byrdie / CSCI446_Artificial_Intelligence_Project4_ReinforcementLearning

A program to test Value Iteration, Q-Learning and SARSA on a stochastic racetrack

questions on Q-learning applied to the racetrack #1

Closed xubo92 closed 6 years ago

xubo92 commented 6 years ago

hi @byrdie: I have tried implementing standard Q-learning and SARSA on a simple racetrack map. It's very strange that SARSA converges quickly, but Q-learning barely converges at all over roughly the same number of iterations (5000~10000) in the same environment. Why? The code for generating episodes is identical in both cases. Is my implementation wrong? I checked it many times and didn't find any flaws, so I'm very confused. Could you give me some suggestions, or have you ever run into this situation?

To describe my Q-learning implementation in a little more detail: I initialize two tabular policies. One, named 'mu', is the e-greedy behavior policy used to generate episodes; the other, named 'pi', is the deterministic target policy. At every step of an episode, I update 'mu' right after updating the current Q(s,a), and I set 'pi' to the greedy policy. I look forward to your help or suggestions. Here is my Q-learning code:

```python
def Q_learning(self, agent, episode_num, epsilon, alpha, gamma, max_timestep, eval_interval):
    # assumes numpy is imported as np at module level
    ep_idx = 0
    avg_ep_return_list = []
    while ep_idx < episode_num:

        # Evaluation: every 'eval_interval' episodes, generate one episode with the
        # target policy pi and record its average return.
        if ep_idx % eval_interval == 0:
            eval_ep = agent.episode_generator(self.pi, max_timestep, False)
            print("eval episode length: %d" % (len(eval_ep) // 3))
            c_avg_return = agent.avg_return_per_episode(eval_ep)
            avg_ep_return_list.append(c_avg_return)
            print("assessing return: %f" % c_avg_return)
            print("avg return list length: %d" % len(avg_ep_return_list))

        ep_idx += 1

        agent.c_state = agent.getInitState()
        agent.next_state = agent.c_state

        n = 0
        while n < max_timestep:

            agent.c_state = agent.next_state

            # Sample an action index from the e-greedy behavior policy mu.
            # Note: this index must correspond to the action that
            # oneStep_generator() actually executes, otherwise the Q update
            # below credits the wrong action.
            c_action_idx = np.random.choice(self.action_num, 1, p=self.mu[agent.c_state])[0]

            agent.c_state, agent.c_action, agent.c_reward, agent.next_state = agent.oneStep_generator()

            # Q-learning bootstrap target: greedy value of the next state,
            # or zero if the next state is terminal.
            if agent.isTerminated():
                Qmax = 0
            else:
                Qmax = np.amax(self.Q[agent.next_state])

            self.Q[agent.c_state][c_action_idx] += alpha * (
                agent.c_reward + gamma * Qmax - self.Q[agent.c_state][c_action_idx])

            c_best_action_idx = np.argmax(self.Q[agent.c_state])

            # -------- behavior policy update at each step --------
            for action_idx in range(self.action_num):
                if action_idx == c_best_action_idx:
                    self.mu[agent.c_state][action_idx] = 1 - epsilon + epsilon / self.action_num
                else:
                    self.mu[agent.c_state][action_idx] = epsilon / self.action_num

            # -------- target policy update at each step --------
            for action_idx in range(self.action_num):
                if action_idx == c_best_action_idx:
                    self.pi[agent.c_state][action_idx] = 1.0
                else:
                    self.pi[agent.c_state][action_idx] = 0.0

            # End the episode once a terminal state has been reached.
            if agent.isTerminated():
                break

            n += 1

    return avg_ep_return_list
```
byrdie commented 6 years ago

Hi @lvlvlvlvlv,

Unfortunately I don't have time today to help you debug your code. However, if you refer to the final report I prepared for this project, you can see that the Q-Learning algorithm took a very long time to train. In the paper I discuss how it was prohibitively expensive to train Q-Learning on the entire racetrack at once; instead, I trained the agent incrementally, initially starting it only a few cells from the finish line and then gradually starting it further back.
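A rough sketch of that kind of incremental (curriculum-style) training is below, just to illustrate the idea. The environment interface used here (`sample_start_cell`, `step`, `num_actions`) is hypothetical and is not the code from this repository:

```python
import numpy as np

def incremental_q_learning(env, Q, episodes_per_stage=500, max_distance=20,
                           epsilon=0.1, alpha=0.1, gamma=1.0):
    """Curriculum-style Q-learning: start near the finish line, then move back."""
    for distance in range(1, max_distance + 1):
        for _ in range(episodes_per_stage):
            # Hypothetical helper: pick a start cell at most `distance` cells
            # from the finish line for this stage of the curriculum.
            state = env.sample_start_cell(max_distance=distance)
            done = False
            while not done:
                # e-greedy action selection from the current Q table
                if np.random.rand() < epsilon:
                    action = np.random.randint(env.num_actions)
                else:
                    action = int(np.argmax(Q[state]))
                # Hypothetical step function returning (next_state, reward, done)
                next_state, reward, done = env.step(state, action)
                target = reward if done else reward + gamma * np.max(Q[next_state])
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
    return Q
```

The point is simply that early episodes terminate quickly, so the reward signal from the finish line propagates backward through the Q table before the agent ever has to solve the full track.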

I'm not sure what the configuration of your racetrack is, but it seems completely reasonable to me that you need to wait longer for convergence.

Let me know how it goes, byrdie

xubo92 commented 6 years ago

hi @byrdie: Sorry for my late reply, and thank you very much for your explanation and your final report; they gave me some baselines and inspiration. The code I wrote earlier contained several flaws that I hadn't found before. Now the Q-learning algorithm is working fine. Checking the results, I find that SARSA and Q-learning converge at almost the same speed on my racetrack task, but Q-learning is more stable at the beginning, without as much oscillation. Maybe my racetrack map is too simple for there to be an obvious difference in speed. Thank you again for your kind help and encouragement!
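For reference, the only difference between the two update rules is the bootstrap target: Q-learning backs up the greedy value of the next state (off-policy), while SARSA backs up the value of the action actually taken next under the e-greedy behavior policy (on-policy). A minimal sketch of both updates on a tabular Q; the function names here are illustrative only, not code from this repository:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal):
    # Off-policy: bootstrap on the greedy value of the next state,
    # regardless of which action the behavior policy takes there.
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal):
    # On-policy: bootstrap on the Q value of the action a_next actually
    # selected by the e-greedy behavior policy in the next state.
    target = r if terminal else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```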