choo8 / Tensorflow-DeepMind-Atari-Deep-Q-Learner-2Player

A code reimplementation of DeepMind's "Multiagent Cooperation and Competition with Deep Reinforcement Learning" with Tensorflow

Pong2Player environment usage #1

Open wweichn opened 6 years ago

choo8 commented 6 years ago

Hey, sorry for the late reply. We are still working on this project, so it won't be complete for another month or so. I saw that you are interested in an API for the 2-player Pong game. You can actually look at the API of Xitari2Player, which I ported over to Python at https://github.com/choo8/Xitari2Player. The API calls there let you send actions for both players in Pong.

wweichn commented 6 years ago

Hi, thanks for your response. I tried to run main_2.py in the training branch, and it runs fine once you edit 'history.py': just change the else branch of the get function to return np.transpose(self.history, (1, 2, 0)).
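For reference, assuming history.py follows the usual DQN-TensorFlow layout (a buffer shaped (history_length, height, width)), the patched class looks roughly like this:

```python
import numpy as np

class History(object):
    def __init__(self, history_length, screen_height, screen_width):
        # Frames are stacked along the first axis: (history_length, height, width).
        self.history = np.zeros(
            [history_length, screen_height, screen_width], dtype=np.float32)

    def add(self, screen):
        # Shift old frames out and append the newest one.
        self.history[:-1] = self.history[1:]
        self.history[-1] = screen

    def get(self):
        # Always hand the network NHWC-ordered data: (height, width, history_length).
        return np.transpose(self.history, (1, 2, 0))
```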

choo8 commented 6 years ago

Ok, let me know if you have more questions about getting the Pong2Player code running.

wweichn commented 6 years ago

Thanks, I am trying to rewrite a version based on your work, because I am not familiar with ALE and the related tooling. Also, I don't understand your definition of the reward in agent.py: inside def observe(self, screen, reward, action, terminal) you have reward = max(self.min_reward, min(self.max_reward, reward)). Would you mind explaining this to me? In my view, reward = reward would be fine.

choo8 commented 6 years ago

I believe this is useful if you want to clip the rewards. You could also do reward = reward; that should work as well.
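In other words, it just bounds each raw emulator reward to [min_reward, max_reward] (typically [-1, 1]), the standard DQN trick so the scale of the TD error stays comparable across games. A minimal sketch:

```python
def clip_reward(reward, min_reward=-1.0, max_reward=1.0):
    # Bound the raw emulator reward so a single large score cannot dominate the TD error.
    return max(min_reward, min(max_reward, reward))

assert clip_reward(3.0) == 1.0    # large positive rewards are capped at +1
assert clip_reward(-2.0) == -1.0  # large negative rewards are capped at -1
assert clip_reward(0.5) == 0.5    # rewards already inside the range pass through
```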

wweichn commented 6 years ago

The reward is obtained from ale.ale_getRewardA() or ale.ale_getRewardB(). Can you tell me the range of the reward? I have read

Multiagent Cooperation and Competition with Deep Reinforcement Learning

and in that paper the reward is between -1 and 1. Thanks a lot.

choo8 commented 6 years ago

Yes, I believe it is within the range of -1 and 1. The values, determined by the ROM used, should be as described in the paper "Multiagent Cooperation and Competition with Deep Reinforcement Learning".

wweichn commented 6 years ago

Sorry to bother you again. Why are the actions [0, 1, 3, 4] for agent 1 and [20, 21, 23, 24] for agent 2? I know there are four possible actions [noop, fire, up, down]. Is this defined in roms/Pong2Player025.bin? Thanks.

choo8 commented 6 years ago

These actions are actually defined by the Xitari2Player environment, not by the ROM. You can see the full list of actions at https://github.com/choo8/Xitari2Player/blob/master/ale_interface.hpp. I only included the 4 relevant actions in the training script.
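If I recall the Xitari2Player enum correctly (please double-check against ale_interface.hpp), the two action lists line up like this; note that in Pong the RIGHT/LEFT joystick codes move the paddle up/down:

```python
# Subset of the Xitari2Player Action enum used by the training script
# (values quoted from memory -- verify them in ale_interface.hpp).
AGENT1_ACTIONS = [0, 1, 3, 4]      # PLAYER_A: NOOP, FIRE, RIGHT (up), LEFT (down)
AGENT2_ACTIONS = [20, 21, 23, 24]  # PLAYER_B: the same four moves, offset into the PLAYER_B_* range
```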

wweichn commented 6 years ago

Thanks. Where can I find the definition of ale.ale_isGameOver? I found it hard to reach a state where ale.ale_isGameOver is true. And from my observation, is it right that one side scoring 20 points first means the game of that epoch is over?

choo8 commented 6 years ago

According to the paper, a game of Pong ends when 21 points is scored by either agent. Epochs are determined by the number of iterations, where 250000 iterations equal one epoch. These hyperparameters are also the ones used in the original paper.
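So the two notions are independent: ale_isGameOver only becomes true when a game reaches 21 points, while epoch boundaries are purely a step count. A toy sketch of the bookkeeping (play_one_game here is just a stand-in for running the emulator until 21 points):

```python
import random

STEPS_PER_EPOCH = 250000  # iterations per epoch, as in the paper

def play_one_game():
    # Stand-in for running the emulator until one side scores 21 points
    # (i.e. until ale_isGameOver becomes true); returns the steps that took.
    return random.randint(1000, 3000)

global_step, epoch = 0, 0
while epoch < 4:  # toy stopping condition for this sketch
    global_step += play_one_game()
    if global_step // STEPS_PER_EPOCH > epoch:
        # Epoch boundary: counted in iterations, independent of game resets.
        epoch = global_step // STEPS_PER_EPOCH
```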

wweichn commented 6 years ago

Thanks a lot. I think there might be a mistake in your code regarding actions. Taking agent 2 as an example, the valid actions are [20, 21, 23, 24], but the output of the network is in [0, 1, 2, 3], so it needs a mapping from [0, 1, 2, 3] to [20, 21, 23, 24]; using [0, 1, 2, 3] as indices into [20, 21, 23, 24] works. Also, the exploration rate changes with agent.step, but at the beginning of a new epoch agent.step restarts from 0; the exploration rate shouldn't restart from ep_start, it should continue from the value reached at the last step of the previous epoch. See the sketch below.
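Something like the following is what I mean (the two action lists are the ones from the training script; the hyperparameter names ep_start, ep_end, ep_end_t and learn_start are just placeholders for whatever the config actually uses):

```python
AGENT1_ACTIONS = [0, 1, 3, 4]      # emulator codes for [noop, fire, up, down], player A
AGENT2_ACTIONS = [20, 21, 23, 24]  # the same four moves for player B

def to_env_action(q_index, action_set):
    # The network predicts an index in {0, 1, 2, 3}; use it to index the
    # emulator action list instead of passing it to ALE directly.
    return action_set[q_index]

def exploration_rate(global_step, ep_start=1.0, ep_end=0.1,
                     ep_end_t=1000000, learn_start=50000):
    # Linear annealing driven by a global step counter that keeps growing
    # across epochs, so epsilon never jumps back to ep_start at an epoch boundary.
    progress = max(0, global_step - learn_start)
    frac = max(0.0, (ep_end_t - progress) / ep_end_t)
    return ep_end + (ep_start - ep_end) * frac

assert to_env_action(2, AGENT2_ACTIONS) == 23
assert abs(exploration_rate(0) - 1.0) < 1e-12      # before learning starts
assert abs(exploration_rate(10**8) - 0.1) < 1e-12  # fully annealed
```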