Closed Esaada closed 4 years ago
The idea behind DQN is to approximate the Q-value function with a neural network trained via Bellman backups. The Q-value function maps a state-action pair to an estimate of the cumulative reward until the end of the episode. In the case of PHYRE, each action leads to a terminal state with either a positive reward (task solved) or a negative one (not solved). The Bellman backup therefore collapses, and the DQN objective becomes nothing more than supervised learning of the reward from a state-action pair.
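To make this concrete, here is a minimal sketch (not the actual PHYRE agent, which uses a CNN over the scene observation): because every episode terminates after one action with a binary outcome, the Q-network is just a binary classifier over (state, action) pairs trained with cross-entropy. The linear model and the function names below are illustrative stand-ins.

```python
import numpy as np

def q_value(weights, state, action):
    """Score a state-action pair with a linear model (a stand-in for
    the real network); the sigmoid maps the score into (0, 1)."""
    features = np.concatenate([state, action])
    return 1.0 / (1.0 + np.exp(-features @ weights))

def supervised_dqn_loss(weights, states, actions, solved):
    """Binary cross-entropy between predicted scores and the
    solved / not-solved labels (1.0 or 0.0)."""
    preds = np.array([q_value(weights, s, a)
                      for s, a in zip(states, actions)])
    eps = 1e-9  # numerical guard for log(0)
    return -np.mean(solved * np.log(preds + eps)
                    + (1.0 - solved) * np.log(1.0 - preds + eps))
```

Minimizing this loss by gradient descent is exactly supervised learning of the terminal reward, which is why the one-step setting removes the usual target-network and bootstrapping machinery of DQN.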
During inference, vanilla DQN takes the action with the highest Q-value. In our case, however, the number of actions is effectively infinite, so taking an exact max is impossible. Instead, we sample a random set of actions and take the argmax over that set.
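A sketch of this sample-then-rank step, assuming a 3-dimensional continuous action space in [0, 1]^3 (the PHYRE ball's x, y, and radius) and some already-trained scoring function; the names and the uniform sampler are illustrative assumptions:

```python
import numpy as np

def rank_actions(score_fn, state, num_candidates=1000, seed=0):
    """Sample R candidate actions uniformly and rank them by the
    network's predicted score, best first."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(size=(num_candidates, 3))  # actions in [0, 1]^3
    scores = np.array([score_fn(state, a) for a in candidates])
    order = np.argsort(-scores)  # descending by score
    return candidates[order], scores[order]
```

The agent then attempts the top-ranked candidates in order until the task is solved or the attempt budget runs out.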
Hope this helps :)
Thanks! One more question: at inference time you said you sample R random actions. Isn't it possible that none of the R selected actions solves the task, or am I missing something? Also, the phrase "ranking" is a bit unclear to me. Our space has 100k discrete actions; do we choose the R actions randomly, or using some ranking? This confuses me, because after the neural network each of the chosen actions gets a rank. And what does the Wilcoxon signed-rank test in Section 4.2 of the paper refer to?
Yes, we select some number R of random actions, and it is possible that some tasks are not solvable with that set. However, we found that almost all tasks can be solved with 100k actions (see Fig. 2a in the paper). Also, in Fig. 4 we show that our baseline agents are still far from what could be achieved with the same set of actions under optimal ranking.
We use the implementation of the test from scipy. You can use this script to compare two agents: https://github.com/facebookresearch/phyre/blob/master/agents/compare.py
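For reference, a minimal example of the underlying scipy call. The per-task scores below are invented for illustration; the test checks whether the paired differences between two agents' scores are significantly shifted from zero:

```python
from scipy.stats import wilcoxon

# Hypothetical per-task scores for two agents (numbers invented).
agent_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
agent_b = [0.58, 0.52, 0.69, 0.50, 0.60, 0.57, 0.70, 0.49]

# Paired, non-parametric test on the per-task score differences.
stat, pvalue = wilcoxon(agent_a, agent_b)
print(f"statistic={stat}, p-value={pvalue:.4f}")
```

A small p-value suggests one agent consistently outperforms the other across tasks, without assuming the score differences are normally distributed.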
Hi, first I'd like to express my appreciation for your work; datasets like this really push AI forward. I have one small question: I'm having a hard time understanding the relation between your DQN and a traditional DQN. I mean, a traditional DQN gets a state and outputs the best action to take, while your DQN randomly samples an action and approximates its reward.
Please help me fill in this gap. Thanks.