facebookresearch / phyre

PHYRE is a benchmark for physical reasoning.
https://phyre.ai
Apache License 2.0

Question about the DQN model #14

Closed Esaada closed 4 years ago

Esaada commented 4 years ago

Hi, first I'd like to express my appreciation for your work; datasets like this really push AI forward. I have one small question: I'm having a hard time understanding the relation between your DQN and a traditional DQN. A traditional DQN takes a state and outputs the action with the highest estimated value, while your DQN randomly samples an action and approximates its reward.

Please help me fill in this gap. Thanks.

akhti commented 4 years ago

The idea behind DQN is to approximate the Q-value function with a neural network using Bellman backups. The Q-value function maps a state-action pair to an estimated cumulative reward until the end of the episode. In the case of PHYRE, each action leads directly to a final state with either a positive (solved) or a negative (not solved) reward. Therefore, the DQN objective becomes nothing more than supervised learning from a state-action pair to the reward.
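In code, that collapsed objective might look like the following minimal sketch (the names `q_network`, `observations`, `actions`, and `solved` are illustrative, not the actual PHYRE agent code):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_network, observations, actions, solved):
    """One training step's loss for the degenerate, single-step DQN."""
    # The network scores each (state, action) pair with a single logit.
    logits = q_network(observations, actions)
    # Because every episode ends after one action, the Bellman backup
    # reduces to a binary classification target: solved (1) or not (0).
    targets = solved.float()
    return F.binary_cross_entropy_with_logits(logits, targets)
```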

During inference, a vanilla DQN takes the action with the highest Q-value. However, in our case the number of actions is infinite, so it's impossible to take the max over all of them. Instead, we sample a random set of actions and take the argmax over that set.
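As a rough sketch of that inference loop (again with illustrative names; this assumes PHYRE's single-ball actions are 3-dimensional vectors in [0, 1], but check the paper/code for the exact parameterization):

```python
import numpy as np

def select_action(q_network, observation, num_samples=1000):
    # Sample candidate actions uniformly from the continuous action space
    # (here assumed to be [0, 1]^3: ball x, y, and radius).
    candidates = np.random.uniform(0.0, 1.0, size=(num_samples, 3))
    # Score every (state, action) pair with the trained network ...
    scores = np.array([q_network(observation, a) for a in candidates])
    # ... and take the argmax over the sampled set, not the full space.
    return candidates[int(np.argmax(scores))]
```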

Hope this helps :)

Esaada commented 4 years ago

Thanks. One more question: at inference time you said you sample R random actions. Isn't it possible that none of the R selected actions solves the task, or am I missing something? Also, the phrase "ranking" is a bit unclear to me: our space has 100k discrete actions, and we need to choose R of them. Is that choice random, or does it use some ranking? This confuses me because, after the neural network, each of the chosen actions gets a rank. Finally, what does the Wilcoxon signed-rank test in section 4.2 of the paper refer to?

akhti commented 4 years ago

Yes, we select some random number R of actions, and it's possible that some tasks are not solvable with that set. However, we found that almost all tasks can be solved with 100k actions (see Fig. 2a in the paper). Also, in Fig. 4 we show that our baseline agents are still far below what could be achieved with the same set of actions under an optimal ranking.
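For reference, a fixed candidate set like this can be drawn once through the PHYRE simulator API. A sketch based on the public tutorial (the task id is illustrative, and the exact call may differ across phyre versions):

```python
import phyre

# One illustrative task id in PHYRE's 'template:instance' format.
task_ids = ['00000:000']
simulator = phyre.initialize_simulator(task_ids, 'ball')

# Draw a fixed set of 100k candidate actions; the agent then ranks
# these same candidates at evaluation time.
actions = simulator.build_discrete_action_space(max_actions=100000)
```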

We use the implementation of the test from SciPy (scipy.stats.wilcoxon). You can use this script to compare two agents: https://github.com/facebookresearch/phyre/blob/master/agents/compare.py
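For intuition, here is a minimal example of the underlying SciPy call on made-up per-task scores (the real comparison lives in the compare.py script above):

```python
from scipy.stats import wilcoxon

# Hypothetical per-task scores for two agents (illustrative numbers only).
agent_a = [0.61, 0.48, 0.92, 0.33, 0.75, 0.58]
agent_b = [0.55, 0.51, 0.88, 0.30, 0.70, 0.52]

# Paired, non-parametric test on the per-task score differences.
statistic, p_value = wilcoxon(agent_a, agent_b)
print(f"W = {statistic:.1f}, p = {p_value:.3f}")
```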