base_dqn uses 10k actions during evaluation. We did a sweep over the number of actions to rank on the validation set and found that the optimal number depends on the generalization setting due to an overcrowding problem. See lines 145-150 in run_experiment.py:
dqn_ranks = dict(
    ball_cross_template='--dqn-rank-size=1000',
    ball_within_template='--dqn-rank-size=10000',
    two_balls_cross_template='--dqn-rank-size=100000',
    two_balls_within_template='--dqn-rank-size=100000',
)
To get the final results you need to run the finals arg-generator. It will take the pretrained DQN from base_dqn and use it to run evaluation with the optimal number of actions.
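For intuition, "ranking N actions" here means scoring N candidate actions with the trained DQN and attempting them on each task in order of decreasing predicted score. A minimal sketch of that idea (the function and variable names below are made up for illustration, not the repo's API):

import numpy as np

def ranked_attempts(score_fn, candidate_actions, rank_size):
    # Keep `rank_size` candidates, score them with the trained model,
    # and return them sorted from most to least promising.
    actions = candidate_actions[:rank_size]
    scores = np.array([score_fn(action) for action in actions])
    order = np.argsort(-scores)
    return [actions[i] for i in order]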
Let me get things straight: first you ran with these settings:
dqn_ranks = dict( ball_cross_template='--dqn-rank-size=1000', ball_within_template='--dqn-rank-size=10000', two_balls_cross_template='--dqn-rank-size=100000', two_balls_within_template='--dqn-rank-size=100000', )
Afterward, you ran another sweep to find the optimal number of actions to rank?
The full sequence of experiments is listed in agents/train_all_baseline.sh.
First we train a DQN on the 3 dev folds. Then we use them to rank different numbers of actions and measure AUCCESS (sketched after the commands below):
python $RUN_EXPERIMENT_SCRIPT --use-test-split 0 --arg-generator base_dqn --num-seeds $DEV_SEEDS
python $RUN_EXPERIMENT_SCRIPT --use-test-split 0 --arg-generator rank_and_online_sweep --num-seeds $DEV_SEEDS
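AUCCESS aggregates the success percentage after k attempts with weights that favor early solutions. A minimal sketch, assuming the weighting w_k = log(k+1) - log(k) over k = 1..100 described in the PHYRE paper (this is not the repo's own implementation):

import numpy as np

def auccess(solved_within_k):
    # solved_within_k[k-1] = fraction of tasks solved within k attempts, k = 1..100.
    k = np.arange(1, len(solved_within_k) + 1)
    weights = np.log(k + 1) - np.log(k)
    return float(np.sum(weights * np.asarray(solved_within_k)) / np.sum(weights))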
Then we manually choose the best number of actions to rank (see figure 4 in the paper). These values are used to get the final numbers:
python $RUN_EXPERIMENT_SCRIPT --use-test-split 1 --arg-generator base_dqn --num-seeds $FINAL_SEEDS
wait_for_results "results/final/$DQN_BASE_NAME" $FINAL_SEEDS
python $RUN_EXPERIMENT_SCRIPT --use-test-split 1 --arg-generator finals --num-seeds $FINAL_SEEDS
The first command trains the DQN on the final (non-dev) folds (it ranks the default 10k actions during evaluation, but that doesn't matter as those results are ignored). The second command uses the pre-trained checkpoints to rank the optimal number of actions for each evaluation setting. It also evaluates the other baseline algorithms (like MEM) on the final folds. That's why the two commands are separate.
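As its name suggests, wait_for_results blocks until every seed of the base_dqn run has finished before the finals run starts. A rough Python analogue of that pattern (the directory layout and file name here are hypothetical, not what the shell function actually checks):

import os
import time

def wait_for_results(result_dir, num_seeds, poll_seconds=60):
    # Poll until each seed has written a marker file under result_dir.
    while True:
        done = sum(
            os.path.exists(os.path.join(result_dir, str(seed), 'results.json'))
            for seed in range(num_seeds)
        )
        if done == num_seeds:
            return
        time.sleep(poll_seconds)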
You can see exactly what each arg-generator command does in agents/run_experiment.py.
Hi, I used the run_experiment.py script to retrieve the results for DQN. I ran it as follows:
python agents/run_experiment.py --use-test-split 1 --arg-generator base_dqn
For the cross-template setting I got the following scores: Fold0 - 0.3587, Fold1 - 0.2079, Fold2 - 0.3287, Fold3 - 0.2956, Fold4 - 0.1978, Fold5 - 0.3928, Fold6 - 0.2538, Fold7 - 0.2892, Fold8 - 0.1529, Fold9 - 0.3756. The average is 0.2853, which is lower than the 36.8±9.7 reported in the paper. What am I missing?
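For reference, I computed that average as a plain unweighted mean over the ten folds (I'm assuming the ±9.7 in the paper is the standard deviation over folds):

import numpy as np

# Per-fold cross-template AUCCESS from the run above.
scores = [0.3587, 0.2079, 0.3287, 0.2956, 0.1978,
          0.3928, 0.2538, 0.2892, 0.1529, 0.3756]
print(np.mean(scores))         # ~0.2853, i.e. 28.5 vs. the reported 36.8
print(np.std(scores, ddof=1))  # spread over folds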
Thanks for your help.