base_dqn uses 10k actions during evaluation. We did a sweep over the number of actions to rank on the validation set and found that the optimal number depends on the generalization setting due to an overcrowding problem. See lines 145-150 in run_experiment.py:
dqn_ranks = dict(
    ball_cross_template='--dqn-rank-size=1000',
    ball_within_template='--dqn-rank-size=10000',
    two_balls_cross_template='--dqn-rank-size=100000',
    two_balls_within_template='--dqn-rank-size=100000',
)
To get the final results you need to run the finals arg-generator. It will take the pretrained DQN from base_dqn and use it to run evaluation with the optimal number of actions.
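For intuition, "ranking N actions" here means scoring N candidate actions with the trained DQN and attempting them on each task in order of decreasing predicted score. A minimal sketch of that idea (the function and variable names below are made up for illustration, not the repo's API):

import numpy as np

def ranked_attempts(score_fn, candidate_actions, rank_size):
    # Keep `rank_size` candidates, score them with the trained model,
    # and return them sorted from most to least promising.
    actions = candidate_actions[:rank_size]
    scores = np.array([score_fn(action) for action in actions])
    order = np.argsort(-scores)
    return [actions[i] for i in order]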
Let me get things straight: first you ran with these settings:
dqn_ranks = dict( ball_cross_template='--dqn-rank-size=1000', ball_within_template='--dqn-rank-size=10000', two_balls_cross_template='--dqn-rank-size=100000', two_balls_within_template='--dqn-rank-size=100000', )
Afterward, you ran another sweep to find the optimal number of actions to rank?
The full sequence of experiments is listed in agents/train_all_baseline.sh.
First we train a DQN on the 3 dev folds. Then we use them to rank different numbers of actions and measure AUCCESS (sketched after the commands below):
python $RUN_EXPERIMENT_SCRIPT --use-test-split 0 --arg-generator base_dqn --num-seeds $DEV_SEEDS
python $RUN_EXPERIMENT_SCRIPT --use-test-split 0 --arg-generator rank_and_online_sweep --num-seeds $DEV_SEEDS
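AUCCESS aggregates the success percentage after k attempts with weights that favor early solutions. A minimal sketch, assuming the weighting w_k = log(k+1) - log(k) over k = 1..100 described in the PHYRE paper (this is not the repo's own implementation):

import numpy as np

def auccess(solved_within_k):
    # solved_within_k[k-1] = fraction of tasks solved within k attempts, k = 1..100.
    k = np.arange(1, len(solved_within_k) + 1)
    weights = np.log(k + 1) - np.log(k)
    return float(np.sum(weights * np.asarray(solved_within_k)) / np.sum(weights))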
Then we manually choose the best number of actions to rank (see figure 4 in the paper). These values are used to get the final numbers:
python $RUN_EXPERIMENT_SCRIPT --use-test-split 1 --arg-generator base_dqn --num-seeds $FINAL_SEEDS
wait_for_results "results/final/$DQN_BASE_NAME" $FINAL_SEEDS
python $RUN_EXPERIMENT_SCRIPT --use-test-split 1 --arg-generator finals --num-seeds $FINAL_SEEDS
The first command trains the DQN on the final (non-dev) folds (it ranks the default 10k actions during evaluation, but that doesn't matter as those results are ignored). The second command uses the pre-trained checkpoints to rank the optimal number of actions for each evaluation setting. It also evaluates the other baseline algorithms (like MEM) on the final folds. That's why the two commands are separate.
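As its name suggests, wait_for_results blocks until every seed of the base_dqn run has finished before the finals run starts. A rough Python analogue of that pattern (the directory layout and file name here are hypothetical, not what the shell function actually checks):

import os
import time

def wait_for_results(result_dir, num_seeds, poll_seconds=60):
    # Poll until each seed has written a marker file under result_dir.
    while True:
        done = sum(
            os.path.exists(os.path.join(result_dir, str(seed), 'results.json'))
            for seed in range(num_seeds)
        )
        if done == num_seeds:
            return
        time.sleep(poll_seconds)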
You can see exactly what each arg-generator command does in agents/run_experiment.py.
Hi, I used the run_experiment.py script to retrieve the results for DQN. I ran it as follows:
python agents/run_experiment.py --use-test-split 1 --arg-generator base_dqn
For the cross-template setting I got the following scores: Fold0 - 0.3587, Fold1 - 0.2079, Fold2 - 0.3287, Fold3 - 0.2956, Fold4 - 0.1978, Fold5 - 0.3928, Fold6 - 0.2538, Fold7 - 0.2892, Fold8 - 0.1529, Fold9 - 0.3756. The average is 0.2853, which is lower than the 36.8±9.7 reported in the paper. What am I missing?
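For reference, I computed that average as a plain unweighted mean over the ten folds (I'm assuming the ±9.7 in the paper is the standard deviation over folds):

import numpy as np

# Per-fold cross-template AUCCESS from the run above.
scores = [0.3587, 0.2079, 0.3287, 0.2956, 0.1978,
          0.3928, 0.2538, 0.2892, 0.1529, 0.3756]
print(np.mean(scores))         # ~0.2853, i.e. 28.5 vs. the reported 36.8
print(np.std(scores, ddof=1))  # spread over folds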
Thanks for your help.