Description
Updating rl4co/tasks/eval.py for the latest version. I created this quick merge PR to write down the usage tutorial.
Motivation and Context
Fixing the first-node selection problem for the sampling method;
Adding a parser to launch evaluations efficiently.
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds core functionality)
Tutorial for the evaluation
Step 1. Prepare your pre-trained model checkpoint and the test instances data file, and put them in your preferred place. For example, we will test the AttentionModel on TSP50.
Step 2. Run eval.py with your customized setting, e.g., let's use the sampling method with a top_p=0.95 sampling strategy.
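A single run could look like the following sketch; the checkpoint path is a placeholder, and the flags mirror the parser options used in the Step 3 scripts below:
# Example single evaluation run (checkpoint/data paths are placeholders; adjust them to your files).
python rl4co/tasks/eval.py \
    --problem tsp \
    --model AttentionModel \
    --ckpt_path checkpoints/am-tsp50.ckpt \
    --data_path data/tsp/tsp50_test_seed1234.npz \
    --method sampling \
    --temperature=1.0 \
    --top_p=0.95 \
    --top_k=0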
You can check rl4co/tasks/eval.py to see more supported parameters and their hints. Here are some notes:
We currently support 7 evaluation methods: greedy, sampling, multistart_greedy, augment_dihedral_8, augment, multistart_greedy_augment_dihedral_8, and multistart_greedy_augment.
The --model parameter is the model class name, e.g., AttentionModel, POMO, SymNCO, etc.
By default, the evaluation results are saved as a .pkl file under --save_path. This file includes actions, rewards, inference_time, and avg_reward, which you can collect for further processing (see the sketch after these notes).
Some parameters are not commonly modified and therefore are not exposed in the parser, e.g., select_best=True for sampling evaluation. In the current version, you may want to hardcode and modify them directly in the script.
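As a minimal sketch of loading a saved results file (assuming the .pkl stores a dictionary with the keys listed above; the path is only an example):
python - <<'EOF'
import pickle

# Path to a results file produced via --save_path (example path, adjust to yours).
result_path = "results/tsp50_sampling.pkl"

with open(result_path, "rb") as f:
    results = pickle.load(f)  # assumed to be a dict with the keys listed above

print("keys:", list(results.keys()))
print("avg_reward:", results["avg_reward"])
print("inference_time:", results["inference_time"])
EOF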
Step 3. If you want to launch several evaluations with various parameters, you may refer to the following examples:
Evaluate POMO on TSP50 with sampling under different top-p and temperature values:
#!/bin/bash
top_p_list=(0.5 0.6 0.7 0.8 0.9 0.95 0.98 0.99 0.995 1.0)
temp_list=(0.1 0.3 0.5 0.7 0.8 0.9 1.0 1.1 1.2 1.5 1.8 2.0 2.2 2.5 2.8 3.0)
problem=tsp
model=POMO
ckpt_path=checkpoints/pomo-tsp50.ckpt
data_path=data/tsp/tsp50_test_seed1234.npz
for top_p in "${top_p_list[@]}"; do
    for temp in "${temp_list[@]}"; do
        python rl4co/tasks/eval.py --problem ${problem} --model ${model} --ckpt_path ${ckpt_path} --data_path ${data_path} --method sampling --temperature=${temp} --top_p=${top_p} --top_k=0
    done
done
Evaluate POMO on CVRP50 with sampling under different top-k and temperature values:
#!/bin/bash
top_k_list=(5 10 15 20 25)
temp_list=(0.1 0.3 0.5 0.7 0.8 0.9 1.0 1.1 1.2 1.5 1.8 2.0 2.2 2.5 2.8 3.0)
problem=cvrp
model=POMO
ckpt_path=checkpoints/pomo-cvrp50.ckpt
data_path=data/vrp/vrp50_test_seed1234.npz
for top_k in "${top_k_list[@]}"; do
    for temp in "${temp_list[@]}"; do
        python rl4co/tasks/eval.py --problem ${problem} --model ${model} --ckpt_path ${ckpt_path} --data_path ${data_path} --method sampling --temperature=${temp} --top_p=0.0 --top_k=${top_k}
    done
done
🙌 I will add a notebook for loading the results and doing some statistics soon.