Lifelong-Robot-Learning / LIBERO

Benchmarking Knowledge Transfer in Lifelong Robot Learning

Performance Discrepancies on libero_object Evaluation #23

Open pengzhi1998 opened 3 weeks ago

pengzhi1998 commented 3 weeks ago

Dear Authors,

Sorry to bother you again. I have a question regarding the performance evaluation on the libero_object dataset. I ran LIBERO with the following command:

export CUDA_VISIBLE_DEVICES=0 && export MUJOCO_EGL_DEVICE_ID=0 && python lifelong/main.py seed=1 benchmark_name=libero_object policy=bc_transformer_policy lifelong=er

I conducted two evaluations with slightly different configurations. First Evaluation Configuration:

load_path: "" # only used when separately evaluating a pretrained model
eval: true
batch_size: 64
num_workers: 4
n_eval: 10
eval_every: 5
max_steps: 600
use_mp: false
num_procs: 10
save_sim_states: false

Results of the 10 tasks:

[info] evaluate task 0 takes 121.7 seconds
[info] evaluate task 1 takes 137.5 seconds
[info] evaluate task 2 takes 56.4 seconds
[info] evaluate task 3 takes 95.5 seconds
[info] evaluate task 4 takes 115.3 seconds
[info] evaluate task 5 takes 60.4 seconds
[info] evaluate task 6 takes 128.6 seconds
[info] evaluate task 7 takes 97.0 seconds
[info] evaluate task 8 takes 120.8 seconds
[info] evaluate task 9 takes 106.4 seconds
[info] train time (min) 81.9 eval loss time 2.9 eval success time 17.3
[Task  9 loss ]  -6.28 | -6.56 | -11.11 | -8.18 | -7.98 | -6.96 | -10.55 | -13.04 | -20.24 | -21.90 |
[Task  9 succ.]  0.20 | 0.10 | 0.90 | 0.50 | 0.30 | 0.90 | 0.10 | 0.50 | 0.20 | 0.50 |
[info] finished learning

Second Evaluation Configuration:

load_path: "" # only used when separately evaluating a pretrained model
eval: true
batch_size: 64
num_workers: 4
n_eval: 25
eval_every: 10
max_steps: 600
use_mp: false
num_procs: 25
save_sim_states: false

Results:

[info] evaluate task 0 takes 202.4 seconds
[info] evaluate task 1 takes 241.4 seconds
[info] evaluate task 2 takes 134.6 seconds
[info] evaluate task 3 takes 343.1 seconds
[info] evaluate task 4 takes 181.1 seconds
[info] evaluate task 5 takes 243.8 seconds
[info] evaluate task 6 takes 148.3 seconds
[info] evaluate task 7 takes 192.6 seconds
[info] evaluate task 8 takes 158.1 seconds
[info] evaluate task 9 takes 166.8 seconds
[info] train time (min) 87.5 eval loss time 2.9 eval success time 33.5
[Task  9 loss ]  52.18 | 46.33 | 27.23 | 49.31 | 44.87 | 33.89 | 20.85 | 22.12 | -3.64 | -27.05 |
[Task  9 succ.]  0.64 | 0.52 | 0.92 | 0.00 | 0.72 | 0.48 | 0.88 | 0.72 | 0.88 | 0.80 |
[info] finished learning

I'm confused about two points:

  1. Despite only minor differences in the evaluation settings, the results show significant discrepancies. For instance, the evaluation loss for the first task is -6.28 in the first run but 52.18 in the second. Why is there such a large gap?
  2. I expected the success rate on earlier tasks to decrease as new tasks are learned during the lifelong run. However, in some cases the success rate on earlier tasks remains high, or is even higher than on the latest tasks. Could it be that the number of evaluation episodes per task (25) is too small? A rough back-of-the-envelope check is sketched below. Could you provide guidance on why this might happen? Thank you again for your attention and help; I look forward to your reply!
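
As a rough sense of how noisy these success-rate estimates are, here is a back-of-the-envelope sketch (treating each rollout as an independent Bernoulli trial; the numbers are illustrative, not produced by LIBERO):

# Standard error of a success-rate estimate from n evaluation rollouts,
# using the worst case of a true success rate of 0.5.
import math

for n in (10, 25, 100):
    se = math.sqrt(0.5 * 0.5 / n)
    print(f"n_eval={n}: std. error ~ {se:.2f}, ~95% interval ~ +/-{1.96 * se:.2f}")

# n_eval=10:  ~ +/-0.31
# n_eval=25:  ~ +/-0.20
# n_eval=100: ~ +/-0.10

With only 10 or 25 rollouts per task, a per-task success rate can therefore shift by 0.2-0.3 between runs from sampling noise alone.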

Best regards, Pengzhi

HeegerGao commented 2 weeks ago

Hi @pengzhi1998, interesting findings. Here are my thoughts:

  1. Different losses across the two experiments: the loss is the negative log-probability under the GMM head, and the variance of the GMM strongly influences that log-probability. My guess is that the learned policies in the two experiments ended up with different final means and variances, which is why the losses differ so much (see the sketch after this list).

  2. In lifelong learning, yes, performance on previously learned tasks is generally worse than on the latest task; you can see that the loss of the latest task is always the lowest among the 10 learned tasks. However, for decision-making tasks the success rate is NOT always proportional to the loss, because decision making is a sequential process: easier tasks may retain high success rates even with a higher loss. Besides, the task order may also have some influence. This mismatch between loss and success rate is one of the reasons we built the LIBERO benchmark for the community to study.
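
To see how much the variance alone matters, here is a minimal sketch (it uses a single diagonal Gaussian over a 7-D action as a stand-in for the GMM head; the scale values are made up, not taken from LIBERO):

# Minimal sketch: the scale (std.) of the action distribution alone can move
# the negative log-likelihood from strongly negative to positive.
import torch
from torch.distributions import Independent, Normal

action = torch.zeros(7)  # evaluate the density at the predicted mean

for scale in (0.01, 0.1, 1.0):
    dist = Independent(Normal(torch.zeros(7), torch.full((7,), scale)), 1)
    nll = -dist.log_prob(action)
    print(f"scale={scale}: NLL at the mean = {nll.item():.2f}")

# scale=0.01: NLL ~ -25.8  (tight, confident policy -> very negative loss)
# scale=0.1:  NLL ~  -9.7
# scale=1.0:  NLL ~   6.4  (wide, uncertain policy -> positive loss)

So a swing from around -6 to around +52 between two runs can come largely from the policies ending up with different predicted variances rather than from one policy being dramatically worse at fitting the actions.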

Hope these can answer your concerns.

pengzhi1998 commented 2 weeks ago

Thank you, Chongkai, for your reply and clear explanation! @HeegerGao

  1. This is very clear to me. Thanks a lot!
  2. I understand that success rates don't have a strong relationship with losses, since these are sequential decision-making tasks. However, what confuses me most is that, in both runs, many of the older tasks have higher success rates than the newer ones. Additionally, the success rates differ quite a bit between the two runs.

Thank you so much again!!