Lifelong-Robot-Learning / LIBERO

Benchmarking Knowledge Transfer in Lifelong Robot Learning

Questions regarding the benchmark details and the model checkpoints #2

Closed liuzuxin closed 1 year ago

liuzuxin commented 1 year ago

Hi @Cranial-XIX @zhuyifengzju , thanks for the great work! Impressive! I have a few questions:

  1. The model checkpoint links on the website are empty. I would like to know when these checkpoints will be available.
  2. If I understand correctly, each of the spatial, goal, and object datasets contains 10 tasks defined by 10 language instructions. Each task contains 50 fixed initial states, and each initial state has its corresponding demonstrations in the dataset, right?
  3. Regarding the pretraining experiment in Figure 3, what does the success rate mean? Are the results averaged over all LIBERO_10 tasks after performing full fine-tuning on LIBERO_10? It is unclear to me what the settings are for the results and the methods (particularly w/o pretraining and multitask) in Figure 3. Also, I am curious whether there is any intuition or insight into why pretraining on LIBERO_90 does not work well.
  4. Where can we find the results (success rate) of pretraining on LIBERO_90 and then testing on LIBERO_90?

Again, I am very interested in this work and feedback would be highly appreciated. Thanks in advance.

Cranial-XIX commented 1 year ago

Thanks for your interest in our work!

  1. We will provide the checkpoints soon.
  2. Each task has 50 demonstrations and 50 initial states. The initial states are not necessarily the same as the initial states in the demonstrations, but they are sampled i.i.d. from the same initial-state distribution; this lets us test in-distribution generalization (a toy sketch follows this list).
  3. The success rate is the average success rate on LIBERO_10, using a network initialized either from scratch or from a pretrained model. For comparison, we also provide the average success rate on LIBERO_10 using multitask learning, as it serves as an upper bound.
  4. We do not provide success rate results on LIBERO_90, but they can be reproduced once we release the pretrained checkpoints.
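As a toy illustration of point 2 (a hypothetical snippet, not the actual LIBERO data pipeline): the evaluation states can be fixed by sampling them once, i.i.d., from the same distribution the demo initial states came from.

```python
import numpy as np

# Hypothetical sketch: sample 50 evaluation initial states per task i.i.d. from
# the same initial-state distribution used when collecting demonstrations.
# A fixed seed keeps the 50 states identical across methods and runs.
rng = np.random.default_rng(seed=0)

def sample_init_state(rng):
    # Stand-in for sampling object poses from the task's initial-state distribution.
    return rng.uniform(low=-0.05, high=0.05, size=7)

eval_init_states = [sample_init_state(rng) for _ in range(50)]
```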

Thanks again for reaching out and sorry for the wait. We are still developing the codebase.

liuzuxin commented 1 year ago

Thanks for your reply! Looking forward to the release.

Regarding 2, do we need to evaluate each task by traversing all 50 initial states to get an unbiased estimate? Or is evaluating on the same subset (such as the 20 in your example evaluate.py code) also fair? Regarding 3, I am still a little confused about how the success rate is obtained: is it one of the AUC, FWT, NBT metrics proposed in the paper? Do you also apply sequential training for the pretrained and w/o-pretraining models? Lastly, I am also curious how the maximum timestep of 600 was selected. Does this value correspond to the maximum demonstration length? Can we use smaller values to accelerate evaluation?

Cranial-XIX commented 1 year ago

  1. We use 20 for evaluation as it is faster, and we empirically found no major difference from using 50. But for convenience we also provide the remaining 30 initial states in case people want a result with smaller variance.
  2. Success rates are the average success rate on LIBERO_10 (average of 20 rollouts per task for 10 tasks, so 200 rollouts in total) after the method has been applied to learn the 10 tasks sequentially; see the sketch after this list. We do not apply sequential training for the pretrained or w/o-pretraining models, because forgetting is severe under sequential finetuning, so little could be concluded from a model that was pretrained versus not.
  3. 600 was chosen based on our empirical findings: it is long enough to complete any task, but short enough for efficiency.
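For concreteness, here is a rough sketch of how that number aggregates (hypothetical `policy`/`env` interfaces, not the actual LIBERO evaluation code): 20 rollouts per task, 10 tasks, each rollout capped at 600 steps.

```python
import numpy as np

MAX_STEPS = 600  # per-episode horizon discussed above

def rollout_success(policy, env, init_state, max_steps=MAX_STEPS):
    """Run one episode from a fixed initial state; return True on task success."""
    obs = env.reset_to(init_state)           # assumed env API; LIBERO's may differ
    for _ in range(max_steps):
        obs, _, done, info = env.step(policy.act(obs))
        if done:
            return bool(info.get("success", False))
    return False

def benchmark_success_rate(policy, envs, init_states):
    # envs: one env per task (10 total); init_states[t]: the 20 fixed states for task t
    per_task = [
        np.mean([rollout_success(policy, env, s) for s in states])
        for env, states in zip(envs, init_states)
    ]
    return float(np.mean(per_task))           # average over tasks -> the reported number
```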
liuzuxin commented 1 year ago

Thanks for the quick response!

liuzuxin commented 1 year ago

Hi @Cranial-XIX , I noticed that in dataset.py you mention that frame_stack is used instead of seq_len. However, in main.py the seq_len argument is set to 10 while frame_stack is set to 1. Would you mind clarifying the padding difference between them, as mentioned in the comment, and which one we should use? Thanks!

Cranial-XIX commented 1 year ago

Thanks for asking. Based on my understanding, they mainly differ in the padding style (if you set the padding options like pad_frame_stack or pad_seq_length to True). Please see here. Assume your sequence is [0,1,2,3,4,5]: setting seq_len=5 and frame_stack=1 with both paddings enabled will start at [0,1,2,3,4] and end at [5,x,x,x,x], where x denotes a zero-padding frame. Setting frame_stack=5 and seq_len=1 will start at [x,x,x,x,0] and end at [1,2,3,4,5].
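A minimal sketch of the two padding styles described above (my own illustration, not the actual dataset sampler):

```python
def seq_len_windows(frames, seq_len):
    """One window per start index; zero-padding ('x') appended at the tail."""
    return [
        frames[start:start + seq_len] + ["x"] * max(0, seq_len - (len(frames) - start))
        for start in range(len(frames))
    ]

def frame_stack_windows(frames, stack):
    """One window per end index; zero-padding ('x') prepended at the head."""
    return [
        ["x"] * max(0, stack - 1 - end) + frames[max(0, end - stack + 1):end + 1]
        for end in range(len(frames))
    ]

frames = [0, 1, 2, 3, 4, 5]
print(seq_len_windows(frames, 5)[0], seq_len_windows(frames, 5)[-1])
# [0, 1, 2, 3, 4] [5, 'x', 'x', 'x', 'x']
print(frame_stack_windows(frames, 5)[0], frame_stack_windows(frames, 5)[-1])
# ['x', 'x', 'x', 'x', 0] [1, 2, 3, 4, 5]
```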

But note that in practice, when we do rollouts, we start from index 0 (no padding, because both the LSTM and the Transformer can handle dynamic-length inputs). So we will remove that comment. Thanks for catching that.

YuyangSunshine commented 4 months ago

Hi @Cranial-XIX , thanks for your efforts and wonderful work. Have you provided the checkpoints yet? I didn't find the corresponding files. Best.