Question about evaluation failure case

Walter0807 commented 5 months ago

Hi @imankgoyal, thanks for your great work! I have a question about the failure cases in evaluation. For example, when I run evaluation with the provided checkpoint, the log shows:

Evaluating put_item_in_drawer | Episode 0 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 1 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 2 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 3 | Score: 100.0 | Episode Length: 11 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 4 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 5 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 6 | Score: 0.0 | Episode Length: 2 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 7 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 8 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 9 | Score: 0.0 | Episode Length: 25 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 10 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 11 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 12 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 13 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 14 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 15 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 16 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 17 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 18 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 19 | Score: 0.0 | Episode Length: 1 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 20 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 21 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 22 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 23 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 24 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
[Evaluation] Finished put_item_in_drawer | Final Score: 88.0

Of the 3 failure cases, I can understand Episode 9, which fails to accomplish the task within 25 steps (timeout). However, why do Episodes 6 and 19 fail after only 1-2 steps?
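The failures in the log above can be separated into timeouts and early terminations with a short script. This is only a minimal sketch: it assumes the per-episode line format shown in the log and a step limit of 25 (inferred from Episode 9). `summarize`, `LINE_RE`, and `MAX_STEPS` are illustrative names, not part of the RVT codebase.

```python
import re

# Matches the per-episode lines of the evaluation log shown above.
LINE_RE = re.compile(
    r"Evaluating (?P<task>\S+) \| Episode (?P<episode>\d+) \| "
    r"Score: (?P<score>[\d.]+) \| Episode Length: (?P<length>\d+)"
)

MAX_STEPS = 25  # assumption: inferred from Episode 9 timing out at 25 steps


def summarize(log_text, max_steps=MAX_STEPS):
    """Return (mean score, timed-out episodes, early-terminated episodes)."""
    scores, timeouts, early = [], [], []
    for match in LINE_RE.finditer(log_text):
        episode = int(match["episode"])
        score = float(match["score"])
        length = int(match["length"])
        scores.append(score)
        if score == 0.0:
            # A failed episode either used up the step budget or stopped early.
            (timeouts if length >= max_steps else early).append(episode)
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean, timeouts, early


# Four lines copied from the log above.
sample = """\
Evaluating put_item_in_drawer | Episode 5 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 6 | Score: 0.0 | Episode Length: 2 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 9 | Score: 0.0 | Episode Length: 25 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 19 | Score: 0.0 | Episode Length: 1 | Lang Goal: put the item in the bottom drawer
"""

mean, timeouts, early = summarize(sample)
print("timeouts:", timeouts, "early terminations:", early)
```

Run on the full log, the same function should reproduce the reported final score, since 22 of the 25 episodes score 100.0 (22 / 25 = 88.0).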

I did a quick investigation and found that it is caused by the transition in episode_rollout, which has transition.terminal = True after 1-2 steps, causing the rollout to halt.

Looking forward to your reply, thank you!

Walter0807 commented 5 months ago

I found that it is due to an exception in path planning:

https://github.com/NVlabs/peract/blob/5c2988edb961d67d7a921cbbc638f69947debff8/helpers/custom_rlbench_env.py#L328
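The effect can be illustrated with a small self-contained sketch. It does not reproduce the linked code: `PlanningError`, `Transition`, `safe_step`, and `rollout` are hypothetical names, and the behaviour (a planner failure is caught and turned into a terminal transition with zero reward) is an assumption based on what the log and the rollout show.

```python
from dataclasses import dataclass


class PlanningError(Exception):
    """Hypothetical stand-in for a path-planning failure."""


@dataclass
class Transition:
    reward: float
    terminal: bool
    info: dict


def safe_step(env_step, action):
    """Run one environment step; map a planner failure to a failed terminal step."""
    try:
        reward, terminal = env_step(action)
        return Transition(reward, terminal, {})
    except PlanningError as exc:
        # Assumed behaviour: the episode ends immediately with zero reward.
        return Transition(0.0, True, {"error": str(exc)})


def rollout(env_step, actions):
    """Step through actions until a terminal transition; return (steps, last transition)."""
    steps, transition = 0, None
    for action in actions:
        transition = safe_step(env_step, action)
        steps += 1
        if transition.terminal:
            break
    return steps, transition


def flaky_step(action):
    """Toy environment step: the planner fails on an unreachable target."""
    if action == "unreachable":
        raise PlanningError("no path found")
    return 0.0, False


steps, last = rollout(flaky_step, ["approach", "unreachable", "grasp"])
print(steps, last.terminal, last.reward, last.info)
```

Here the rollout halts at step 2 with a score of zero, which matches the pattern seen in Episodes 6 and 19: a short episode length combined with a failed score.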

imankgoyal commented 5 months ago

Hi @Walter0807, apologies for the delayed response. Yes, path planning exceptions are expected, and they affect the overall performance.

NVlabs / RVT, issue #42