Hi @imankgoyal, thanks for your great work! I have a question about the failure cases in evaluation. For example, when I run evaluation with the provided checkpoint, the log shows:
Evaluating put_item_in_drawer | Episode 0 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 1 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 2 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 3 | Score: 100.0 | Episode Length: 11 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 4 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 5 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 6 | Score: 0.0 | Episode Length: 2 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 7 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 8 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 9 | Score: 0.0 | Episode Length: 25 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 10 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 11 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 12 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 13 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 14 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 15 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 16 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 17 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 18 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 19 | Score: 0.0 | Episode Length: 1 | Lang Goal: put the item in the bottom drawer
Evaluating put_item_in_drawer | Episode 20 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 21 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 22 | Score: 100.0 | Episode Length: 13 | Lang Goal: put the item in the top drawer
Evaluating put_item_in_drawer | Episode 23 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the middle drawer
Evaluating put_item_in_drawer | Episode 24 | Score: 100.0 | Episode Length: 12 | Lang Goal: put the item in the bottom drawer
[Evaluation] Finished put_item_in_drawer | Final Score: 88.0
Of the 3 failure cases, I can understand Episode 9, which fails to accomplish the task within 25 steps (timeout). However, why do Episodes 6 and 19 fail after only 1-2 steps?
I did a quick investigation and found that it is caused by the `transition` in `episode_rollout`, which has `transition.terminal = True` after 1-2 steps, causing the rollout to halt.
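To make the three outcomes concrete, here is a minimal toy sketch of how I understand the rollout loop. It is not the actual evaluation code: the `Transition` dataclass, `rollout`, and the step functions below are hypothetical stand-ins, and I am assuming that the loop stops as soon as `terminal` is set and that the score is the final reward times 100.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Transition:
    """Hypothetical stand-in for the transition returned by one env step."""
    reward: float
    terminal: bool


def rollout(step_fn: Callable[[int], Transition],
            max_steps: int = 25) -> Tuple[float, int, str]:
    """Run one toy episode; return (score, episode length, stop reason)."""
    for t in range(1, max_steps + 1):
        tr = step_fn(t)
        if tr.terminal:
            # Assumption: a terminal flag ends the episode immediately,
            # whether or not the task succeeded.
            reason = "success" if tr.reward > 0 else "early terminal, no reward"
            return tr.reward * 100.0, t, reason
    return 0.0, max_steps, "timeout"


def success_at(n: int) -> Callable[[int], Transition]:
    """Toy episode that succeeds at step n."""
    return lambda t: Transition(reward=1.0 if t == n else 0.0, terminal=(t == n))


def abort_at(n: int) -> Callable[[int], Transition]:
    """Toy episode where terminal is set at step n without any reward."""
    return lambda t: Transition(reward=0.0, terminal=(t == n))


def never_terminal(t: int) -> Transition:
    """Toy episode that never terminates, so it runs into the step limit."""
    return Transition(reward=0.0, terminal=False)


for name, fn in [("like Episode 0", success_at(12)),
                 ("like Episode 6", abort_at(2)),
                 ("like Episode 9", never_terminal)]:
    score, length, reason = rollout(fn)
    print(f"{name} | Score: {score} | Episode Length: {length} | {reason}")
```

In this sketch the Episode 6 and 19 pattern corresponds to the `abort_at` case: `terminal` becomes `True` with zero reward, so the episode is scored 0.0 long before the 25-step limit. What I cannot tell from the log is what sets `terminal` that early in the real rollout.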
Looking forward to your reply, thank you!