StarCycle closed this issue 4 months ago.
Hi @StarCycle, the target for the RGB prediction is three timesteps in the future, but the interval between two consecutive timesteps in an input sequence is one.
Hi @bdrhtw,
If I understand correctly, the loss function is:
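```python
import torch.nn.functional as F
# Add two trailing singleton dims so the mask broadcasts over each frame's loss; keep target steps 3..T-1.
rgb_mask = batch['mask'].unsqueeze(-1).unsqueeze(-1)[:, 3:]
# The prediction at step t is scored against the target frame at step t + 3.
loss['rgb_static'] = (F.mse_loss(pred['obs_preds'][:, :-3], pred['obs_targets'][:, 3:], reduction='none')*rgb_mask).mean()
loss['rgb_gripper'] = (F.mse_loss(pred['obs_hand_preds'][:, :-3], pred['obs_hand_targets'][:, 3:], reduction='none')*rgb_mask).mean()
```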
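To make the indexing and the reduction concrete, here is a minimal pure-Python sketch of the same shifted, masked MSE. It assumes each frame is reduced to a single float and a batch is a list of per-sequence lists; `shifted_masked_mse` is a hypothetical name, not a function from the repository. Like `.mean()` in the snippet above, it divides by every (sequence, step) pair, so masked-out steps contribute zero error but still count in the denominator.

```python
def shifted_masked_mse(preds, targets, mask, shift=3):
    """Pure-Python analogue of the masked, time-shifted MSE (illustrative only).

    preds, targets, mask: lists of length B, each inner list of length T.
    Each frame is a single float here; the real tensors also carry
    image/patch dimensions. preds[b][t] is compared with
    targets[b][t + shift] and weighted by mask[b][t + shift].
    """
    total = 0.0
    count = 0
    for p_seq, t_seq, m_seq in zip(preds, targets, mask):
        for t in range(len(p_seq) - shift):
            err = (p_seq[t] - t_seq[t + shift]) ** 2
            total += err * m_seq[t + shift]
            count += 1  # masked-out steps still count, as with Tensor.mean()
    return total / count


# One sequence of length 5: only steps 0 and 1 have a target 3 steps ahead.
# Step 0 error is (2 - 1)^2 = 1; step 1 is masked out, so the result is 1 / 2.
print(shifted_masked_mse([[2.0, 2.0, 0.0, 0.0, 0.0]],
                         [[0.0, 0.0, 0.0, 1.0, 5.0]],
                         [[1, 1, 1, 1, 0]]))  # 0.5
```

Normalizing by the number of unmasked steps instead (sum divided by the mask sum) would give 1.0 here; the sketch follows the plain `.mean()` used in the quoted code.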
So you use the first 7 items of `pred['obs_preds']` to predict the last 7 items of `pred['obs_targets']`. That way, "the target for the RGB prediction is three timesteps in the future, but the interval between two timesteps in an input sequence is one" is achieved.
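The index alignment can be checked with a few lines of plain Python. This sketch assumes a sequence length of 10, which is what "first 7" and "last 7" with a shift of 3 imply; `shifted_pairs` is a hypothetical helper, not a function from the repository.

```python
def shifted_pairs(seq_len, shift=3):
    """Return (prediction_index, target_index) pairs for a shifted loss.

    Mirrors slicing preds[:, :-shift] against targets[:, shift:]:
    the prediction at step t is scored against the frame at t + shift.
    Requires shift >= 1 (a slice of [:-0] would be empty).
    """
    steps = list(range(seq_len))
    return list(zip(steps[:-shift], steps[shift:]))


pairs = shifted_pairs(10)
print(len(pairs), pairs[0], pairs[-1])  # 7 (0, 3) (6, 9)
```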
Is my understanding correct?
Best, Star Cycle
Yes, it's correct.
Thanks!
Hi @bdrhtw @hongtaowu67,
In Appendix A.4 you mentioned:
And in Appendix A.1:
The input and output shapes of the network are:
If I understand correctly, when finetuning on the CALVIN dataset, the interval between `rgb_data[0, 0]` and `rgb[0, 1]` is ∆t = 3, and the interval between `prediction['obs_preds'][0, 0]` and `prediction['obs_preds'][0, 1]` is also ∆t = 3. Since you use a relative end-effector (EE) action space, is `prediction['arm_action_preds']` actually the sum of 3 consecutive relative actions? For example,