NVlabs / RVT

Official Code for RVT-2 and RVT
https://robotic-view-transformer-2.github.io/

Data collection issues #33

Closed LemonWade closed 11 months ago

LemonWade commented 11 months ago

Hi @imankgoyal

Firstly, I would like to express my gratitude for your exceptional work. It's truly inspiring and helpful.

I have a question about data handling, specifically regarding keyframes. Consider a scenario with three keyframes, where the first keyframe is at frame 36. When collecting data for frames 10, 20, and 30, it appears that all three keyframes are re-collected each time, with the timestep decreasing uniformly across them. Moreover, in low_dim_state, frames 10, 20, and 30 all seem to be treated as the start of the episode, with the timestep set to 1 for each.

However, I'm a bit puzzled about the handling of frame 40. Why is the timestep reset to 1 at this point? Additionally, during the evaluation process, I haven't noticed any reset of the timestep. In my opinion, it might be more consistent to disregard the timestep. What are your thoughts on this approach? Any clarification or insight you could provide would be greatly appreciated.
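
To make this concrete, here is a minimal sketch of how I understand the relative timestep to behave in this scenario; the keypoint positions other than 36 are made up for illustration, and the indexing only mirrors my reading of the dataloader:

episode_keypoints = [36, 80, 120]  # only 36 comes from the scenario above
for frame in [10, 20, 30, 40]:
    # index of the first keypoint after this frame
    next_keypoint_idx = next(
        i for i, kp in enumerate(episode_keypoints) if kp > frame
    )
    # same relative indexing as the dataloader: t = k - next_keypoint_idx
    ts = [k - next_keypoint_idx for k in range(next_keypoint_idx, len(episode_keypoints))]
    print(frame, next_keypoint_idx, ts)
# frames 10/20/30 -> next_keypoint_idx 0, t = [0, 1, 2]
# frame  40       -> next_keypoint_idx 1, t = [0, 1]  (t starts at 0 again)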

Thank you for your time and consideration.

Best regards, LemonWade

imankgoyal commented 11 months ago

Hi @LemonWade,

Thanks for your interest in our work and kind words.

Your confusion is valid. If 36 is the first keyframe, then frame 40 should ideally be at timestep 2. Could you point to the code where you see this issue? This could be something borrowed from PerAct's dataloader that I do not remember off the top of my head.

Best, Ankit

LemonWade commented 11 months ago

Let's look at the following function in the utils/dataset.py file:

def _add_keypoints_to_replay(
    ...
    next_keypoint_idx: int,
    ...
):
    prev_action = None
    obs = inital_obs
    for k in range(
        next_keypoint_idx, len(episode_keypoints)
    ):  # confused here, it seems that there are many similar samples in the replay
        keypoint = episode_keypoints[k]
        obs_tp1 = demo[keypoint]
        obs_tm1 = demo[max(0, keypoint - 1)]
        (
            trans_indicies,
            rot_grip_indicies,
            ignore_collisions,
            action,
            attention_coordinates,
        ) = _get_action(
            obs_tp1,
            obs_tm1,
            rlbench_scene_bounds,
            voxel_sizes,
            rotation_resolution,
            crop_augmentation,
        )
        ...
        obs_dict = extract_obs(
            obs,
            CAMERAS,
            t=k - next_keypoint_idx,
            prev_action=prev_action,
            episode_length=25,
        )

As we can see,

for k in range(
    next_keypoint_idx, len(episode_keypoints)
):

together with t=k - next_keypoint_idx, this implies that every time this function runs, the first sample saved always has t equal to 0.

This means that once we cross the 0th keyframe and the 1st keyframe becomes next_keypoint_idx, the t for the first sample collected at that point is reset to 0.

Then t is processed by the extract_obs function into a number in [-1, 1] (t = 0 maps to 1), which is stored in low_dim_state. Later it is embedded and fed into the model.
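
For reference, here is a minimal sketch of that mapping, assuming extract_obs follows PerAct's (1 - t / (episode_length - 1)) * 2 - 1 convention with the episode_length=25 used above:

def normalize_timestep(t: int, episode_length: int = 25) -> float:
    # maps t = 0 -> 1.0 and t = episode_length - 1 -> -1.0
    return (1.0 - t / float(episode_length - 1)) * 2.0 - 1.0

print(normalize_timestep(0))   # 1.0, the value every first-keypoint sample gets
print(normalize_timestep(1))   # ~0.917
print(normalize_timestep(24))  # -1.0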

imankgoyal commented 11 months ago

Hi @LemonWade,

Great question. Unfortunately, I do not have a very good answer for it. What I would say is that this setup is the same as what is done in PerAct (https://github.com/peract/peract/blob/b73cf504567122f81912964943a16cd83920e4d9/agents/peract_bc/launch_utils.py#L265-L273) and we keep everything in the RLBench dataset the same for fairness.

For the real world, we did as you suggested, as mentioned here: https://github.com/NVlabs/RVT/issues/33#issuecomment-1823685357

For RLBench, I would recommend trying the alternative, i.e. the timestep does not reset to zero, and seeing if that helps. I would be very interested in the results :)
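
For instance, a hypothetical one-line change (just a sketch, not something we have tested) would be to pass the absolute keypoint index to extract_obs so the timestep keeps increasing across keypoints:

obs_dict = extract_obs(
    obs,
    CAMERAS,
    t=k,  # was: t=k - next_keypoint_idx
    prev_action=prev_action,
    episode_length=25,
)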

Best, Ankit

LemonWade commented 11 months ago

Thank you for your answer. I will look into it, and if I find anything interesting, I will reopen this issue.