askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License

Mismatch between the expert demonstration actions and environment action #136

Closed. wzcai99 closed this issue 1 year ago.

wzcai99 commented 1 year ago

I tried to verify the performance of the expert demonstrations in the downloaded dataset, but I found that the turning angle of the rotate action differs between the dataset and the environment. In simulation, the agent turns about 30 degrees per rotation action, but in the dataset, two consecutive images look nearly identical. For example, at the start of an episode the initial RGB frame looks like this: [image-0]. The next frame in the dataset looks like this: [image-1]. But in simulation the result is noticeably different: [test-1].

I was wondering where I might have gone wrong.

MohitShridhar commented 1 year ago

@wzcai99

Hmm... maybe you are looking at the video-interpolation frames in the dataset. Are you sure the dataset trajectory never eventually reaches the same angle as the simulator? From the FAQ:

The Full Dataset contains extracted ResNet features for each frame in ['images'], which include filler frames in between each low-level action (used to generate smooth videos), whereas the Modeling Quickstart only contains features for each low_idx, i.e. for the frames after taking each low-level action.
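A quick way to sanity-check which version you have (a minimal sketch; the trajectory path is a placeholder, and it assumes the Full Dataset's per-trajectory feat_conv.pt holds one feature row per frame):

```python
import json
import torch

# Hypothetical path to one trajectory from the Full Dataset download.
traj_dir = "data/full_2.1.0/train/<task>/<trial>"

with open(f"{traj_dir}/traj_data.json") as f:
    traj = json.load(f)

feats = torch.load(f"{traj_dir}/feat_conv.pt")  # precomputed ResNet features

num_frames = len(traj["images"])                # includes video filler frames
num_actions = len(traj["plan"]["low_actions"])  # actual low-level actions
print(f"frames: {num_frames}, low actions: {num_actions}, features: {feats.shape[0]}")
# In the Full Dataset, the feature count follows num_frames (> num_actions);
# in the Modeling Quickstart, it follows the number of distinct low_idx values.
```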

wzcai99 commented 1 year ago


@MohitShridhar Yes, the dataset trajectory does eventually reach the same angle as the simulator.
I want to replace the ResNet features with features from other backbones, which is why I am using the Full Dataset. But since it contains the interpolated images, do I need to manually remove them to train a policy? Also, rechecking the JSON file: does the maximum image['low_idx'] represent the episode length?

MohitShridhar commented 1 year ago

@wzcai99, yes: low_idx is assigned to every frame (including the interpolation frames used for video), while high_idx indexes the task-planning-level actions like GotoLocation (if I remember correctly).
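To make the mapping concrete, here is a minimal sketch (assuming the standard traj_data.json fields) that groups frames by low_idx and checks the episode-length question:

```python
import json
from collections import Counter

# Hypothetical path; point this at any trajectory's annotation file.
with open("traj_data.json") as f:
    traj = json.load(f)

# Each 'images' entry records which low-level action (low_idx) and
# high-level subgoal (high_idx) the frame belongs to.
frames_per_action = Counter(img["low_idx"] for img in traj["images"])
print(frames_per_action)  # low_idx values with count > 1 have filler frames

max_low_idx = max(img["low_idx"] for img in traj["images"])
print(max_low_idx + 1, len(traj["plan"]["low_actions"]))
# These two numbers should agree, i.e. max low_idx + 1 == episode length.
```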

vlongle commented 1 year ago

I'm also trying to use a different feature extractor. Is there an easy way to figure out which frames are interpolated so that they can be removed?

wzcai99 commented 1 year ago

@vlongle, from my experience, the trajectory's JSON file contains a list of image entries, each with an image name and a low_idx. I enumerate the whole list and keep only the first image among the redundant ones that share the same low_idx. Since I use PyTorch for training, this only has to be done once when initializing the dataloader, so it doesn't affect training speed.
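For reference, here is a minimal sketch of that filtering step as a PyTorch dataset (the class name and path are hypothetical; it assumes the standard traj_data.json fields):

```python
import json
from torch.utils.data import Dataset

class AlfredKeyframes(Dataset):
    """Keeps only the first frame per low_idx, dropping video filler frames."""

    def __init__(self, traj_json_path):
        with open(traj_json_path) as f:
            traj = json.load(f)

        # 'images' is ordered, so the first entry seen for each low_idx is the
        # frame associated with that low-level action; the rest are filler.
        seen = set()
        self.frames = []
        for img in traj["images"]:
            if img["low_idx"] not in seen:
                seen.add(img["low_idx"])
                self.frames.append(img)

        # After filtering there should be one frame per low-level action.
        self.actions = traj["plan"]["low_actions"]

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, i):
        # Return the frame's file name paired with the low-level API action
        # at the same index; image loading / feature extraction for a new
        # backbone would go here.
        return self.frames[i]["image_name"], self.actions[i]["api_action"]["action"]
```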