Open Tschian opened 6 months ago
Hi,
I don't understand why the first token of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs features. With that ordering, the action token would not be at index 0. Does the token ordering change when the sequence passes through the transformer_decoder?
https://github.com/UT-Austin-RPL/VIOLA/blob/1e8b5ae90e73a1c2a33a369323117cd8d7b0ef36/viola_bc/modules.py#L2194 Here you can see that the action token is always assumed to be the first one.
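If that is the crux, the reasoning can be sketched like this: a transformer never permutes the token axis, so whichever token is concatenated first stays at index 0 in the output. A minimal NumPy sketch, where the shapes are made up and an identity stand-in replaces the actual transformer_decoder (these are not the real VIOLA dimensions or module):

```python
import numpy as np

# Hypothetical shapes for illustration only:
# B = batch, T = time steps, D = embedding dim.
B, T, D = 2, 10, 64

# Suppose the per-step tokens are concatenated along the token axis
# with the action token placed FIRST:
#   [action_token, spatial_context_tokens..., region_tokens..., other_obs_tokens...]
action_token = np.random.randn(B, T, 1, D)
spatial_context = np.random.randn(B, T, 4, D)
region_tokens = np.random.randn(B, T, 6, D)
other_obs = np.random.randn(B, T, 2, D)

tokens = np.concatenate([action_token, spatial_context, region_tokens, other_obs], axis=2)
print(tokens.shape)  # (2, 10, 13, 64)

# A transformer maps token i of the input to token i of the output; attention
# mixes information ACROSS tokens but never reorders the token axis. So after
# the transformer, index 0 along the token axis still corresponds to the
# action token. Here an identity stands in for transformer_decoder:
transformer_out = tokens
action_token_out = transformer_out[:, :, 0, :]
print(action_token_out.shape)  # (2, 10, 64)
```

So the `[:, :, 0, :]` index is valid as long as the action token is concatenated first when the groups are built, which is what the linked line assumes.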
In addition, about the image augmentation (padding + random crop): how many crops do you take per image? Looking through the code, only the default value num_crops=1 is used. Isn't global information lost with only a single crop? As far as I can tell from the code, the feature map is extracted from the cropped image.
The random cropping just shifts pixels by 4 or 8 (I forgot the exact number), so the cropped image should retain most of the information even after random cropping.
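For intuition, the pad + random-crop scheme described here can be sketched as follows; the pad size of 4 and the edge-padding mode are assumptions for illustration, not necessarily the exact VIOLA settings:

```python
import numpy as np

def random_shift_crop(img, pad=4, rng=None):
    """Pad the image by `pad` pixels on each side, then crop back to the
    original size at a random offset. Content shifts by at most `pad`
    pixels in each direction, so nearly all global information survives."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = np.arange(128 * 128 * 3, dtype=np.float32).reshape(128, 128, 3)
crop = random_shift_crop(img, pad=4)
print(crop.shape)  # (128, 128, 3) -- same size as the input, just shifted
```

Because the crop is the same size as the original image and the maximum shift is only a few pixels, the "global" view is essentially preserved; a single crop per image is enough.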
`action_token_out = transformer_out[:, :, 0, :]`
Could you help me figure out why and how? Thanks a lot!