Open Tschian opened 6 months ago
Hi,
I don't understand why the first token of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs features. With that ordering, the action token would not be at index 0. Does the token ordering change when the sequence passes through the transformer_decoder?
https://github.com/UT-Austin-RPL/VIOLA/blob/1e8b5ae90e73a1c2a33a369323117cd8d7b0ef36/viola_bc/modules.py#L2194 Here you can see that the action token is always assumed to be the first one.
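If that is the crux, the reasoning can be sketched like this: a transformer never permutes the token axis, so whichever token is concatenated first stays at index 0 in the output. A minimal NumPy sketch, where the shapes are made up and an identity stand-in replaces the actual transformer_decoder (these are not the real VIOLA dimensions or module):

```python
import numpy as np

# Hypothetical shapes for illustration only:
# B = batch, T = time steps, D = embedding dim.
B, T, D = 2, 10, 64

# Suppose the per-step tokens are concatenated along the token axis
# with the action token placed FIRST:
#   [action_token, spatial_context_tokens..., region_tokens..., other_obs_tokens...]
action_token = np.random.randn(B, T, 1, D)
spatial_context = np.random.randn(B, T, 4, D)
region_tokens = np.random.randn(B, T, 6, D)
other_obs = np.random.randn(B, T, 2, D)

tokens = np.concatenate([action_token, spatial_context, region_tokens, other_obs], axis=2)
print(tokens.shape)  # (2, 10, 13, 64)

# A transformer maps token i of the input to token i of the output; attention
# mixes information ACROSS tokens but never reorders the token axis. So after
# the transformer, index 0 along the token axis still corresponds to the
# action token. Here an identity stands in for transformer_decoder:
transformer_out = tokens
action_token_out = transformer_out[:, :, 0, :]
print(action_token_out.shape)  # (2, 10, 64)
```

So the `[:, :, 0, :]` index is valid as long as the action token is concatenated first when the groups are built, which is what the linked line assumes.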
In addition, about the image augmentation (padding + random crop): how many crops do you take per image? Looking through the code, only the default value num_crops=1 is used. Isn't global information lost with only a single crop? As far as I can tell from the code, the feature map is extracted from the cropped image.
The random cropping just shifts pixels by 4 or 8 (I forgot the exact number), so the cropped image should retain most of the information even after random cropping.
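For intuition, the pad + random-crop scheme described here can be sketched as follows; the pad size of 4 and the edge-padding mode are assumptions for illustration, not necessarily the exact VIOLA settings:

```python
import numpy as np

def random_shift_crop(img, pad=4, rng=None):
    """Pad the image by `pad` pixels on each side, then crop back to the
    original size at a random offset. Content shifts by at most `pad`
    pixels in each direction, so nearly all global information survives."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

img = np.arange(128 * 128 * 3, dtype=np.float32).reshape(128, 128, 3)
crop = random_shift_crop(img, pad=4)
print(crop.shape)  # (128, 128, 3) -- same size as the input, just shifted
```

Because the crop is the same size as the original image and the maximum shift is only a few pixels, the "global" view is essentially preserved; a single crop per image is enough.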
`action_token_out = transformer_out[:, :, 0, :]`
Could you help me figure out why and how? Thanks a lot!