Hi,

Thanks again for your contribution!

After reading the paper, I took a look at the code, especially the GPT class, and found something I am a little confused about.

In the paper, it says the input image is down-sampled to 5x22xC and the LiDAR to 8x8xC. If I understand correctly, the inference batch size (B in your comments) should be one? Also, why is the image input size B*4*seq_len, C, H, W in your comments — where does the number 4 come from? Maybe I misunderstood something.
```python
def forward(self, image_tensor, lidar_tensor, velocity):
    """
    Args:
        image_tensor (tensor): B*4*seq_len, C, H, W
        lidar_tensor (tensor): B*seq_len, C, H, W
        velocity (tensor): ego-velocity
    """
```
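For concreteness, here is a minimal sketch of the shapes I have in mind, using the 5x22 and 8x8 spatial sizes from the paper. The values of B, seq_len, and C are hypothetical placeholders I chose for illustration; the factor of 4 in the image tensor's leading dimension is exactly what I am asking about:

```python
import numpy as np

# Hypothetical values, just to make the shapes concrete.
B, seq_len, C = 1, 1, 512

# Image features down-sampled to 5x22xC, with the leading dim B*4*seq_len
# as written in the docstring -- the 4 is the part I don't understand.
image_tensor = np.zeros((B * 4 * seq_len, C, 5, 22))

# LiDAR features down-sampled to 8x8xC, leading dim B*seq_len.
lidar_tensor = np.zeros((B * seq_len, C, 8, 8))

print(image_tensor.shape)  # (4, 512, 5, 22)
print(lidar_tensor.shape)  # (1, 512, 8, 8)
```

Is this reading of the docstring shapes correct?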
Best wishes! Thanks again!