Closed rginjapan closed 1 year ago
The 2D feature maps from the image and LiDAR branches are reshaped into 1D and fed as tokens to the transformer. The resolution of the feature maps doesn't affect the fusion part. In the cvpr2021 branch, the image and the LiDAR BEV had the same resolution at the input (256x256), which led to the same resolution of feature maps at each fusion step.
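As a rough sketch of that flattening step (the shapes here are illustrative toy values, not the repo's exact ones):

```python
import torch

# Illustrative shapes only: both branches pooled to the same 8x8 grid.
B, C, H, W = 2, 64, 8, 8
img_feat = torch.randn(B, C, H, W)    # image branch feature map
lidar_feat = torch.randn(B, C, H, W)  # LiDAR BEV branch feature map

# Flatten each HxW map into H*W tokens of dimension C, then concatenate
# along the token axis before feeding the transformer.
img_tokens = img_feat.flatten(2).transpose(1, 2)       # (B, 64, C)
lidar_tokens = lidar_feat.flatten(2).transpose(1, 2)   # (B, 64, C)
tokens = torch.cat([img_tokens, lidar_tokens], dim=1)  # (B, 128, C)
```

Since the fusion operates on this flat token sequence, the transformer itself is agnostic to the original 2D resolutions.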
Thanks for your reply, so there are no changes in the transformer fusion part? Only the resolution of the transformer's input has changed?
The fusion part is the same.
Sorry to bother you, and thanks for your always quick replies!! I am diving into the details of your code and have two simple questions:
Another question: I would like to visualize the attention maps. How should I run the evaluation to obtain them? Thanks in advance!
If I understand correctly, you only use the encoder of the Transformer for fusion. Why not consider the decoder part (excluding the Linear and Softmax at the end)?
The 2-channel LiDAR BEV representation is a 2-bin histogram of points above and at ground level. We adopt this representation from PRECOG.
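A minimal sketch of such a 2-bin histogram BEV (the function name, grid size, scaling, and ground threshold are illustrative assumptions, not the repo's exact preprocessing):

```python
import numpy as np

def lidar_to_bev(points, grid=256, pixels_per_meter=8, ground_z=0.0):
    """2-bin histogram BEV: bin 0 counts points at/below ground level,
    bin 1 counts points above it. `points` is an (N, 3) array of x, y, z."""
    bev = np.zeros((2, grid, grid), dtype=np.float32)
    # Map metric x/y coordinates to pixel indices centered on the ego vehicle.
    xs = (points[:, 0] * pixels_per_meter + grid // 2).astype(int)
    ys = (points[:, 1] * pixels_per_meter + grid // 2).astype(int)
    valid = (xs >= 0) & (xs < grid) & (ys >= 0) & (ys < grid)
    above = points[:, 2] > ground_z
    # Accumulate point counts per cell for each height bin.
    for b, mask in ((0, valid & ~above), (1, valid & above)):
        np.add.at(bev[b], (ys[mask], xs[mask]), 1.0)
    return bev
```

The two channels let the network distinguish obstacles from ground returns without a full 3D representation.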
We did not observe any improvements for seq_len > 1, but I think we didn't try all the different possibilities. I'd suggest the FIERY encoder if you want to include past observations.
You can check viz.py for visualizing attention maps.
A recent work InterFuser has used transformer decoder as well in the architecture.
Thanks for the information about InterFuser, so you also think that applying a decoder in the fusion part could improve the performance, right? Another fusion approach I found is the following, just FYI: "MULTIMODAL TRANSFORMER FUSION FOR CONTINUOUS EMOTION RECOGNITION"
You can try different architectures and see how well they work.
Could you please briefly explain the intuition behind adding the velocity embedding? Does it improve the performance? Because I do not have this information in my fusion project, I am also considering what could replace your velocity embedding. Thanks in advance!
The velocity provides information about the vehicle dynamics which can help with future motion prediction (especially when considering only single timestep input). Incorporating velocity in the architecture is a design choice and it requires trying different possibilities. I don't remember the exact ablation scores but several works (eg. CILRS, TCP, InterFuser) use velocity in different ways.
Thanks for your reply, does "single timestep input" mean the seq_len is 1 in your implementation? Is it possible to feed one modality to the encoder and another modality to the decoder of the transformer?
Yes, single timestep input means seq_len=1. The usage of different modalities at different locations in the architecture is a design choice that depends on the task and implementation details.
We have observed in the 2022 paper that the velocity embedding reduces performance, we don't use it anymore. See Table 10. Other papers have also observed the phenomenon that giving the network information about the past leads to shortcut learning and reduces performance (e.g. CILRS, Causal confusion in imitation learning, Copycat agents). The current velocity is only provided to the PID controller in the 2022 work, that has no learned components.
For the same reason we only use sequence length 1. Naively using larger sequence lengths will reduce performance I think (we have not explicitly investigated it in these works).
@Kait0 Thanks for your reply! I cannot understand why giving the network past information reduces performance; why doesn't it help? It is a sequence-to-sequence problem, and the past information is a memory, so intuitively it should help the network learn something, right? The same goes for larger sequences: I would think the larger the sequence, the easier it is for the network to learn the relations between timesteps.
You will need to read the papers if you want to gain a better understanding. This is a phenomenon that the community has consistently observed, but I don't think it is fully understood yet. I think the leading hypothesis right now is there are strong spurious correlations in the temporal data. For example at a red light there is a high probability that you are going to keep still if your velocity is 0.0. This can lead to good open loop training loss because copying the previous actions is only wrong in a few examples. Such a policy would break down during closed loop evaluation though (e.g. a model that learned to predict the future velocity by copying the past velocity will never start to drive).
Thanks for your kind explanation. Btw, I cannot understand the explanation of the attention maps in the paper (e.g. what is the meaning of the three colored boxes). Can you give me some more explanation here? Thanks in advance!
Each token is a 32x32 patch in the input modality and the attention layers in the transformer compute attention between these tokens. Yellow denotes the source token which we are analyzing and green denotes the 5 tokens which have the highest weights in the attention maps. We also highlight the presence of vehicles in LiDAR in red for ease of visualization (the red area is not an input or output of the attention layer).
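The top-5 selection can be sketched as follows, assuming the 64-image + 64-LiDAR token layout; the attention tensor here is random, for illustration only:

```python
import torch

# Illustrative layout: 128 tokens, the first 64 from the image feature map
# and the last 64 from the LiDAR feature map.
attn = torch.rand(4, 128, 128).softmax(dim=-1)  # (heads, tokens, tokens)
avg = attn.mean(dim=0)                          # average over the 4 heads
src = 10                                        # a source token to analyze
top5 = avg[src].topk(5).indices                 # 5 most-attended tokens
attends_lidar = top5 >= 64                      # True where a token is LiDAR
```

Each index in `top5` maps back to a 32x32 patch in the corresponding input modality, which is what the green boxes highlight.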
I found that I cannot use the provided pretrained model to run viz.py. Could you give me more details on running viz.py to visualize the attention maps? Thanks in advance!!
Sorry, I have solved it, thanks!
How to define the vertical and horizontal anchors for my own data?
You can increase it but you also need to consider the compute resources you have (computation increases quadratically with the anchor size)
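The quadratic growth can be sketched with toy numbers (the helper functions and shapes are illustrative, not from the repo):

```python
# Self-attention cost grows with the square of the token count, and the
# token count itself scales with H*W (the anchors).
def n_tokens(h, w, n_modalities=2, seq_len=1):
    return n_modalities * seq_len * h * w

def attn_entries(h, w):
    n = n_tokens(h, w)
    return n * n  # one NxN attention matrix per head per layer

# Doubling the anchors from 8x8 to 16x16 multiplies the cost by 16.
assert attn_entries(16, 16) == 16 * attn_entries(8, 8)
```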
Thanks for your reply, do the anchors mean the size of the grid feature map for the patch tokens?
Each feature map is of size HxWxC, which is reshaped into HW tokens of dimension C. Here, anchors refers to H and W.
I noticed that in ViLT (https://arxiv.org/abs/2102.03334), they have a model-type embedding to separate the textual and image embeddings. How do you separate LiDAR and RGB without a model-type embedding? How do you know which output of the transformer fusion belongs to LiDAR or RGB for the next conv and fusion step?
In our implementation, the first half of the output belongs to RGB and the latter half to LiDAR. We did not try using a model-type embedding.
Thanks for your reply! In your implementation of positional embedding, you used:
self.pos_emb = nn.Parameter(torch.zeros(1, (1 + 1) * seq_len * vert_anchors * horz_anchors, n_embd))
I noticed that BERT, which was the first to use a learnable positional embedding instead of the sin and cos functions in the transformer, used nn.Embedding() to implement this learnable PE. Do you know what the difference is between nn.Parameter() and nn.Embedding()?
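As an illustration of the difference (the shapes are made up, and this is not code from the repo): both store a learnable weight table, and nn.Embedding mainly adds an index-lookup forward pass on top of what a raw nn.Parameter gives you:

```python
import torch
import torch.nn as nn

n_tokens, n_embd = 128, 64

# Raw learnable tensor, added directly to the token sequence.
pos_param = nn.Parameter(torch.zeros(1, n_tokens, n_embd))

# nn.Embedding wraps a learnable table of the same kind behind an index lookup.
pos_table = nn.Embedding(n_tokens, n_embd)
idx = torch.arange(n_tokens)
pos_from_table = pos_table(idx).unsqueeze(0)  # (1, n_tokens, n_embd)
```

Either way the result is a (1, n_tokens, n_embd) additive embedding that gradients flow through, so for a fixed-length positional embedding the choice is largely an API convenience.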
We observe that for 62.75% of the image tokens, all the top-5 attended tokens belong to the LiDAR and for 88.87%, at least one token in the top-5 attended tokens belongs to the LiDAR.
What is the meaning of image tokens belonging to LiDAR? How do you define which token belongs to which modality in the output?
In the cvpr2021 branch, the transformer takes in 128 tokens as input: 64 from the image feature map and 64 from the LiDAR feature map. The output also has 128 tokens, and in our implementation, the first half of the output belongs to RGB and the latter half to LiDAR. This is by design.
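A rough sketch of that split (the batch size, channel count, and grid size are illustrative):

```python
import torch

# 128 fused tokens: first half image, latter half LiDAR (by design).
fused = torch.randn(2, 128, 64)              # (batch, tokens, channels)
img_out, lidar_out = fused.split(64, dim=1)  # two (batch, 64, channels) halves

# Reshape each half back into a 2D map for the next convolutional stage.
img_map = img_out.transpose(1, 2).reshape(2, 64, 8, 8)
lidar_map = lidar_out.transpose(1, 2).reshape(2, 64, 8, 8)
```

Because the token order is fixed by construction, no modality-type embedding is needed to route the outputs.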
I see. The dimension of atten_map in the cvpr2021 branch is 24x4x128x128; in the paper you said top-5 attention weights in 24 tokens. How should I understand the top-5 among these 24 attention maps (4x128x128)?
It's the top-5 attention weights for each of the 24 tokens.
Where can I obtain the weight of each attention token?
@ap229997 Sorry for the stupid questions. I would like to confirm my understanding of the attention map visualization. One token among all 128 (64 + 64) tokens generates an attention vector of size 1x128, and if the highest value lies in the first 64 entries, the attended token belongs to the image, otherwise it belongs to the LiDAR. Am I correct?
If I am correct, the 128x128 should be an attention matrix containing the attention of all input tokens?
Yes.
Yes, '4' corresponds to the 4 attention heads in our architecture and the first dimension is the batch size.
Why can the LiDAR and RGB image be fused at different resolutions in the 2022 branch? Sorry, I have not gone through the details of the code. Thanks in advance!