autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

The difference between cvpr2021 and 2022 in fusion part #89

Closed rginjapan closed 1 year ago

rginjapan commented 1 year ago

Why can the LiDAR and RGB image be fused at different resolutions in the 2022 branch? Sorry, I have not gone through the details of the code. Thanks in advance!

ap229997 commented 1 year ago

The 2D feature maps from the image and LiDAR are reshaped into 1D and fed as tokens to the transformer. The resolution of feature maps doesn't affect the fusion part. In the cvpr2021 branch, the image and LiDAR BEV had the same resolution at the input (256x256) so that led to the same resolution of feature maps at each fusion step.
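
As an illustration of how feature maps with different resolutions can still be fused as tokens, here is a minimal PyTorch sketch (the channel count, anchor grid, and generic TransformerEncoder are illustrative assumptions, not the repository code): both maps are pooled to the same anchor grid, flattened into tokens, concatenated, and passed through the transformer.

```python
import torch
import torch.nn as nn

B, C = 2, 512                       # batch size and feature channels (illustrative)
vert_anchors, horz_anchors = 8, 8   # token grid: 8x8 = 64 tokens per modality

image_feat = torch.randn(B, C, 20, 44)   # image feature map (arbitrary resolution)
lidar_feat = torch.randn(B, C, 16, 16)   # LiDAR BEV feature map (arbitrary resolution)

pool = nn.AdaptiveAvgPool2d((vert_anchors, horz_anchors))

def to_tokens(feat):
    # (B, C, H, W) -> (B, H*W, C): every cell of the anchor grid becomes one token
    return pool(feat).flatten(2).permute(0, 2, 1)

tokens = torch.cat([to_tokens(image_feat), to_tokens(lidar_feat)], dim=1)  # (B, 128, C)

layer = nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True)
fused = nn.TransformerEncoder(layer, num_layers=1)(tokens)                 # (B, 128, C)

# the first 64 output tokens correspond to the image, the last 64 to the LiDAR
img_out, lidar_out = fused[:, :64], fused[:, 64:]
print(img_out.shape, lidar_out.shape)   # torch.Size([2, 64, 512]) twice
```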

rginjapan commented 1 year ago

Thanks for your reply. So there are no changes in the transformer fusion part, only the resolution of the transformer's input has changed?

ap229997 commented 1 year ago

The fusion part is the same.

rginjapan commented 1 year ago

Sorry to bother you, and thanks for your always quick replies!! I am diving into the details of your code and have two simple questions:

  1. Why does the LiDAR image have two channels, and what do they represent?
  2. Why is seq_len 1? Why not use longer sequence data as input?

rginjapan commented 1 year ago

Another question: I would like to visualize the attention maps. How do I run the evaluation to obtain them? Thanks in advance!

rginjapan commented 1 year ago

If I understand correctly, you only use the Transformer encoder for fusion. Why not also consider the decoder part (excluding the final Linear and Softmax)?

ap229997 commented 1 year ago

The 2-channel LiDAR BEV representation is a 2-bin histogram of points above and at ground level. We adopt this representation from PRECOG.

We did not observe any improvements for seq_len > 1, though we did not try many different possibilities. I'd suggest the FIERY encoder if you want to include past observations.

You can check viz.py for visualizing attention maps.

A recent work InterFuser has used transformer decoder as well in the architecture.
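
For the 2-bin BEV representation mentioned above, here is a minimal NumPy sketch (the grid bounds, resolution, and ground threshold are illustrative assumptions, not the repository's exact values):

```python
import numpy as np

def lidar_to_histogram_bev(points, ground_z=0.0, x_range=(-16, 16), y_range=(0, 32), pixels=256):
    """points: (N, 3) array of x, y, z coordinates in the ego frame."""
    def histogram_2d(pts):
        x_edges = np.linspace(*x_range, pixels + 1)
        y_edges = np.linspace(*y_range, pixels + 1)
        hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=(x_edges, y_edges))
        return hist

    below = points[points[:, 2] <= ground_z]   # returns at/below ground level
    above = points[points[:, 2] > ground_z]    # returns above ground (obstacles)
    return np.stack([histogram_2d(below), histogram_2d(above)], axis=0)  # (2, 256, 256)

bev = lidar_to_histogram_bev(np.random.rand(1000, 3) * [32, 32, 3] - [16, 0, 1])
print(bev.shape)  # (2, 256, 256)
```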

rginjapan commented 1 year ago

Thanks for the information about InterFuser. So you also think that applying a decoder in the fusion part could improve performance, right? Another fusion approach I found is the following, just FYI: "Multimodal Transformer Fusion for Continuous Emotion Recognition".

[Screenshot: fusion architecture from the referenced paper]

ap229997 commented 1 year ago

You can try different architectures and see how well they work.

rginjapan commented 1 year ago

Could you please briefly explain the intuition behind adding the velocity embedding? Does it improve performance? I do not have this information in my fusion project, and I am considering what could take the place of your velocity embedding. Thanks in advance!

ap229997 commented 1 year ago

The velocity provides information about the vehicle dynamics which can help with future motion prediction (especially when considering only single timestep input). Incorporating velocity in the architecture is a design choice and it requires trying different possibilities. I don't remember the exact ablation scores but several works (e.g. CILRS, TCP, InterFuser) use velocity in different ways.
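
One common way to incorporate such a scalar measurement, shown here as a hedged sketch rather than the exact repository implementation, is to project the velocity to the token dimension and broadcast-add it to every token before the transformer layers:

```python
import torch
import torch.nn as nn

B, num_tokens, n_embd = 2, 128, 512
tokens = torch.randn(B, num_tokens, n_embd)       # fused image + LiDAR tokens
velocity = torch.tensor([[4.2], [0.0]])           # (B, 1) ego speed in m/s

vel_emb = nn.Linear(1, n_embd)                    # learnable projection of the scalar speed
tokens = tokens + vel_emb(velocity).unsqueeze(1)  # (B, 1, n_embd) broadcast over all tokens
```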

rginjapan commented 1 year ago

Thanks for your reply. Does "single timestep input" mean that seq_len is 1 in your implementation? Also, is it possible to feed one modality into the transformer encoder and another modality into the decoder?

ap229997 commented 1 year ago

Yes, single timestep input means seq_len=1. The usage of different modalities at different locations in the architecture is a design choice that depends on the task and implementation details.

Kait0 commented 1 year ago

We observed in the 2022 paper that the velocity embedding reduces performance, so we don't use it anymore; see Table 10. Other papers have also observed the phenomenon that giving the network information about the past leads to shortcut learning and reduces performance (e.g. CILRS, Causal Confusion in Imitation Learning, Copycat agents). In the 2022 work, the current velocity is only provided to the PID controller, which has no learned components.

For the same reason we only use sequence length 1. I think naively using larger sequence lengths will reduce performance (we have not explicitly investigated this in these works).

rginjapan commented 1 year ago

@Kait0 Thanks for your reply! I cannot understand why giving the network past information reduces performance rather than helping. It is a sequence-to-sequence problem; the past information is a memory, so intuitively it should help the network learn something, right? The same goes for longer sequences: I would think that the longer the sequence, the easier it is for the network to learn the relationships between timesteps.

Kait0 commented 1 year ago

You will need to read the papers if you want to gain a better understanding. This is a phenomenon that the community has consistently observed, but I don't think it is fully understood yet. I think the leading hypothesis right now is there are strong spurious correlations in the temporal data. For example at a red light there is a high probability that you are going to keep still if your velocity is 0.0. This can lead to good open loop training loss because copying the previous actions is only wrong in a few examples. Such a policy would break down during closed loop evaluation though (e.g. a model that learned to predict the future velocity by copying the past velocity will never start to drive).

rginjapan commented 1 year ago

Thanks for your kind explanation. By the way, I cannot understand the explanation of the attention maps in the paper (e.g. what is the meaning of the three colored boxes). Can you give me some more explanation here? Thanks in advance!

ap229997 commented 1 year ago

Each token is a 32x32 patch in the input modality and the attention layers in the transformer compute attention between these tokens. Yellow denotes the source token which we are analyzing and green denotes the 5 tokens which have the highest weights in the attention maps. We also highlight the presence of vehicles in LiDAR in red for ease of visualization (the red area is not an input or output of the attention layer).
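
As an illustration of that procedure, a minimal sketch (random attention values, a hypothetical source token index, and head-averaging as an assumption; viz.py is the reference) of picking the top-5 attended tokens and mapping them back to the 8x8 grid of 32x32 patches:

```python
import torch

attn = torch.softmax(torch.randn(4, 128, 128), dim=-1)  # stand-in for one attention layer's map
source_token = 10                                       # the "yellow" source token

weights = attn.mean(dim=0)[source_token]                # (128,) averaged over the 4 heads
top5 = torch.topk(weights, k=5).indices                 # the "green" attended tokens

for idx in top5.tolist():
    modality = "image" if idx < 64 else "LiDAR"
    local = idx if idx < 64 else idx - 64
    row, col = divmod(local, 8)                         # 8x8 grid of 32x32 patches
    print(f"token {idx}: {modality} patch at grid cell ({row}, {col})")
```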

rginjapan commented 1 year ago

I found I cannot run viz.py with the pretrained model you provided. Could you give me more details on how to run viz.py to visualize the attention maps? Thanks in advance!!

Sorry, I have solved it, thanks!

rginjapan commented 1 year ago

How should I define the vertical and horizontal anchors for my own data?

ap229997 commented 1 year ago

You can increase them, but you also need to consider the compute resources you have (computation increases quadratically with the anchor size).

rginjapan commented 1 year ago

Thanks for your reply. Do the anchors refer to the size of the grid feature map used for the patch tokens?

ap229997 commented 1 year ago

Each feature map is of size HxWxC which is reshaped into HW tokens of dimension C. Here, anchors refers to H and W.
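
A small illustrative sketch of that relationship (values are assumptions): the anchors set the token count, and self-attention over the concatenated image and LiDAR tokens scales quadratically with it.

```python
import torch

B, C = 2, 512
vert_anchors, horz_anchors = 8, 8

feat = torch.randn(B, C, vert_anchors, horz_anchors)  # one modality's pooled feature map
tokens = feat.flatten(2).permute(0, 2, 1)             # (B, H*W, C) = (2, 64, 512)

num_tokens = 2 * vert_anchors * horz_anchors          # image + LiDAR tokens
print(tokens.shape, num_tokens ** 2)                  # attention matrix has num_tokens^2 entries
```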

rginjapan commented 1 year ago

I noticed that in ViLT (https://arxiv.org/abs/2102.03334) they have a modal-type embedding to separate the textual and image embeddings. How do you separate LiDAR and RGB without a modal-type embedding? How do you know which outputs of the transformer fusion belong to LiDAR or RGB for the next convolution and fusion step?

ap229997 commented 1 year ago

In our implementation, the first half of the output belongs to RGB and the latter half to LiDAR. We did not try using a modal-type embedding.

rginjapan commented 1 year ago

Thanks for your reply! In your implementation of the positional embedding, you used: self.pos_emb = nn.Parameter(torch.zeros(1, (1 + 1) * seq_len * vert_anchors * horz_anchors, n_embd)). I noticed that BERT, which was among the first to use a learnable positional embedding instead of the sin/cos functions of the original Transformer, used nn.Embedding() to implement the learnable PE. Do you know what the difference is between nn.Parameter() and nn.Embedding()?

ap229997 commented 1 year ago

https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding
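
In short, both give a learnable positional encoding trained by backpropagation: nn.Parameter is a raw tensor added directly (as in the line quoted above), while nn.Embedding is a lookup table indexed by integer positions (BERT-style). A minimal sketch of the two:

```python
import torch
import torch.nn as nn

B, num_tokens, n_embd = 2, 128, 512
tokens = torch.randn(B, num_tokens, n_embd)

# 1) nn.Parameter: a raw learnable tensor, added directly to the tokens.
pos_emb = nn.Parameter(torch.zeros(1, num_tokens, n_embd))
out_a = tokens + pos_emb

# 2) nn.Embedding: a learnable table looked up by integer position indices.
pos_table = nn.Embedding(num_tokens, n_embd)
positions = torch.arange(num_tokens)                  # 0 .. 127
out_b = tokens + pos_table(positions).unsqueeze(0)    # (1, 128, 512) broadcast over the batch
```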

rginjapan commented 1 year ago

We observe that for 62.75% of the image tokens, all the top-5 attended tokens belong to the LiDAR and for 88.87%, at least one token in the top-5 attended tokens belongs to the LiDAR.

What is the meaning of image tokens belonging to LiDAR? How do you define which modality a token in the output belongs to?

ap229997 commented 1 year ago

In the cvpr2021 branch, the transformer takes 128 tokens as input: 64 from the image feature map and 64 from the LiDAR feature map. The output also has 128 tokens, and in our implementation the first half of the output belongs to RGB and the latter half to LiDAR. This is by design.

rginjapan commented 1 year ago

I see. The dimension of attn_map in the cvpr2021 branch is 24x4x128x128. In the paper you said top-5 attention weights for 24 tokens; how should I understand the top-5 among these 24 attention maps (each of size 4x128x128)?

ap229997 commented 1 year ago

It's the top-5 attention weights for each of the 24 tokens.

rginjapan commented 1 year ago

Where can I obtain the attention weight of each token?

ap229997 commented 1 year ago

Check viz.py for implementation details.

rginjapan commented 1 year ago

I see. The dimension of attn_map in the cvpr2021 branch is 24x4x128x128. In the paper you said top-5 attention weights for 24 tokens; how should I understand the top-5 among these 24 attention maps (each of size 4x128x128)?

@ap229997 Sorry for the stupid questions. I would like to confirm my understanding of the attention map visualization. Each token among all 128 tokens (64 + 64) generates an attention vector of size 1x128, and if the highest values fall within the first 64 entries they belong to the image, otherwise to the LiDAR. Am I correct?

If I am correct, the 128x128 should be an attention matrix containing the attention of all input tokens?

ap229997 commented 1 year ago

I see. The dimension of attn_map in the cvpr2021 branch is 24x4x128x128. In the paper you said top-5 attention weights for 24 tokens; how should I understand the top-5 among these 24 attention maps (each of size 4x128x128)?

@ap229997 Sorry for the stupid questions. I would like to confirm my understanding of the attention map visualization. Each token among all 128 tokens (64 + 64) generates an attention vector of size 1x128, and if the highest values fall within the first 64 entries they belong to the image, otherwise to the LiDAR. Am I correct?

Yes.

If I am correct, the 128x128 should be an attention matrix containing the attention of all input tokens?

Yes, '4' corresponds to the 4 attention heads in our architecture and the first dimension is the batch size.
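
To tie this together, a hedged sketch (random attention values, with head-averaging as an assumption) of how the cross-modal statistics quoted earlier, i.e. the fraction of image tokens whose top-5 attended tokens lie in the LiDAR half, could be computed from a 24x4x128x128 attention map:

```python
import torch

attn = torch.softmax(torch.randn(24, 4, 128, 128), dim=-1)  # stand-in attention maps
weights = attn.mean(dim=1)                                  # (24, 128, 128), averaged over heads

img_rows = weights[:, :64, :]                               # attention from the 64 image tokens
top5 = torch.topk(img_rows, k=5, dim=-1).indices            # (24, 64, 5) attended token indices
in_lidar = top5 >= 64                                       # True where the attended token is LiDAR

all_five = in_lidar.all(dim=-1).float().mean()              # fraction with all top-5 in LiDAR
at_least_one = in_lidar.any(dim=-1).float().mean()          # fraction with >= 1 in LiDAR
print(all_five.item(), at_least_one.item())
```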