autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

Questions about evaluation method and network initialization #58

Closed gitped closed 2 years ago

gitped commented 2 years ago

Hello,

1. In your TransFuser paper you report the mean and std dev over 9 runs of each method (3 training seeds, each seed evaluated 3 times). Does this mean that you changed the REPETITIONS value in the run_evaluation.sh script to REPETITIONS=3, or did you evaluate each model 3 separate times with REPETITIONS=1?

2. I understand that each model is trained with a random seed, leading to some variance in the results. If I wanted to make the models more reproducible and deterministic, where in the code can I set fixed training seeds or modify the network initialization method?

ap229997 commented 2 years ago
  1. We evaluated 3 separate times with REPETITIONS=1 so we could parallelize across multiple GPUs for faster evaluation; REPETITIONS=3 works as well.

  2. You can modify train.py of the desired model to include the following at the very beginning (you can set the seed value as per your choice):

    ```python
    import random

    import numpy as np
    import torch

    def seed(seed=42):
        np.random.seed(seed)
        random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Make cuDNN pick deterministic kernels (may be slower).
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    seed()
    ```
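As a quick sanity check of the idea, here is a minimal sketch using only `random` and NumPy (the `torch.manual_seed` calls behave analogously for PyTorch's RNGs): re-seeding before sampling makes the draws identical across runs.

```python
import random

import numpy as np

def seeded_draws(seed):
    # Re-seed both RNGs, then sample once from each.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand()

# Two runs with the same seed produce identical samples.
assert seeded_draws(42) == seeded_draws(42)
# A different seed gives different samples.
assert seeded_draws(42) != seeded_draws(7)
```

Note that full determinism on GPU additionally depends on the cuDNN flags above, since some CUDA kernels are nondeterministic by default.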
gitped commented 2 years ago

Thank you. Also, do you happen to know the time complexities of CILRS, AIM, and the 3 fusion models you test?

ap229997 commented 2 years ago

What exactly do you mean by time complexities (train time, eval time, or algorithmic complexity)?

gitped commented 2 years ago

I mean their algorithmic complexities in big O notation, such as O(n^2).

ap229997 commented 2 years ago

Let N be the number of tokens (for H×W grid data such as images or a LiDAR BEV, N = H*W).

  1. TransFuser: O(N^2), since standard quadratic self-attention is used (newer transformer variants like Linformer have better complexity and could be used instead)
  2. Geometric fusion: O(N), since the image-LiDAR correspondences can be precomputed (this also depends on how tensor indexing is implemented in PyTorch)
  3. Late fusion: O(N)
  4. CILRS, AIM: O(N), with a smaller constant factor than the fusion methods
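To make the O(N^2) point concrete, here is a toy NumPy sketch of single-head self-attention (an illustration, not the TransFuser implementation): the score matrix alone has shape (N, N), so both time and memory grow quadratically in the token count N = H*W.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention; x has shape (N, d)."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # (N, N) -> the O(N^2) term
    scores -= scores.max(axis=1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                           # (N, d)

h, w, d = 8, 8, 16
tokens = np.random.randn(h * w, d)   # N = H*W = 64 tokens
out = self_attention(tokens)
print(out.shape)  # (64, 16); the intermediate score matrix was (64, 64)
```

Doubling H and W quadruples N, so the attention cost grows sixteen-fold, which is why linear-attention variants become attractive at higher grid resolutions.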