aimotive / mm_training

Multimodal model training on aiMotive Dataset
https://openreview.net/forum?id=LW3bRLlY-SA
MIT License

How do we train a camera only model? #6

Open PaulSudarshan opened 4 months ago

PaulSudarshan commented 4 months ago

Could you let me know whether it is possible to train a camera-only model with the current repository? Also, could you provide a sample configuration for running a camera-only model?

TamasMatuszka commented 3 months ago

Hi @PaulSudarshan,

Yes, you can train camera-only models with the repository. However, the results will not be as good as the fusion models. You can find an example config for camera-only training here.
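
For orientation, a camera-only setup mainly means disabling the other modalities in the training configuration; a rough sketch of the idea (the key names below are illustrative only, the linked example config is the authoritative reference):

```python
# Hypothetical camera-only configuration sketch; refer to the example
# config linked above for the repository's actual keys and values.
config = dict(
    modalities=dict(
        use_camera=True,   # keep the camera branch
        use_lidar=False,   # drop the LiDAR branch and LiDAR-dependent losses
        use_radar=False,   # drop the radar branch
    ),
)
```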

PaulSudarshan commented 3 months ago

Thanks for providing the config, @TamasMatuszka. Is it possible to use the 2D bounding boxes available with the dataset to train a model for 2D object detection?

TamasMatuszka commented 3 months ago

In this repository, we did not use the 2D bounding boxes. However, 2D bounding boxes might be useful. We showed one possible application of 2D bounding boxes for 3D object detection in this paper.

PaulSudarshan commented 3 months ago

@TamasMatuszka How are the 2D bounding box annotations given in the dataset generated? Are they hand-annotated or generated through model inference?

TamasMatuszka commented 3 months ago

@PaulSudarshan We used an in-house 2D bounding box detector for generating the 2D annotations.

PaulSudarshan commented 3 months ago

@TamasMatuszka Is it possible to train a camera-only model without using explicit depth_labels from the LiDAR point cloud data? What changes are required to use the camera intrinsics/extrinsics for depth guidance instead of LiDAR?

TamasMatuszka commented 3 months ago

@PaulSudarshan you can either introduce a use_depth_loss flag in the config file and use it in the training step, or you can comment out those lines.
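
A minimal sketch of that flag-gating idea (the function and config names below are hypothetical, not the repository's exact code):

```python
# Hypothetical sketch: gate the LiDAR-derived depth supervision on a config flag.
def compute_total_loss(detection_loss, depth_loss, config):
    total_loss = detection_loss
    if getattr(config, "use_depth_loss", False):
        # add depth supervision only when explicitly enabled in the config
        total_loss = total_loss + depth_loss
    return total_loss
```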

PaulSudarshan commented 3 months ago

Okay, thanks, that makes sense. So, just for my understanding, keeping the depth loss at 0 would mean the network tries to implicitly learn depth without explicit depth labels, similar to networks like Lift-Splat-Shoot? How do you think the network's behaviour would change when trained with versus without depth supervision? @TamasMatuszka

TamasMatuszka commented 3 months ago

@PaulSudarshan You are right, keeping the depth loss at zero would mimic the training method of LSS. I assume the training time might be longer, but it would have to be tried.

PaulSudarshan commented 3 months ago

[screenshot of the code showing self.pass_depth_labels]

What does the parameter self.pass_depth_labels control? If it is set to False (which is the default in the code), does it mean there is no depth input to the model? In that case, while training a camera-only model, would there be no implicit depth input to the model? Can you please clarify? @TamasMatuszka

TamasMatuszka commented 3 months ago

@PaulSudarshan you are right, if self.pass_depth_labels is False, then there is no depth input. Hardcoding it as False seems to be a bug. Thanks for finding it! I already updated the code.
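
For reference, the intended wiring is roughly the following (the attribute name comes from the discussion above; the surrounding class and config access are illustrative):

```python
# Illustrative sketch of the fix: drive the attribute from the config
# instead of hardcoding it to False, so LiDAR depth labels reach the model
# whenever depth supervision is requested.
class DataAdaptor:
    def __init__(self, config):
        self.pass_depth_labels = bool(getattr(config, "use_depth_loss", False))
```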

PaulSudarshan commented 3 months ago

Thanks for the quick support and resolution. So does this change have any impact on the results published in the aiMotive paper? Does it mean all the baseline models were trained without depth input? @TamasMatuszka

TamasMatuszka commented 3 months ago

@PaulSudarshan Yes, it means that the models were trained without depth input. The models using camera inputs are extensions of the Lift-Splat-Shoot model. If you set the use_depth_loss config parameter to True and retrain these models, they will be extensions of the BEVDepth model.

PaulSudarshan commented 2 months ago

Thanks for the suggestion. I tried training a camera-only model with the depth_loss flag enabled; however, I am seeing an unusual spike in train_depth_loss after a certain epoch. Below is a screenshot of the sudden increase in train_depth_loss during the 6th epoch. I ran the training multiple times with the same configuration and observed the same behaviour, but on different epochs: first in the 11th epoch and then in the 6th. Can you explain the root cause behind this? Thanks.

[screenshot of the train_depth_loss curve showing the spike]

TamasMatuszka commented 2 months ago

@PaulSudarshan Unfortunately, I could not find the root cause of this instability during training, but we also experienced it in the case of camera-only training. One trick is to use gradient accumulation, which helps mitigate the problems caused by the small batch size during batch normalization. Another solution might be gradient clipping.
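
For illustration, both mitigations map to standard PyTorch Lightning Trainer arguments; a minimal sketch, assuming the training loop is driven by a Lightning Trainer (the values below are illustrative, not the repository's defaults):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=30,
    accumulate_grad_batches=4,  # emulate a 4x larger effective batch size
    gradient_clip_val=1.0,      # clip the gradient norm to damp loss spikes
)
# trainer.fit(model, datamodule=datamodule)
```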

We used this config to train a cam-only model without an explosion in the loss, though it was trained without depth supervision. Don't forget to add the use_depth_loss = True flag to the config! Hopefully, it will work for depth-supervised training too.

PaulSudarshan commented 2 months ago

Sure, thanks for the inputs. Is it possible to share sample loss values for any of your trainings (preferably a camera-only model), along with the total number of epochs the model was trained for?

TamasMatuszka commented 2 months ago

@PaulSudarshan I could not find any saved loss curve, but I collected the val loss per epoch from the checkpoints generated during a camera-only training:

epoch=0-step=5351-val_detection_loss=9.52.ckpt
epoch=1-step=10702-val_detection_loss=9.89.ckpt
epoch=2-step=16053-val_detection_loss=12.83.ckpt
epoch=3-step=21404-val_detection_loss=13.05.ckpt
epoch=4-step=26755-val_detection_loss=13.04.ckpt
epoch=5-step=32106-val_detection_loss=11.84.ckpt
epoch=6-step=37457-val_detection_loss=13.91.ckpt
epoch=7-step=42808-val_detection_loss=13.88.ckpt
epoch=8-step=48159-val_detection_loss=9.83.ckpt
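
The filenames above follow the PyTorch Lightning ModelCheckpoint naming pattern; a minimal sketch of such a callback, assuming the monitored metric is logged as val_detection_loss:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Sketch of a checkpoint callback producing names like the ones listed above.
checkpoint_cb = ModelCheckpoint(
    monitor="val_detection_loss",
    filename="{epoch}-{step}-{val_detection_loss:.2f}",
    save_top_k=-1,  # keep every epoch's checkpoint, as in the list above
)
```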

PaulSudarshan commented 2 months ago

Thanks. May I know the origin of the DepthNet architecture used in this repository? If you can provide some references for the architecture, that would be great. @TamasMatuszka

Another question is about the aiMotive dataset: as per my understanding, all the samples in the aiMotive catalogue (train, val) are annotated and there is no sweep (non-annotated) data, right? In that case, for the attached snippet of the LSSFPN forward function, "num_sweeps" will always be equal to 1, right? And the highlighted return statement would get executed every time for train/eval/inference?

[screenshot of the LSSFPN forward function]

TamasMatuszka commented 2 months ago

@PaulSudarshan, the DepthNet originates from the BEVDepth repository. You can read about DepthNet in the paper (Section 4).

Regarding your second question: you are right, all frames are annotated in the aiMotive dataset, and "num_sweeps" will always be 1.
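
To illustrate the point, the single-sweep path looks roughly like this (an illustrative sketch, not the repository's exact LSSFPN code):

```python
# Illustrative sketch: with the aiMotive dataset every frame is annotated,
# so num_sweeps == 1 and the key-frame branch returns on every call
# (train, eval, and inference alike).
def forward(self, sweep_imgs, mats_dict):
    num_sweeps = sweep_imgs.shape[1]
    key_frame_feature = self._forward_single_sweep(0, sweep_imgs[:, 0:1, ...], mats_dict)
    if num_sweeps == 1:
        return key_frame_feature  # always taken on the aiMotive dataset
    # multi-sweep aggregation would only be reached on datasets with extra sweeps
    ...
```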