Peterande / SAST

CVPR 2024: SAST: Scene Adaptive Sparse Transformer for Event-based Object Detection
MIT License

Resume training #2

Closed Zizzzzzzz closed 2 months ago

Zizzzzzzz commented 3 months ago

Thanks for your wonderful work! When I attempted to resume training from a checkpoint, I encountered this issue (error screenshot attached in the original issue).

Peterande commented 3 months ago

That's for logging sparsity only. Try commenting out that part directly, or reinitializing p_loss as in detection.py:

```python
def smooth_loss(self, step_loss, idx):
    if self.trainer.global_step == 0:
        self.iou_loss, self.conf_loss, self.cls_loss, self.p_loss = 0, 0, 0, 0
```
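If commenting out the sparsity logging is not desirable, a minimal sketch of the reinitialization route (my own illustration, not the repository's exact code; it assumes the only problem is that the running-loss attributes are created inside the `global_step == 0` branch, which a resumed run never enters) could look like this:

```python
# Minimal sketch, not the repository's exact code: guard the running-loss
# attributes so a resumed run, whose global_step starts above 0, still has
# them initialized before the sparsity logging reads self.p_loss.
def smooth_loss(self, step_loss, idx):
    if self.trainer.global_step == 0 or not hasattr(self, "p_loss"):
        self.iou_loss, self.conf_loss, self.cls_loss, self.p_loss = 0, 0, 0, 0
    # ... the rest of the smoothing / logging logic stays unchanged ...
```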

Zizzzzzzz commented 3 months ago

Thank you! I have another question regarding training resources. With the same dataset, the same hardware (two 3090 GPUs), and the same training settings (batch size 2 per GPU), I notice that SAST consumes about four times the GPU memory of RVT and takes six times longer to train for one epoch. What could be the reason for this? Are there significant differences between your training code and that of RVT?

Peterande commented 3 months ago

Please check the other settings in the configs. Have you made sure that the model size (RVT-B/S/T) and the dataset (Gen1/Gen4) are consistent?

Peterande commented 3 months ago

I checked my records from training RVT and SAST at the time, and there was no significant difference in training duration.

Peterande commented 3 months ago

One difference is that the total number of training steps was increased from 400K to 600K, but it probably takes around 40K+ steps to reach the performance reported in the paper.

Zizzzzzzz commented 3 months ago

> Please check the other settings in the configs. Have you made sure that the model size (RVT-B/S/T) and the dataset (Gen1/Gen4) are consistent?

The model size is the same. I just tested on the GEN1 dataset, and indeed, the training resource consumption of SAST and RVT does not show significant differences. However, when I use my own dataset (which has sparser events), the differences mentioned above occur.

Peterande commented 3 months ago

Such a big difference is unlikely to be caused by model hyperparameters. It would be best to carefully check other global settings in debug mode, and you can also use profiling tools to examine where the GPU memory is being used.
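For that last step, a generic sketch (my own illustration, not tied to the SAST training loop; `model`, `batch`, and `compute_loss` are placeholder names) using PyTorch's built-in CUDA memory counters can show how the peak usage splits between the forward and backward passes:

```python
import torch

# Generic sketch for locating GPU memory usage in a PyTorch training step.
# `model`, `batch`, and `compute_loss` are placeholders, not SAST code.
def profile_step(model, batch, compute_loss):
    torch.cuda.reset_peak_memory_stats()

    loss = compute_loss(model, batch)            # forward pass
    fwd_peak = torch.cuda.max_memory_allocated() / 2**20

    loss.backward()                              # backward pass
    bwd_peak = torch.cuda.max_memory_allocated() / 2**20

    print(f"peak after forward:  {fwd_peak:.0f} MiB")
    print(f"peak after backward: {bwd_peak:.0f} MiB")
    # A more detailed per-allocator breakdown:
    print(torch.cuda.memory_summary(abbreviated=True))
```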

Zizzzzzzz commented 2 months ago

Thanks, everything worked fine after I used the code I re-downloaded from GitHub. I probably changed something by accident!