Training baseline model doesn't converge

shauloron commented 3 months ago

Hi, I'm trying to train the model with the provided config (R50 256x704), using code pulled on July 10, 2024. I'm using 4 A100 GPUs with a total batch size of 48.

With the original LR of 6e-4, training diverges and grad_norm goes to NaN after ~20 epochs. When I lower the LR to 4e-4, the loss goes down and grad_norm is OK, but the final model has AP=0 for all classes after 100 epochs.

I read through all the open and closed issues and checked that the pretrained resnet50 weights load successfully. I'm using the default aug_config:

data_aug_conf = { "resize_lim": (0.40, 0.47), "final_dim": input_shape[::-1], "bot_pct_lim": (0.0, 0.0), "rot_lim": (-5.4, 5.4), "H": 900, "W": 1600, "rand_flip": True, "rot3d_range": [-0.3925, 0.3925], }

Any ideas why my training doesn't converge?
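One thing that may be worth ruling out for the divergence described above is a mismatch between the learning rate and the total batch size. Below is a minimal sketch in plain Python, assuming the common linear scaling rule (LR proportional to total batch size). The function names `scale_lr` and `first_non_finite` and all the numbers in the usage lines are hypothetical and only for illustration; the batch size that the repo's LR of 6e-4 was tuned for is not stated in this thread.

```python
import math


def scale_lr(ref_lr: float, ref_batch: int, new_batch: int) -> float:
    """Linear scaling rule: keep LR / total_batch_size constant.

    ref_lr and ref_batch are the values a config was tuned for
    (assumed known); new_batch is the total batch size actually used.
    """
    if ref_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return ref_lr * new_batch / ref_batch


def first_non_finite(values):
    """Return the index of the first NaN/inf in a sequence of logged
    values (e.g. grad_norm per iteration), or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


# Hypothetical numbers, for illustration only.
lr = scale_lr(ref_lr=6e-4, ref_batch=48, new_batch=24)
bad_step = first_non_finite([1.2, 0.9, float("nan"), 0.8])
print(lr, bad_step)
```

Locating the first non-finite grad_norm in the training log can help narrow down whether the blow-up is sudden (often a data or numerical issue) or follows a steady increase (often an LR issue).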

linxuewu commented 3 months ago

Can you obtain correct results (metrics and visualizations) by running inference with the released checkpoint?

shauloron commented 3 months ago

I can't seem to find any pretrained model, only pretrained resnet50 weights. Can you please point me to the pretrained weights you released? Thanks

shauloron commented 3 months ago

OK, found it now: ckpt. I'll check how it works out.

shauloron commented 3 months ago

I tried the checkpoint above and get different results from the ones in the tutorial notebook visualizations. To start, the anchor map doesn't look the same: [image]

And the detection results are partial: [image]

Any ideas?
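One possible cause of partial results from a released checkpoint is that only some of its parameter names match the model, so the rest stay at their initial values. Below is a minimal sketch in plain Python for comparing the two sets of names. The helper `compare_keys` and the example key names are hypothetical and not taken from the Sparse4D code. In PyTorch, the key lists would typically come from `model.state_dict().keys()` and from the keys of the loaded checkpoint's state dict.

```python
def compare_keys(model_keys, ckpt_keys, prefix=""):
    """Compare parameter names expected by a model with those in a checkpoint.

    prefix: an optional leading string (for example "module.") to strip
    from checkpoint keys before comparing.
    Returns a dict with the names missing from the checkpoint and the
    names in the checkpoint that the model does not expect.
    """
    stripped = {
        k[len(prefix):] if prefix and k.startswith(prefix) else k
        for k in ckpt_keys
    }
    expected = set(model_keys)
    return {
        "missing": sorted(expected - stripped),
        "unexpected": sorted(stripped - expected),
    }


# Hypothetical key names, for illustration only.
model_keys = ["backbone.conv1.weight", "head.anchor"]
ckpt_keys = ["module.backbone.conv1.weight", "module.head.instance_bank.anchor"]
report = compare_keys(model_keys, ckpt_keys, prefix="module.")
print(report)
```

If both lists come back empty, the weights are probably loading as intended and the difference is more likely in the data pipeline or the inference config.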

shauloron commented 3 months ago

Hi @linxuewu, I would really appreciate any ideas you may have on this one.

HorizonRobotics / Sparse4D

Training baseline model doesn't converge #75