PJLab-ADG / 3DTrans

An open-source codebase for exploring autonomous driving pre-training
https://bobrown.github.io/Team_3DTrans.github.io/
Apache License 2.0
585 stars 72 forks source link

KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer" #22

Closed RhythmOfTheRain-Byte closed 7 months ago

RhythmOfTheRain-Byte commented 10 months ago

Excuse me, when I run the command line for [Bi3D Adaptation stage 1: active source domain data], something occurred. The command line is as below:

bash scripts/ADA/dist_train_active_source.sh 2 --cfg_file ./cfgs/ADA/nuscenes-kitti/voxelrcnn/active_source.yaml --pretrained_model ***3DTrans/tools/cfgs/DA/nusc_kitti/source_only/voxel_rcnn_feat_3_vehi/default/ckpt/checkpoint_epoch_30.pth

When I do the training from the beginning, everything goes well. However, when I resume the training, this error occurs:

```
Traceback (most recent call last):
  File "train_active_source.py", line 272, in <module>
    main()
  File "train_active_source.py", line 193, in main
    lr_scheduler_discriminator, lr_warmup_scheduler_discriminator = build_scheduler(
  File "/home/hyh/Projects/3DTrans/tools/train_utils/optimization/__init__.py", line 55, in build_scheduler
    lr_scheduler = lr_sched.LambdaLR(optimizer, lr_lbmd, last_epoch=last_epoch)
  File "/home/hyh/anaconda3/envs/3Dtrans/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 203, in __init__
    super(LambdaLR, self).__init__(optimizer, last_epoch, verbose)
  File "/home/hyh/anaconda3/envs/3Dtrans/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 39, in __init__
    raise KeyError("param 'initial_lr' is not specified "
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
```

Does anyone know how to solve it? Thank you!
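For context, the check that raises this error can be sketched in pure Python (a paraphrase of the logic in PyTorch's `_LRScheduler.__init__`, not the exact source): when `last_epoch != -1` (i.e. resuming), every param group must already contain `'initial_lr'`, which is normally restored from a saved optimizer `state_dict`.

```python
def check_param_groups(param_groups, last_epoch):
    """Sketch of the initial_lr check PyTorch's LR schedulers perform.

    On a fresh run (last_epoch == -1) the scheduler records each
    group's starting LR as 'initial_lr'. On resume it instead expects
    'initial_lr' to already be present, and raises otherwise.
    """
    if last_epoch == -1:
        for group in param_groups:
            group.setdefault('initial_lr', group['lr'])
    else:
        for i, group in enumerate(param_groups):
            if 'initial_lr' not in group:
                raise KeyError(
                    "param 'initial_lr' is not specified "
                    f"in param_groups[{i}] when resuming an optimizer")
```

So the error means the resumed optimizer's param groups never got `'initial_lr'` back, which happens when the checkpoint did not store the optimizer state.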

BOBrown commented 10 months ago

@RhythmOfTheRain-Byte Hi, when you resume the training, you should check whether the ckpt you loaded contains the saved optimizer params. Because saving the optimizer state (which includes 'initial_lr') in the ckpt requires more storage space, we chose not to save this information. See https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_active_source_utils.py#L631

In order to solve this issue, you could refer to https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_utils.py#L134
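If restoring the full optimizer state is not an option, one common workaround (a sketch only, not the code at the linked line; `backfill_initial_lr` is a hypothetical helper name) is to backfill `'initial_lr'` on each param group before rebuilding the scheduler with `last_epoch != -1`:

```python
def backfill_initial_lr(optimizer):
    """Workaround sketch: if the loaded ckpt did not store the
    optimizer state_dict, copy each param group's current 'lr' into
    'initial_lr' so that LambdaLR(..., last_epoch=N) can be built
    without raising the KeyError."""
    for group in optimizer.param_groups:
        group.setdefault('initial_lr', group['lr'])
```

Note this assumes each group's current 'lr' still equals its starting LR; if the resumed optimizer carries a decayed LR, the reconstructed schedule will be off, which is why saving the full optimizer state_dict in the checkpoint is the more robust fix.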

RhythmOfTheRain-Byte commented 9 months ago

> @RhythmOfTheRain-Byte Hi, when you resume the training, you should check if the ckpt you loaded has saved optimizer params. Due to that saving optimizer including 'initial_lr' into the ckpt, requires more storage space, we choose not to save this information. See
> https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_active_source_utils.py#L631
> In order to solve this issue, you could refer to
> https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_utils.py#L134

Thank you for your reply, but I still don't know how to fix this issue. Should I just add the optimizer to checkpoint_state() as in 3DTrans/tools/train_utils/train_utils.py? I am new to 3DTrans, can you help me?

Besides, a question about training Bi3D Adaptation stage 1 (active source domain data): https://github.com/PJLab-ADG/3DTrans/blob/c3bf52ff0f4f46bd898dbd3014fbb65e1714f043/tools/train_utils/train_active_source_utils.py#L518

When I set OSS=True, this error occurs:

```
source_list = active_learning_utils.get_dataset_list(source_file_path, oss=False)
def get_target_list(target_pkl_file, oss):
    if oss == True:
        from petrel_client.client import Client
ModuleNotFoundError: No module named 'petrel_client'
```

Does anyone else know the reason?

sky-fly97 commented 7 months ago

1. Bi3D does not save the optimizer params by default, so if you want to resume the training, you need to save the optimizer params during training, e.g. `save_checkpoint(checkpoint_state(model, optimizer, epoch=trained_epoch, it=accumulated_iter_detector), filename=ckpt_name)`
2. The default OSS should be False; the purpose of this parameter is to load data from a remote server. If your data is local, do not set it to True.
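The fix in point 1 can be sketched as follows (modeled loosely on the `checkpoint_state` helper in tools/train_utils/train_utils.py, not copied from it): including `optimizer.state_dict()` in the ckpt dict is what lets a later resume restore `'initial_lr'` for the scheduler.

```python
def checkpoint_state(model=None, optimizer=None, epoch=None, it=None):
    """Sketch of a checkpoint dict that also stores the optimizer.

    The optimizer state_dict carries the param_groups (including
    'initial_lr'), so LambdaLR(..., last_epoch=N) can be rebuilt
    after loading this checkpoint.
    """
    return {
        'epoch': epoch,
        'it': it,
        'model_state': model.state_dict() if model is not None else None,
        'optimizer_state': optimizer.state_dict() if optimizer is not None else None,
    }
```

On resume, the loading side would then call `optimizer.load_state_dict(ckpt['optimizer_state'])` before building the scheduler.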

RhythmOfTheRain-Byte commented 7 months ago

Thanks for your reply, I know how to solve it.