Closed RhythmOfTheRain-Byte closed 7 months ago
@RhythmOfTheRain-Byte Hi, when you resume training, you should check whether the ckpt you loaded contains saved optimizer params. Because saving the optimizer state (which includes 'initial_lr') in the ckpt requires more storage space, we chose not to save this information. See https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_active_source_utils.py#L631
To work around this issue, you can refer to https://github.com/PJLab-ADG/3DTrans/blob/e90200c2dda297f4b5cdb4dae786fcf81ed006f2/tools/train_utils/train_utils.py#L134
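A minimal sketch of that workaround, under the assumption that the model, `base_lr`, and the lambda schedule below are placeholders (not the actual 3DTrans code): when the loaded ckpt has no optimizer state, `LambdaLR` with `last_epoch >= 0` expects an `'initial_lr'` key in every param group, so you can inject it by hand before building the scheduler:

```python
import torch

# Hypothetical stand-ins for the real model/config in 3DTrans.
model = torch.nn.Linear(4, 2)
base_lr = 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

last_epoch = 10  # epoch restored from the ckpt

# The loaded ckpt did not save optimizer state, so LambdaLR would raise
# KeyError("param 'initial_lr' is not specified ..."). Set it manually.
for group in optimizer.param_groups:
    group.setdefault('initial_lr', base_lr)

lr_lbmd = lambda cur_epoch: 0.9 ** cur_epoch  # placeholder decay schedule
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lbmd, last_epoch=last_epoch)
```

With `'initial_lr'` present, the scheduler constructs cleanly and resumes the decay schedule from the restored epoch.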
Thank you for your reply, but I still don't know how to fix this issue. Should I just add an optimizer to checkpoint_state() as in 3DTrans/tools/train_utils/train_utils.py? I'm new to 3DTrans, can you help me?
Besides, a question about training Bi3D Adaptation stage 1: active source domain data https://github.com/PJLab-ADG/3DTrans/blob/c3bf52ff0f4f46bd898dbd3014fbb65e1714f043/tools/train_utils/train_active_source_utils.py#L518
When I set OSS=True, an error occurs at
    source_list = active_learning_utils.get_dataset_list(source_file_path, oss=False)
    def get_target_list(target_pkl_file, oss):
        if oss == True:
            from petrel_client.client import Client
    ModuleNotFoundError: No module named "petrel_client"
Does anyone else know the reason?
1. Bi3D does not save the optimizer params by default, so if you want to resume training, you need to save the optimizer params yourself during training, e.g.:
    save_checkpoint(checkpoint_state(model, optimizer, epoch=trained_epoch, it=accumulated_iter_detector), filename=ckpt_name)
2. The default OSS should be False. The purpose of this parameter is to load data from a remote server; if your data is local, do not set it to True.
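The saving side of point 1 can be sketched like this. Note that `checkpoint_state` and `save_checkpoint` here are simplified stand-ins for the 3DTrans helpers, not their actual implementations; the point is just that the optimizer's `state_dict()` goes into the ckpt so `'initial_lr'` survives a resume:

```python
import torch

def checkpoint_state(model=None, optimizer=None, epoch=None, it=None):
    # Simplified stand-in for the 3DTrans helper: include the optimizer
    # state dict so its param_groups (with 'initial_lr' etc.) are restorable.
    model_state = model.state_dict() if model is not None else None
    optim_state = optimizer.state_dict() if optimizer is not None else None
    return {'epoch': epoch, 'it': it,
            'model_state': model_state, 'optimizer_state': optim_state}

def save_checkpoint(state, filename='checkpoint'):
    torch.save(state, f'{filename}.pth')

# Usage, mirroring the call above (toy model/optimizer for illustration):
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
save_checkpoint(checkpoint_state(model, optimizer, epoch=30, it=1000),
                filename='/tmp/checkpoint_epoch_30')
```

On resume, loading `optimizer_state` back with `optimizer.load_state_dict(...)` restores the param groups the scheduler needs.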
Thanks for your reply; now I know how to solve it.
Excuse me, when I run the command line in [Bi3D Adaptation stage 1: active source domain data], something occurred. The command line is as below:
    bash scripts/ADA/dist_train_active_source.sh 2 --cfg_file ./cfgs/ADA/nuscenes-kitti/voxelrcnn/active_source.yaml --pretrained_model ***3DTrans/tools/cfgs/DA/nusc_kitti/source_only/voxel_rcnn_feat_3_vehi/default/ckpt/checkpoint_epoch_30.pth
When I do the training from the beginning, everything goes well. However, when I resume the training, an error occurs:
Traceback (most recent call last):
  File "train_active_source.py", line 272, in <module>
    main()
  File "train_active_source.py", line 193, in main
    lr_scheduler_discriminator, lr_warmup_scheduler_discriminator = build_scheduler(
  File "/home/hyh/Projects/3DTrans/tools/train_utils/optimization/__init__.py", line 55, in build_scheduler
    lr_scheduler = lr_sched.LambdaLR(optimizer, lr_lbmd, last_epoch=last_epoch)
  File "/home/hyh/anaconda3/envs/3Dtrans/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 203, in __init__
    super(LambdaLR, self).__init__(optimizer, last_epoch, verbose)
  File "/home/hyh/anaconda3/envs/3Dtrans/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 39, in __init__
    raise KeyError("param 'initial_lr' is not specified "
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
Does anyone else know how to solve it?
Thank you!