jiheeyang opened this issue 2 years ago
Sorry about that!
It seems that the training has not converged; the loss is very large. When the "PA_MPJPE calculation failed" error occurs, it means that the training has completely failed.
Could you please share the configuration .yml file you used for training, especially the batch size you set? Did you start training from the pre-trained model?
This is the configuration .yml file for training. I changed GPUS, datasets, and sample_prob in configs/v1.yml:
```yaml
ARGS:
  tab: 'V1_hrnet'
  dataset: 'mpiinf,coco,mpii,lsp,muco,crowdpose'
  GPUS: 0,1,
  distributed_training: False
  model_version: 1
  pretrain: 'imagenet'
  match_preds_to_gts_for_supervision: True
  master_batch_size: -1
  val_batch_size: 16
  batch_size: 64
  nw: 4
  nw_eval: 2
  lr: 0.00005
  fine_tune: False
  fix_backbone_training_scratch: False
  eval: False
  supervise_global_rot: False
  model_return_loss: False
  collision_aware_centermap: True
  collision_factor: 0.2
  homogenize_pose_space: True
  shuffle_crop_mode: True
  shuffle_crop_ratio_2d: 0.1
  shuffle_crop_ratio_3d: 0.4
  merge_smpl_camera_head: False
  head_block_num: 2
  backbone: 'hrnet'
  centermap_size: 64
  centermap_conf_thresh: 0.2
  model_path: None

loss_weight:
  MPJPE: 200.
  PAMPJPE: 360.
  P_KP2D: 400.
  Pose: 80.
  Shape: 6.
  Prior: 1.6
  CenterMap: 160.

sample_prob:
  h36m: 0.0
  mpiinf: 0.16
  coco: 0.2
  lsp: 0.16
  mpii: 0.2
  muco: 0.14
  crowdpose: 0.14
```
I strongly recommend adjusting the sample_prob. Setting the sampling rate of each dataset should take its number of samples into account. Please reduce the sampling rate of lsp and mpii, which contain fewer samples, and of crowdpose and coco, which contain weaker annotations. The early stage of training still needs accurate 3D pose datasets, which is why I developed shuffle_crop_ratio_3d.
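For illustration only, here is a minimal sketch of deriving sample_prob values from relative per-dataset weights and renormalizing; the weights below are placeholders, not the real dataset sizes or any recommended values:

```python
# Minimal sketch: derive sample_prob from rough per-dataset weights, down-weighting
# the small / weakly-annotated 2D sets. The weights are illustrative placeholders.
dataset_weights = {
    'mpiinf': 4.0,      # accurate 3D pose, keep high in early training
    'muco': 3.0,
    'coco': 2.0,        # weak (2D-only) annotations, reduce
    'crowdpose': 1.0,
    'mpii': 0.5,        # small dataset, reduce
    'lsp': 0.5,
}
total = sum(dataset_weights.values())
sample_prob = {name: round(w / total, 3) for name, w in dataset_weights.items()}
print(sample_prob)  # paste these values into the sample_prob section of the .yml
```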
Thank you.
I reduced sample_prob for lsp and mpii. More epochs are processed now, but the same error still occurs.
I also modified part of the code in dataset/image_base.py: I changed 2 to 1 because the following error occurred:

```
File "/hdd1/YJH/romp_pytorch/ROMP/romp/lib/dataset/image_base.py", line 498, in test_dataset
    img_bsname = os.path.basename(r['imgpath'][inds])
IndexError: list index out of range
```

Is it related to this?
Please note that test_dataset is only executed when you want to test the data loading of a specific dataset; it is not executed during formal usage such as training, testing, or evaluation. The batch size defined here decides the length of the list being indexed.
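For reference, a minimal sketch of what such a debugging helper does (this is not the exact ROMP implementation; the function shape and key names are assumptions): it pulls one batch from the dataloader and prints the image paths, so the IndexError simply means the index went past the batch length.

```python
# Sketch of a test_dataset-style debugging loop (not the exact ROMP code).
# Iterating over len(r['imgpath']) keeps the index within the real batch length,
# whatever batch size is configured.
import os
from torch.utils.data import DataLoader

def test_dataset(dataset, batch_size=2):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    r = next(iter(loader))
    for inds in range(len(r['imgpath'])):
        print(os.path.basename(r['imgpath'][inds]))
```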
my_V1_train_from_scratch_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log
hrnet_cm64_my_V1_train_from_scratch_hrnet.yml.log
my_v1.yml.log
My training from scratch also failed. I only modified two items: adjust_lr_factor: 1 and epoch: 200. What caused the failure? What ideas or steps could help avoid this situation? Does it happen often? Is this why you provide the pretrained backbone (already trained on 2D pose datasets)?
From the log, it looks like an abnormal loss caused a gradient explosion. On my side, this only appeared when I was testing training while preparing the pretrained model; reloading an intermediate checkpoint and continuing training fixed it. It shows up in the early stage of training, and I have not studied in detail what exactly causes it. However, if you use the pretrained model and skip the basic feature-building stage, the problem does not occur.
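One common mitigation, sketched below with an assumed max_norm value (this is not in the current ROMP code, and the tiny model/loss are stand-ins), is to clip the global gradient norm before each optimizer step:

```python
# Sketch: clip the global gradient norm so a single abnormal loss value
# cannot blow up the weights. max_norm=1.0 is an assumed, tunable value.
import torch

model = torch.nn.Linear(10, 1)                     # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)   # stand-in for the real loss

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```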
I tried reloading an intermediate checkpoint and continuing training about 10 times, but two problems still keep appearing: the NaN problem mentioned earlier, and the "PA_MPJPE calculation failed" problem. Is there a better way to adjust things so that training can continue normally?
V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2250nan.log
V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2400nan.log
V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2950nan.log
V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_750nan.log
V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_250nan_vscode.log
V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after650_PA_MPJPE_failed.log
V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after350_PA_MPJPE_failed.log
V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0.log_350nan.log
It looks like there is something systematic going on here. If the NaN were only occasional, it should not reappear so quickly after reloading an intermediate checkpoint and continuing training.
Yes, your logs also show that something is wrong. These are all finetunes of checkpoints trained from scratch, right? Training from the pretrained model does not have this problem. In fact, when I trained from scratch, I also trained the 2D pose heatmaps and identity maps, and learning the 2D pose information at the same time avoided this issue. If this is too much trouble, you can try starting training from HigherHRNet's HRNet-32 pretraining, for example this one, which has also been trained on 2D pose. Ruling out interference from other factors, the 2D pose features seem to be key to building the basic features, so starting from the HigherHRNet pretraining should avoid this problem. I am sorry about this bug; when I open-sourced the code I only verified that training from the pretrained model works, and training from scratch took too long under the deadline, so I did not test it. The only difference from my original training is the 2D pose pretraining, and I will verify this with experiments as soon as possible!
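Loading that checkpoint into the backbone could look roughly like the sketch below; the file name, the model.backbone attribute, and the key layout are assumptions, so adjust them to the checkpoint you actually download:

```python
# Rough sketch of initializing an HRNet-32 backbone from a HigherHRNet checkpoint.
# The checkpoint name and model.backbone attribute are assumptions about naming.
import torch

# model: the network with an HRNet-32 backbone module (assumed to exist in scope)
ckpt = torch.load('pose_higher_hrnet_w32_512.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)
# strict=False keeps the 2D-pose-pretrained backbone weights while the
# task-specific heads stay randomly initialized.
missing, unexpected = model.backbone.load_state_dict(state_dict, strict=False)
print(len(missing), 'missing keys,', len(unexpected), 'unexpected keys')
```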
I see. One more question: after NaN first appears, the NaN loss should produce NaN gradients and thus corrupt the parameter values, so why do some subsequent steps still show normal, non-NaN loss values?
I don't quite understand: isn't nan/(nan/1000.) still nan?
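For example, a quick standalone check (not from the codebase) shows that rescaling a NaN loss still yields NaN:

```python
# Quick check: dividing NaN by anything, or rescaling it, is still NaN.
import torch

x = torch.tensor(float('nan'))
print(x / (x / 1000.))           # tensor(nan)
print(torch.isnan(x / 1000.))    # tensor(True)
```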
Yes, you are right. Maybe we can add something like this to avoid gradient collapse:
```python
# Zero out NaN loss terms, then rescale any term above args().loss_thresh so its
# magnitude is clamped to the threshold before summing.
loss_list = [torch.zeros_like(value) if torch.isnan(value) else value for key, value in loss_dict.items()]
loss = sum([value if value.item() < args().loss_thresh else value / (value.item() / args().loss_thresh) for value in loss_list])
```
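Another option (just a sketch, not something already in the repo) is to skip the optimizer step whenever any gradient is non-finite, so NaN never reaches the weights in the first place; model and optimizer here refer to the objects in the training loop:

```python
# Sketch: after loss.backward(), skip the update if any gradient is non-finite,
# so a NaN loss cannot corrupt the weights.
grads_finite = all(p.grad is None or torch.isfinite(p.grad).all() for p in model.parameters())
if grads_finite:
    optimizer.step()
else:
    optimizer.zero_grad()   # drop this step entirely
```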
Hi, I have difficulty training on the 6 datasets (mpiinf, coco, mpii, lsp, muco, crowdpose). The training code runs successfully for a while (no more than a few epochs) and then hits this error in the training log file. Can you give me a solution for this?
At epoch 6, the losses are still printed, but the "INFO:root:Evaluation on pw3d" values are NaN.
At epoch 7, the log shows "PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55)".
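Would a guard around the SVD call, roughly like the sketch below (not from the repo, just an idea), be a reasonable workaround?

```python
# Rough idea (not the ROMP implementation): if the CUDA SVD inside the Procrustes
# alignment fails to converge, retry on CPU in float64, which is more robust for
# ill-conditioned matrices.
import torch

def safe_svd(mat):
    try:
        return torch.svd(mat)
    except RuntimeError:
        U, S, V = torch.svd(mat.double().cpu())
        return U.to(mat), S.to(mat), V.to(mat)
```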