Arthur151 / ROMP

Monocular, One-stage, Regression of Multiple 3D People and their 3D positions & trajectories in camera & global coordinates. ROMP[ICCV21], BEV[CVPR22], TRACE[CVPR2023]
https://www.yusun.work/
Apache License 2.0

PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55). #136

Open jiheeyang opened 2 years ago

jiheeyang commented 2 years ago

Hi, I have difficulty training on the 6 datasets (mpiinf, coco, mpii, lsp, muco, crowdpose). The training code runs successfully for a while (no more than a few epochs) and then this error appears in the training log file. Can you give me a solution for this?

Epoch 6

At epoch 6 the losses are still printed, but "INFO:root:Evaluation on pw3d" reports NaN values.


Epoch 7

At epoch 7 the log shows: "PA_MPJPE calculation failed! svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 55)".

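For context on the error itself: PA-MPJPE performs a Procrustes alignment whose optimal rotation comes from an SVD of the joint covariance matrix, so non-finite or degenerate predictions make svd_cuda fail exactly like this. A minimal sketch of the computation (generic Procrustes, not ROMP's exact implementation):

import torch

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE for one sample of J x 3 joints (illustrative only)."""
    # Center both joint sets.
    X = pred - pred.mean(0, keepdim=True)
    Y = gt - gt.mean(0, keepdim=True)
    # Optimal rotation from the SVD of the 3x3 covariance matrix. If `pred`
    # contains NaN/Inf after a gradient explosion, this SVD is what raises
    # "svd_cuda: ... failed to converge". Reflection handling is omitted.
    U, S, Vh = torch.linalg.svd(Y.T @ X)
    R = U @ Vh
    # Optimal scale, then align the prediction to the ground truth.
    scale = S.sum() / (X ** 2).sum()
    aligned = scale * X @ R.T + gt.mean(0, keepdim=True)
    return (aligned - gt).norm(dim=-1).mean()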

Arthur151 commented 2 years ago

Sorry about that! It seems the training did not converge; the loss is quite large. When the "PA_MPJPE calculation failed" error occurs, it means the training has completely failed.

Could you please share the configuration .yml file you used for training, especially the batch size you set? Did you start training from the pre-trained model?

jiheeyang commented 2 years ago

This is the configuration .yml file for training. I changed GPUS, datasets, and sample_prob in configs/v1.yml.

ARGS:
 tab: 'V1_hrnet' 
 dataset: 'mpiinf,coco,mpii,lsp,muco,crowdpose'
 GPUS: 0,1,
 distributed_training: False
 model_version: 1
 pretrain: 'imagenet'
 match_preds_to_gts_for_supervision: True

 master_batch_size: -1
 val_batch_size: 16
 batch_size: 64
 nw: 4
 nw_eval: 2
 lr: 0.00005

 fine_tune: False
 fix_backbone_training_scratch: False
 eval: False
 supervise_global_rot: False

 model_return_loss: False
 collision_aware_centermap: True
 collision_factor: 0.2
 homogenize_pose_space: True
 shuffle_crop_mode: True
 shuffle_crop_ratio_2d: 0.1
 shuffle_crop_ratio_3d: 0.4

 merge_smpl_camera_head: False
 head_block_num: 2

 backbone: 'hrnet'
 centermap_size: 64
 centermap_conf_thresh: 0.2

 model_path: None

loss_weight:
  MPJPE: 200.
  PAMPJPE: 360.
  P_KP2D: 400.
  Pose: 80.
  Shape: 6.
  Prior: 1.6
  CenterMap: 160.

sample_prob:
 h36m: 0.0
 mpiinf: 0.16
 coco: 0.2
 lsp: 0.16
 mpii: 0.2
 muco: 0.14
 crowdpose: 0.14

Arthur151 commented 2 years ago

I strongly recommend adjusting the sample_prob. The sampling rate of each dataset should take its number of samples into account. Please reduce the sampling rates of lsp and mpii, which contain fewer samples, and of crowdpose and coco, which have weaker annotations. The early stage of training still needs accurate 3D pose datasets, which is why I developed shuffle_crop_ratio_3d.
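To illustrate what sample_prob controls, here is a minimal sketch of probability-weighted dataset mixing; it is not ROMP's actual loader, and the probabilities are just example values reflecting the advice above.

import numpy as np

# Example per-dataset sampling probabilities: more weight on datasets with
# accurate 3D annotations, less on the small 2D sets (lsp, mpii) and on the
# weakly annotated ones (coco, crowdpose). Values are illustrative only.
sample_prob = {'mpiinf': 0.30, 'muco': 0.22, 'coco': 0.16,
               'crowdpose': 0.12, 'mpii': 0.10, 'lsp': 0.10}

datasets, probs = zip(*sample_prob.items())
# Each training sample is drawn from a dataset chosen with these probabilities.
print(np.random.choice(datasets, size=8, p=probs))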

jiheeyang commented 2 years ago

Thank you. I reduced sample_prob for lsp and mpii. More epochs now complete, but the same error still occurs. I also modified part of dataset/image_base.py.
I changed a 2 to a 1 because the following error occurred:

File "/hdd1/YJH/romp_pytorch/ROMP/romp/lib/dataset/image_base.py", line 498, in test_dataset
    img_bsname = os.path.basename(r['imgpath'][inds])
IndexError: list index out of range


Is it related to this?

Arthur151 commented 2 years ago

Please note that test_dataset is only executed when you want to test the data loading of a specific dataset; it is not executed during formal usage such as training, testing, or evaluation. The batch size defined there determines the length of the list.
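A rough illustration of why that list length depends on the batch size set for test_dataset (hypothetical values, not the actual image_base.py code):

# test_dataset iterates over the samples of one loaded batch; if the loop
# expects 2 samples per batch but the loader returns only 1, indexing the
# second entry fails with "IndexError: list index out of range".
batch = {'imgpath': ['frame_000.jpg']}   # batch of size 1 (hypothetical)
for inds in range(2):                    # assumes batch size >= 2
    print(batch['imgpath'][inds])        # raises IndexError at inds == 1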

liangwx commented 2 years ago

my_V1_train_from_scratch_hrnet_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1,2,3.log hrnet_cm64_my_V1_train_from_scratch_hrnet.yml.log my_v1.yml.log

My training from scratch also failed, and I only changed two settings: adjust_lr_factor: 1 and epoch: 200. What is the cause of the failure? What ideas or steps could help avoid this situation? Does it happen often? Is this why you provide the pretrained backbone (already trained on a 2D pose dataset)?

Arthur151 commented 2 years ago

Judging from the log, an abnormal loss caused a gradient explosion. On my side this only appeared while test-training for the pre-trained model; reloading an intermediate checkpoint and continuing training fixed it. This kind of problem shows up in the early stage of training, and I have not studied the exact cause in detail. But if you use the pretrained model and skip the basic feature-building stage, the problem does not occur.
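For reference, reloading an intermediate checkpoint and resuming in plain PyTorch looks roughly like this; the checkpoint path and key names are assumptions, not ROMP's exact format.

import torch
import torch.nn as nn

# Hypothetical model, optimizer, and checkpoint path, for illustration only.
model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

checkpoint = torch.load('checkpoints/epoch_3.pkl', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'], strict=False)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint.get('epoch', 0) + 1
# ...continue the training loop from start_epoch onward.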

liangwx commented 2 years ago

I tried reloading intermediate checkpoints and continuing training about 10 times, but two problems still appear frequently: the NaN issue mentioned earlier, and the "PA_MPJPE calculation failed" error. Is there a better way to adjust things so that training can continue normally?

V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2250nan.log V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2400nan.log V1_hrnet_continue_train_from_epoch_3_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_2950nan.log

V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_750nan.log V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_250nan_vscode.log V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after650_PA_MPJPE_failed.log V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0,1.log_after350_PA_MPJPE_failed.log V1_hrnet_continue_train_from_epoch4_h36m,mpiinf,coco,mpii,lsp,muco,crowdpose_g0.log_350nan.log

liangwx commented 2 years ago

It looks like there is something systematic behind this. If the NaN were only occasional, it should not reappear so quickly after reloading an intermediate checkpoint and continuing training.

Arthur151 commented 2 years ago

Yes, your logs also show that something is wrong. These are all fine-tunes of train-from-scratch checkpoints, right? With the pretrained model this problem does not occur. In fact, when I trained from scratch I also trained 2D pose heatmaps and an identity map, and when the 2D pose information was learned at the same time this problem never appeared. If it is really troublesome, you can try starting from HigherHRNet's HRNet-32 pretraining, for example this one; that model has also been trained on 2D pose. Ruling out other factors, the 2D pose features seem to be crucial for building the basic representation, so starting from HigherHRNet's pretraining should avoid this problem. I am sorry about this bug; when open-sourcing I only verified that training from the pretrained model works, and training from scratch took too long, so under the deadline I did not test it. The difference from my original training is exactly the 2D pose pre-training, and I will re-run experiments to verify this as soon as possible!
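A rough sketch of initializing a backbone from a 2D-pose-pretrained checkpoint such as HigherHRNet's HRNet-W32; the file name, key names, and the toy backbone are assumptions, and the point is only the partial loading with strict=False.

import torch
import torch.nn as nn

# Hypothetical backbone and checkpoint path, for illustration only.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
state = torch.load('pose_higher_hrnet_w32_512.pth', map_location='cpu')
state = state.get('state_dict', state)   # some releases nest the weights
# Drop keys belonging to the 2D pose head (key prefix is an assumption);
# keep only the backbone weights.
state = {k: v for k, v in state.items() if not k.startswith('final_layers')}
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(f'missing: {len(missing)}, unexpected: {len(unexpected)}')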

liangwx commented 2 years ago

I see. One more question: after NaN first appears, the NaN loss should give NaN gradients and therefore corrupt the parameters with abnormal values, so why do some of the following steps still report normal, non-NaN loss values?

Arthur151 commented 2 years ago

I guess that it might be this line.

liangwx commented 2 years ago

I don't quite follow; isn't nan/(nan/1000.) still nan?
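A quick check in Python confirms the point:

import math

nan = float('nan')
print(nan / (nan / 1000.0))       # nan -- rescaling a NaN loss still yields NaN
print(math.isnan(nan / 1000.0))   # True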

Arthur151 commented 2 years ago

Yes, you are right. Maybe we can add something like this to avoid gradient collapse:

# Replace NaN loss terms with a zero tensor so the later .item() calls still work.
loss_list = [torch.zeros_like(value) if torch.isnan(value).any() else value for value in loss_dict.values()]
# Rescale any loss term that exceeds args().loss_thresh before summing.
loss = sum([value if value.item() < args().loss_thresh else value / (value.item() / args().loss_thresh) for value in loss_list])
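Beyond rescaling, a common complementary guard is to skip any step whose total loss is non-finite and to clip gradients. A generic sketch, where loss_thresh, compute_losses, the model, and the optimizer are assumptions rather than ROMP's trainer:

import torch

def training_step(model, optimizer, batch, compute_losses, loss_thresh=1000.0):
    """Guard a single update against NaN/exploding losses (illustrative only)."""
    loss_dict = compute_losses(model, batch)
    # Drop non-finite loss terms and rescale very large ones before summing.
    loss_list = [v for v in loss_dict.values() if torch.isfinite(v).all()]
    loss = sum(v if v.item() < loss_thresh else v / (v.item() / loss_thresh)
               for v in loss_list)
    if not loss_list or not torch.isfinite(loss):
        return None                      # skip this step entirely
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to limit the damage from a single bad batch.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.detach()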